logoalt Hacker News

yorwbalast Saturday at 1:45 PM0 repliesview on HN

They seem to have used a fixed input resolution for each model, so the learnable 1D position embeddings are equivalent to learnable 2D position embeddings where every grid position gets its own embedding. It's when different images may have a different number of tokens per row that the correspondence between 1D index and 2D position gets broken and a 2D-aware position embedding can be expected to produce different results.