(2020) https://arxiv.org/abs/2010.11929 : An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
(2021) https://arxiv.org/abs/2103.13915 : An Image is Worth 16x16 Words, What is a Video Worth?
(2024) https://arxiv.org/abs/2406.07550 : An Image is Worth 32 Tokens for Reconstruction and Generation
Those are indeed 3 papers.