PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen, Xiaohua Zhai
2024-10-11
Abstract: PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses how to build a versatile and efficient Vision-Language Model (VLM), PaliGemma. PaliGemma is an open model that combines the SigLIP-So400m vision encoder with the Gemma-2B language model and is designed to perform well on many tasks after transfer (fine-tuning). Specifically, the goals of the paper include:

1. **Constructing a versatile base model**: PaliGemma is intended to perform well across a wide range of open-world tasks, including standard VLM benchmarks, remote-sensing VQA, counting, video captioning, and question answering.
2. **Optimizing model performance**: Despite having fewer than 3 billion parameters, PaliGemma can rival much larger models such as PaLI-X and PaLM-E. The paper relies on carefully designed pre-training and fine-tuning strategies to keep the model efficient and accurate across tasks.
3. **Exploring pre-training and fine-tuning strategies**: The paper describes its training recipe stage by stage: unimodal pre-training, multimodal pre-training, resolution increase, and task transfer. These stages are meant to adapt the model to different task requirements and to make it work well in practical applications.
4. **Validating the model's transferability**: The paper fine-tunes and evaluates PaliGemma on more than 30 academic benchmarks. The results show strong performance across a variety of tasks, particularly on high-resolution tasks.

Together, these goals aim to provide an efficient and versatile base model for research on and applications of vision-language models.
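The training stages named above (unimodal pre-training, multimodal pre-training, resolution increase, task transfer) can be pictured as an ordered schedule. The following is only an illustrative sketch: the `Stage` class, the `STAGES` tuple, and the `stage_order` helper are hypothetical names introduced here, the `multimodal` flag is an interpretation of the stage names, and none of this reflects the authors' actual training code or hyperparameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of a staged training schedule (hypothetical structure)."""
    name: str
    multimodal: bool  # assumption: True when image and text are trained jointly


# Stage names are taken from the summary above; the flags are an
# interpretation of those names, not settings reported in the paper.
STAGES = (
    Stage("unimodal pre-training", multimodal=False),
    Stage("multimodal pre-training", multimodal=True),
    Stage("resolution increase", multimodal=True),
    Stage("task transfer", multimodal=True),
)


def stage_order(stages=STAGES):
    """Return the stage names in the order they would be applied."""
    return [stage.name for stage in stages]


if __name__ == "__main__":
    for index, stage in enumerate(STAGES):
        kind = "multimodal" if stage.multimodal else "unimodal"
        print(f"Stage {index}: {stage.name} ({kind})")
```

Keeping the schedule as plain data makes the ordering explicit: each later stage starts from the result of the previous one, and only the final stage is specific to a downstream task.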