PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias Bauer, Matko Bošnjak, Xi Chen, Matthias Minderer, Paul Voigtlaender, Ioana Bica, Ivana Balazevic, Joan Puigcerver, Pinelopi Papalampidi, Olivier Henaff, Xi Xiong, Radu Soricut, Jeremiah Harmsen, Xiaohua Zhai

2024-10-11

Abstract: PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The main problem this paper attempts to address is the construction of a versatile and efficient Vision-Language Model (VLM), namely PaliGemma. PaliGemma is an open-source model based on the SigLIP vision encoder and the Gemma-2B language model, designed to excel in various tasks through transfer learning. Specifically, the goals of the paper include:

1. **Constructing a versatile foundational model**: PaliGemma is designed as a versatile foundational model capable of performing well in a wide range of open-world tasks, including standard VLM benchmarks, remote sensing VQA, counting tasks, video captioning, and question answering.
2. **Optimizing model performance**: Despite having fewer than 3 billion parameters, PaliGemma's performance can rival that of larger models such as PaLI-X and PaLM-E. The paper ensures the model's efficiency and accuracy across multiple tasks through carefully designed pre-training and fine-tuning strategies.
3. **Exploring pre-training and fine-tuning strategies**: The paper discusses in detail the pre-training strategies at different stages, including unimodal pre-training, multimodal pre-training, resolution increase, and task transfer. These strategies aim to better adapt the model to different task requirements and perform well in practical applications.
4. **Validating the model's transferability**: To validate PaliGemma's transferability, the paper fine-tunes and evaluates it on over 30 academic benchmarks. The results show that PaliGemma achieves good performance across various tasks, particularly excelling in high-resolution tasks.

Through these goals, the paper aims to provide a new, efficient, and versatile tool for the research and application of vision-language models.
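To make the staged recipe in item 3 concrete, here is a minimal Python sketch that lists the four stages by the names used above and estimates how the number of image tokens grows when the input resolution is increased. The stage names come from the summary; the patch size of 14 and the example resolutions (224, 448, 896) are assumptions chosen for illustration and are not stated in the text above. The helper `num_image_tokens` is hypothetical and is not part of any released PaliGemma code.

```python
# Stage names as listed in the summary above, in order.
STAGES = [
    "unimodal pre-training",
    "multimodal pre-training",
    "resolution increase",
    "task transfer",
]


def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Count patch tokens for a square image fed to a ViT-style encoder.

    Assumes non-overlapping square patches and no extra tokens (such as a
    class token). The default patch_size of 14 is an illustrative assumption.
    """
    if resolution <= 0 or patch_size <= 0:
        raise ValueError("resolution and patch_size must be positive")
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    for index, name in enumerate(STAGES):
        print(f"Stage {index}: {name}")
    # Example resolutions are assumptions, chosen to show the quadratic growth.
    for resolution in (224, 448, 896):
        print(resolution, num_image_tokens(resolution))
```

Under these assumptions, doubling the resolution quadruples the number of image tokens the language model has to process, which is one plausible reason to treat the resolution increase as its own stage rather than training at high resolution from the start.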


PaliGemma 2: A Family of Versatile VLMs for Transfer

LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Gemma: Open Models Based on Gemini Research and Technology

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

Gemma 2: Improving Open Language Models at a Practical Size

CodeGemma: Open Code Models Based on Gemma

Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases

Gemini: A Family of Highly Capable Multimodal Models

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Beyond Human Vision: The Role of Large Vision Language Models in Microscope Image Analysis

CogVLM2: Visual Language Models for Image and Video Understanding

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding

Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld

Multitask Multimodal Prompted Training for Interactive Embodied Task Completion

Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models