Abstract: Large Language Models (LLMs) have brought the ambitious quest for generalist agents significantly closer to reality. A key hurdle for building such general models is the diversity and heterogeneity of tasks and modalities. A promising solution is unification, allowing the support of a myriad of tasks and modalities within one unified framework. While a few large models (e.g., Flamingo (Alayrac et al., 2022)), trained on massive datasets, can support more than two modalities, current small to mid-scale unified models are still limited to two modalities, usually image-text or video-text. The question that we ask is: is it possible to efficiently build a unified model that can support all modalities? To answer this, we propose UnIVAL, a step further towards this ambitious goal. Without relying on fancy dataset sizes or models with billions of parameters, the ~0.25B-parameter UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model. Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning. UnIVAL shows performance competitive with existing state-of-the-art approaches across image-text and video-text tasks. The feature representations learned from the image-text and video-text modalities allow the model to achieve competitive performance when finetuned on audio-text tasks, despite not being pretrained on audio. Thanks to the unified model, we propose a novel study on multimodal model merging via weight interpolation of models trained on different multimodal tasks, showing its benefits, in particular for out-of-distribution generalization. Finally, we motivate unification by showing the synergy between tasks. The model weights and code are released here: https://github.com/mshukor/UnIVAL.

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks

Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Unified Vision-Language Pre-Training for Image Captioning and VQA

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

UNITER: UNiversal Image-TExt Representation Learning

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training

X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks

Multimodal Pre-training Method for Vision-language Understanding and Generation

From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval

Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

UnIVAL: Unified Model for Image, Video, Audio and Language Tasks