Abstract:In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models like \cite{flamingo, palme}, leveraging the long-context capability of Large Language Models, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introduce the contrastive loss into text generation models, presenting the COntrastive-Streamlined MultimOdal framework (\ModelName), strategically partitioning the language model into dedicated unimodal text processing and adept multimodal data handling components. \ModelName, our unified framework, merges unimodal and multimodal elements, enhancing model performance for tasks involving textual and visual data while notably reducing learnable parameters. However, these models demand extensive long-text datasets, yet the availability of high-quality long-text video datasets remains limited. To bridge this gap, this work introduces \VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions, marking a significant step forward. Demonstrating its impact, we illustrate how \VideoDatasetName{} enhances model performance in image-text tasks. With 34% learnable parameters and utilizing 72\% of the available data, our model demonstrates significant superiority over OpenFlamingo~\cite{openflamingo}. For instance, in the 4-shot flickr captioning task, performance notably improves from 57.2% to 65.\%. The contributions of \ModelName{} and \VideoDatasetName{} are underscored by notable performance gains across 14 diverse downstream datasets encompassing both image-text and video-text tasks.

What problem does this paper attempt to address?

The main problems that this paper attempts to solve are the limitations of existing vision - language pre - training models when dealing with long - text input and multimodal data. Specifically: 1. **Long - text processing ability**: Existing autoregressive vision - language models (such as Flamingo and Palm - E) perform well in few - shot text generation tasks, but face challenges in alignment tasks. These models usually rely on the long - context ability of large - language models, but are not always able to effectively handle long - text input. 2. **Multimodal data processing**: Existing models have deficiencies when processing multimodal data (i.e., data containing both images and text simultaneously). In particular, these models perform less well in classification tasks than contrastive - learning models (such as CLIP and CoCa). 3. **Lack of high - quality datasets**: Although existing multimodal datasets (such as CC3M, LAION400M and DataComp1B) provide a large number of image - text pairs, these datasets are mainly composed of short texts and are of uneven quality. High - quality long - text video datasets are relatively scarce, which limits the performance improvement of models in complex tasks. To solve these problems, the paper introduces the COntrastive - Streamlined MultimOdal framework (CosMo) and proposes a new high - quality video - text dataset Howto - Interlink7M. The following are the main contributions of the paper: 1. **New architecture CosMo**: By dividing the large - language model (LLM) into specialized unimodal text - processing components and multimodal data - processing components and introducing contrastive loss, the performance of the model in processing long - text and multimodal data is improved. CosMo significantly improves the model's performance in multiple downstream tasks while reducing the learnable parameters. 2. **High - quality dataset Howto - Interlink7M**: This dataset is extracted from the Howto100M dataset, and detailed video annotations are generated by GPT - 4, including automatic speech recognition (ASR), subtitles and dense subtitles. Howto - Interlink7M not only provides longer and more detailed text descriptions, but also enhances the alignment between video content and text. 3. **Performance improvement**: Experimental results show that CosMo performs well in multiple image - text and video - text benchmark tests. In particular, in the 4 - shot Flickr captioning task, the performance is improved from 57.2% to 65.1%. Compared with OpenFlamingo, CosMo has achieved significant performance advantages while using fewer samples and data. Through these innovations, the paper aims to promote the development of vision - language pre - training technology, enabling it to better handle complex multimodal data and long - text input.

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

CapsFusion: Rethinking Image-Text Data at Scale

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

VideoCon: Robust Video-Language Alignment via Contrast Captions

TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding

Distilling Vision-Language Models on Millions of Videos

Adaptively Building a Video-language Model for Video Captioning and Retrieval Without Massive Video Pretraining

Leveraging Vision-Language Pre-Trained Model and Contrastive Learning for Enhanced Multimodal Sentiment Analysis

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception