Abstract: Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component for encoding natural language descriptions of images and audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding training affect language encoder quality and cross-modal task performance. In VL pretraining, we found that sentence embedding training improves language encoder quality and aids in cross-modal tasks, improving contrastive VL models such as CyCLIP. In contrast, AL pretraining benefits less from sentence embedding training, which may result from the limited amount of pretraining data. We analyze the representation spaces to understand the strengths of sentence embedding training, and find that it improves text-space uniformity, at the cost of decreased cross-modal alignment.
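To make the idea of adding sentence embedding training to contrastive pretraining more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the cross-modal term is a symmetric InfoNCE loss over paired text and image/audio embeddings, and that the unsupervised sentence embedding term treats two encodings of the same sentence as a positive pair (in the style of SimCSE). The function names `info_nce` and `joint_loss`, the `weight` parameter, and the temperature value are all illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i] among all keys."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, key) / temperature for key in keys]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

def joint_loss(text, other, text_second_view, weight=1.0, temperature=0.07):
    """Cross-modal contrastive loss plus a weighted text-only contrastive loss.

    text, other: paired embeddings (text with image or audio), same length.
    text_second_view: a second encoding of the same sentences (assumption:
    obtained with a different dropout mask, as in unsupervised SimCSE).
    """
    # Symmetric cross-modal term: text-to-other and other-to-text.
    cross_modal = 0.5 * (
        info_nce(text, other, temperature) + info_nce(other, text, temperature)
    )
    # Text-only term: the two views of each sentence are the positives.
    sentence = info_nce(text, text_second_view, temperature)
    return cross_modal + weight * sentence
```

With `weight=0.0` the sketch reduces to the cross-modal term alone, which makes it easy to compare training with and without the sentence embedding objective.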

What problem does this paper attempt to address?

This paper examines, and attempts to improve, the language encoders in contrastive cross-modal models such as CLIP and CLAP. The authors focus on four questions:

1. **Language encoder quality**: Existing contrastive cross-modal models perform well on vision-language (VL) and audio-language (AL) tasks, but their language encoders have received relatively little study. The authors therefore evaluate how unsupervised and supervised sentence embedding training affects language encoder quality.
2. **Cross-modal task performance**: The authors aim to improve performance on cross-modal tasks (such as zero-shot image retrieval and zero-shot audio classification) by improving the language encoder.
3. **Effect of the pretraining data**: Vision-language pretraining usually relies on large-scale datasets, while audio-language pretraining suffers from data scarcity. The authors study how effective sentence embedding training is in each setting and how it affects cross-modal task performance.
4. **Characteristics of the representation space**: The authors analyze the alignment and uniformity of the representation space learned with sentence embedding training to understand its effect on cross-modal tasks.

### Main contributions

- **Extensive evaluation of sentence embedding training**: The authors systematically evaluate the impact of unsupervised and supervised sentence embedding training on vision-language and audio-language contrastive pretraining. Performance on vision-language tasks improves, while the gains on audio-language tasks are relatively limited and unstable.
- **Improvement of the CyCLIP model**: Unsupervised sentence embedding training significantly improves the quality of CyCLIP's language encoder, which in turn improves cross-modal task performance.
- **Analysis of the representation space**: By analyzing the alignment and uniformity of the learned representation space, the authors find that sentence embedding training improves the uniformity of the text space but reduces cross-modal alignment.

### Experimental setup and results

- **Datasets**:
  - The vision-language model is pretrained on the Conceptual Captions dataset and evaluated on zero-shot image-text retrieval on Flickr30K and MSCOCO.
  - The audio-language model is pretrained on the Clotho and AudioCaps datasets and evaluated on zero-shot audio classification on ESC50 and UrbanSound8K.
- **Model architectures**:
  - The vision-language model uses ResNet-50 as the image encoder and a Transformer as the language encoder.
  - The audio-language model uses the pretrained RoBERTa-base as the language encoder and HTSAT as the audio encoder.
- **Experimental results**:
  - On vision-language tasks, unsupervised sentence embedding training significantly improves zero-shot retrieval performance, especially for text retrieval.
  - On audio-language tasks, the effect of sentence embedding training is less pronounced than on vision-language tasks, and the results fluctuate.

These findings offer a reference point for future work on improving language encoders in cross-modal contrastive learning.
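The alignment and uniformity analysis mentioned above can be illustrated with a short sketch. The summary does not state the exact formulas used in the paper, so this example assumes the widely used definitions of Wang and Isola (2020): alignment is the mean squared distance between normalized positive pairs, and uniformity is the log of the mean Gaussian potential over all distinct pairs. Lower is better for both. The function names and the default `t=2.0` are illustrative choices.

```python
import math
from itertools import combinations

def l2_normalize(vector):
    """Scale a vector (list of floats) to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(text_embs, other_embs):
    """Mean squared distance between paired text and image/audio embeddings.

    Lower values mean matching pairs sit closer together.
    """
    distances = [
        squared_distance(l2_normalize(t), l2_normalize(o))
        for t, o in zip(text_embs, other_embs)
    ]
    return sum(distances) / len(distances)

def uniformity(embs, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower values mean the embeddings are spread more evenly on the unit sphere.
    """
    normed = [l2_normalize(v) for v in embs]
    potentials = [
        math.exp(-t * squared_distance(a, b)) for a, b in combinations(normed, 2)
    ]
    return math.log(sum(potentials) / len(potentials))
```

Under these definitions, the reported trade-off would show up as a lower `uniformity` value for the text embeddings together with a higher `alignment` value for text and image/audio pairs.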

On the Language Encoder of Contrastive Cross-modal Models
