Abstract: Cross-modal alignment learning integrates information from different modalities such as text, image, audio, and video to create unified models. It develops shared representations and learns correlations between modalities, enabling applications such as visual question answering and audiovisual content analysis. Current techniques rely on large modality-specific encoders that must be fine-tuned or trained from scratch on vast aligned datasets (e.g., text-image, text-audio, image-audio). These techniques have three limitations: (i) training large encoders on extensive datasets is very expensive, (ii) large aligned paired datasets are hard to acquire, and (iii) adding a new modality requires retraining the entire framework. To address these issues, we propose OneEncoder, a lightweight framework that progressively represents and aligns four modalities (image, text, audio, video). Initially, we train a lightweight Universal Projection module (UP) to align the image and text modalities. Then, we freeze the pretrained UP and progressively align future modalities to those already aligned. Thanks to its lightweight design, OneEncoder operates efficiently and cost-effectively, even in scenarios where vast aligned datasets are unavailable. Trained on small paired datasets, it shows strong performance in tasks such as classification, querying, and visual question answering, surpassing methods that rely on large datasets and specialized encoders.

What problem does this paper attempt to address?

This paper aims to solve three key challenges in cross-modal alignment learning:

1. **High computational cost**: Current methods rely on large modality-specific encoders that must be trained from scratch or fine-tuned on huge aligned datasets, which is computationally very expensive.
2. **Difficulty in obtaining large aligned datasets**: Large paired multimodal datasets (text-image, text-audio, image-audio, and so on) are hard to build and acquire, especially for specific modality combinations.
3. **High cost of adding new modalities**: Whenever a new modality is introduced, the existing framework usually has to be retrained in full, which is time-consuming and increases the demand for computational resources.

To solve these problems, the authors propose **OneEncoder**, a lightweight framework that progressively represents and aligns four modalities (image, text, audio, video). Its main features are:

- **Lightweight design**: A lightweight Universal Projection module (UP) does the alignment, which avoids training large modality-specific encoders from scratch.
- **Progressive alignment**: The UP module is first trained to align the image and text modalities. The pretrained UP is then frozen, and each new modality (such as audio or video) is aligned to those already aligned by training only a compact Alignment Layer (AL).
- **Efficient and low-cost**: Even without vast aligned datasets, OneEncoder operates efficiently and performs well in tasks such as classification, querying, and visual question answering, surpassing methods that rely on large datasets and specialized encoders.

Through these improvements, OneEncoder provides a more cost-effective way to achieve multimodal alignment, reduces the dependence on large aligned datasets, and simplifies the integration of new modalities.
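
The second stage described above (a frozen UP, with only a small alignment layer trained for a new modality) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes that UP and AL are plain linear maps, that a squared-distance loss stands in for whatever objective the paper actually uses, and that the 2-dimensional feature vectors come from hypothetical frozen encoders. All names (`UP`, `train_alignment_layer`, `retrieve`) are made up for illustration.

```python
import math

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def train_alignment_layer(up, new_feats, anchor_feats, lr=0.1, epochs=500):
    """Fit a linear alignment layer `al` so that up(al(x_new)) is close to
    up(x_anchor) for each pair. `up` is only read, never updated (frozen)."""
    d = len(up)
    al = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(epochs):
        for x, anchor in zip(new_feats, anchor_feats):
            target = matvec(up, anchor)            # already-aligned modality
            pred = matvec(up, matvec(al, x))       # new modality via AL, then UP
            err = [p - t for p, t in zip(pred, target)]
            # Gradient of 0.5 * ||pred - target||^2 w.r.t. the AL output
            # is UP^T err; the gradient w.r.t. al is its outer product with x.
            grad_h = [sum(up[k][i] * err[k] for k in range(d)) for i in range(d)]
            for i in range(d):
                for j in range(d):
                    al[i][j] -= lr * grad_h[i] * x[j]
    return al

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(up, al, query_feat, candidate_feats):
    """Return the index of the candidate closest to the query in the shared space."""
    q = matvec(up, matvec(al, query_feat))
    scores = [cosine(q, matvec(up, c)) for c in candidate_feats]
    return scores.index(max(scores))

# Hypothetical pretrained, frozen UP (stage 1 is assumed to be done already).
UP = [[1.0, 0.5], [0.0, 1.0]]
# Toy paired features: audio_feats[i] is paired with image_feats[i].
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_feats = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

al = train_alignment_layer(UP, audio_feats, image_feats)
matches = [retrieve(UP, al, q, image_feats) for q in audio_feats]
print(matches)
```

On this toy data the learned alignment layer approaches the matrix that swaps the two coordinates, and each audio query retrieves its paired image. The point of the sketch is the parameter split: the loop updates `al` only, so adding a modality leaves `UP` and the earlier modalities untouched.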

OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
