OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities

Bilal Faye, Hanane Azzag, Mustapha Lebbah
2024-09-18
Abstract: Cross-modal alignment learning integrates information from different modalities such as text, image, audio, and video to create unified models. This approach develops shared representations and learns correlations between modalities, enabling applications such as visual question answering and audiovisual content analysis. Current techniques rely on large modality-specific encoders, necessitating fine-tuning or training from scratch on vast aligned datasets (e.g., text-image, text-audio, image-audio). This approach has limitations: (i) it is very expensive due to the need for training large encoders on extensive datasets, (ii) acquiring large aligned paired datasets is challenging, and (iii) adding new modalities requires retraining the entire framework to incorporate them. To address these issues, we propose OneEncoder, a lightweight framework that progressively represents and aligns four modalities (image, text, audio, video). Initially, we train a lightweight Universal Projection module (UP) to align image and text modalities. Then, we freeze the pretrained UP and progressively align future modalities to those already aligned. Owing to its lightweight design, OneEncoder operates efficiently and cost-effectively, even in scenarios where vast aligned datasets are unavailable. Trained on small paired datasets, it shows strong performance in tasks like classification, querying, and visual question answering, surpassing methods that rely on large datasets and specialized encoders.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses three key challenges in cross-modal alignment learning:

1. **High computational cost**: Current methods rely on large modality-specific encoders that must be trained from scratch or fine-tuned on vast aligned datasets, which is very expensive.
2. **Difficulty of obtaining large aligned datasets**: Large paired multimodal datasets (text-image, text-audio, image-audio, and so on) are hard to build and acquire, especially for specific modality combinations.
3. **High cost of adding new modalities**: Introducing a new modality usually requires retraining the entire framework, which is time-consuming and increases the demand for computational resources.

To address these problems, the authors propose **OneEncoder**, a lightweight framework that progressively represents and aligns four modalities (image, text, audio, video). Its main features are:

- **Lightweight design**: A lightweight Universal Projection (UP) module handles alignment, which avoids the need to train large modality-specific encoders from scratch.
- **Progressive alignment**: The UP module is first trained to align the image and text modalities. The pretrained UP is then frozen, and each new modality (such as audio or video) is aligned to those already aligned by training only a compact Alignment Layer (AL).
- **Efficient and low-cost**: Even without vast aligned datasets, OneEncoder operates efficiently and performs well on tasks such as classification, querying, and visual question answering, surpassing methods that rely on large datasets and specialized encoders.

Through these improvements, OneEncoder provides a more cost-effective way to achieve multi-modal alignment, reduces the dependence on large-scale aligned datasets, and simplifies the integration of new modalities.
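
The progressive schedule summarized above (train the UP on image-text, then freeze it and train only a new AL per added modality) can be illustrated with a small bookkeeping sketch. This is not the authors' implementation: it performs no real training, the class and method names (`Component`, `ProgressiveAligner`, `align_initial`, `add_modality`) are hypothetical, the parameter counts are made-up placeholders, and freezing earlier alignment layers when a later modality is added is an assumption rather than something the summary states. It only tracks which components would receive gradient updates at each stage.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A named block of parameters that is either trainable or frozen."""
    name: str
    num_params: int  # hypothetical size, for illustration only
    trainable: bool = True


@dataclass
class ProgressiveAligner:
    """Tracks which components are updated at each alignment stage."""
    universal_projection: Component
    alignment_layers: dict = field(default_factory=dict)
    aligned_modalities: list = field(default_factory=list)

    def align_initial(self, first: str, second: str) -> list:
        """Stage 1: train the UP on the first modality pair (e.g. image-text)."""
        self.universal_projection.trainable = True
        self.aligned_modalities = [first, second]
        return self.trainable_components()

    def add_modality(self, modality: str, al_params: int) -> list:
        """Later stages: freeze the UP and train only a new alignment layer."""
        self.universal_projection.trainable = False
        for layer in self.alignment_layers.values():
            layer.trainable = False  # assumption: earlier ALs stay fixed too
        self.alignment_layers[modality] = Component(f"AL[{modality}]", al_params)
        self.aligned_modalities.append(modality)
        return self.trainable_components()

    def trainable_components(self) -> list:
        """Names of the components that would be updated right now."""
        parts = [self.universal_projection, *self.alignment_layers.values()]
        return [p.name for p in parts if p.trainable]

    def trainable_param_count(self) -> int:
        """Total size of the components that would be updated right now."""
        parts = [self.universal_projection, *self.alignment_layers.values()]
        return sum(p.num_params for p in parts if p.trainable)


# Placeholder sizes: the UP is small, each AL is smaller still.
aligner = ProgressiveAligner(Component("UP", num_params=4_000_000))
stage1 = aligner.align_initial("image", "text")
stage2 = aligner.add_modality("audio", al_params=65_000)
stage3 = aligner.add_modality("video", al_params=65_000)
print(stage1, stage2, stage3)
```

In this sketch, only the UP is trainable during the image-text stage, and each later stage updates just the newly added alignment layer. That mirrors the cost argument made in the summary: adding a modality does not require retraining what is already aligned.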