Abstract: Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that CSA outperforms CLIP while requiring $300,000\times$ fewer multimodal data pairs and $6\times$ fewer unimodal data for ImageNet classification and misinformative news captions detection. CSA surpasses the state-of-the-art method to map unimodal features to multimodal features. We also demonstrate the ability of CSA with modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.

What problem does this paper attempt to address?

The paper addresses the following question: **How can unimodal features be mapped efficiently into a multimodal feature space, using two unimodal encoders and a limited amount of data, while achieving performance comparable to multimodal encoders such as CLIP?**

Existing multimodal encoders (such as CLIP) perform well in tasks such as zero-shot image classification and cross-modal retrieval, but they require a large amount of training data. For example, OpenAI trained the CLIP model on 400 million image-text pairs using 592 V100 GPUs. Moreover, more data does not necessarily guarantee better performance: the quality of Internet data varies, and mislabeled data may lead to failure modes for specific instances.

To solve these problems, the authors propose **Canonical Similarity Analysis (CSA)**. CSA achieves its goals in the following ways:

1. **Utilizing unimodal encoders**: CSA uses two pre-trained unimodal encoders (such as DINO and GTR). These encoders are trained on unimodal data only, which is easier to obtain, and they require far less data than multimodal models.
2. **Mapping unimodal features to the multimodal space**: CSA maps unimodal features to a shared multimodal feature space while retaining multimodal information and removing redundant information.
3. **Introducing a new similarity score**: CSA uses a new weighted cosine similarity score to emulate CLIP's similarity score, enabling downstream tasks such as cross-modal retrieval, classification, and mislabeling detection.

With this method, CSA can match or outperform CLIP using only a very small amount of paired multimodal data (300,000 times fewer pairs than CLIP) and a small amount of unimodal data (6 times less than CLIP). CSA also supports other modality combinations (such as lidar and text), paving the way for new modality pairs with limited paired multimodal data but abundant unpaired unimodal data.
### Key Formulas

- **Weighted Cosine Similarity Score**:

  \[ S(x_1^j, x_2^j; s)=\frac{\sum_{i = 1}^{s}\rho_i(A^*\hat{z}_1^j)_i(B^*\hat{z}_2^j)_i}{\|(A^*\hat{z}_1^j)_{1:s}\|_2\|(B^*\hat{z}_2^j)_{1:s}\|_2} \]

  where $(A^*\hat{z}_1^j)_{1:s}$ represents the row vector of the first $s$ dimensions, and $\rho_i$ is the $i$-th correlation coefficient.

- **Canonical Correlation Analysis (CCA) Optimization Problem**:

  \[ A^*, B^*=\arg\max_{A\in\mathbb{R}^{r\times q_1}, B\in\mathbb{R}^{r\times q_2}}\text{Tr}(A\hat{Z}_1(B\hat{Z}_2)^\top) \]

  subject to:

  \[ (A\hat{Z}_1)(A\hat{Z}_1)^\top=(B\hat{Z}_2)(B\hat{Z}_2)^\top = I_r \]

Through these methods, CSA not only reduces the required computational resources and data volume but also demonstrates excellent performance in multiple tasks.
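The CCA problem and the weighted cosine similarity score can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes that $\hat{Z}_1$ and $\hat{Z}_2$ are mean-centered feature matrices with one paired sample per column, solves CCA with the standard whitening-plus-SVD approach, and uses synthetic features in place of real encoder outputs. The helper names (`inv_sqrt`, `fit_cca`, `weighted_cosine`) and the `eps` floor are hypothetical choices made for this example.

```python
import numpy as np


def inv_sqrt(sym, eps=1e-8):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(sym)
    vals = np.maximum(vals, eps)  # assumption: small floor for numerical stability
    return (vecs / np.sqrt(vals)) @ vecs.T


def fit_cca(Z1, Z2, r):
    """Fit CCA on paired features Z1 (q1, n) and Z2 (q2, n); columns are samples.

    Returns A (r, q1), B (r, q2), the top-r correlations rho, and both means.
    """
    mu1 = Z1.mean(axis=1, keepdims=True)
    mu2 = Z2.mean(axis=1, keepdims=True)
    Z1c, Z2c = Z1 - mu1, Z2 - mu2
    W1 = inv_sqrt(Z1c @ Z1c.T)  # whitening transform for modality 1
    W2 = inv_sqrt(Z2c @ Z2c.T)  # whitening transform for modality 2
    # Singular values of the whitened cross-covariance are the correlations.
    U, rho, Vt = np.linalg.svd(W1 @ (Z1c @ Z2c.T) @ W2)
    A = U[:, :r].T @ W1
    B = Vt[:r, :] @ W2
    return A, B, rho[:r], mu1, mu2


def weighted_cosine(a, b, rho, s):
    """Weighted cosine similarity over the first s projected dimensions."""
    a, b = a[:s], b[:s]
    num = np.sum(rho[:s] * a * b)
    return float(num / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synthetic stand-in for two unimodal encoders that share a 3-d latent signal.
rng = np.random.default_rng(0)
n, q1, q2, r, s = 500, 8, 6, 4, 3
latent = rng.normal(size=(3, n))
Z1 = rng.normal(size=(q1, 3)) @ latent + 0.1 * rng.normal(size=(q1, n))
Z2 = rng.normal(size=(q2, 3)) @ latent + 0.1 * rng.normal(size=(q2, n))

A, B, rho, mu1, mu2 = fit_cca(Z1, Z2, r)
P1 = A @ (Z1 - mu1)  # projected features of modality 1, shape (r, n)
P2 = B @ (Z2 - mu2)  # projected features of modality 2, shape (r, n)

# Compare scores of true pairs against deliberately misaligned pairs.
matched = np.mean([weighted_cosine(P1[:, j], P2[:, j], rho, s) for j in range(n)])
shuffled = np.mean(
    [weighted_cosine(P1[:, j], P2[:, (j + 1) % n], rho, s) for j in range(n)]
)
print("mean score, matched pairs:   ", matched)
print("mean score, mismatched pairs:", shuffled)
```

In this sketch the projected features satisfy the orthonormality constraint up to numerical error, and matched pairs receive a higher average score than mismatched ones. That gap is the property that retrieval, classification, and mislabeling detection would rely on.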

CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features
