CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features

Po-han Li, Sandeep P. Chinchali, Ufuk Topcu
2024-10-10
Abstract: Multimodal encoders like CLIP excel in tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate multimodal encoders using limited data. CSA maps unimodal features into a multimodal space, using a new similarity score to retain only the multimodal information. CSA only involves the inference of unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Experiments show that CSA outperforms CLIP while requiring $300,000\times$ fewer multimodal data pairs and $6\times$ fewer unimodal data for ImageNet classification and misinformative news captions detection. CSA surpasses the state-of-the-art method to map unimodal features to multimodal features. We also demonstrate the ability of CSA with modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Information Retrieval
### What problem does this paper attempt to address?
The problem this paper attempts to solve is: **How can unimodal features be mapped efficiently into a multimodal feature space, using two unimodal encoders and a limited amount of data, while achieving performance comparable to multimodal encoders such as CLIP?**

Existing multimodal encoders such as CLIP perform well in tasks such as zero-shot image classification and cross-modal retrieval, but they require a large amount of training data. For example, OpenAI trained the CLIP model using 400 million image-text pairs and required 592 V100 GPUs. Moreover, more data does not necessarily guarantee better performance: the quality of Internet data varies, and mislabeled data may lead to failure modes for specific instances.

To address these problems, the authors propose **Canonical Similarity Analysis (CSA)**. CSA achieves its goals in the following ways:

1. **Utilizing unimodal encoders**: CSA uses two pre-trained unimodal encoders (such as DINO and GTR). These encoders are trained on unimodal data only, which is easier to obtain, and they require far less data than multimodal models.
2. **Mapping unimodal features to the multimodal space**: CSA maps unimodal features to a shared multimodal feature space while retaining multimodal information and removing redundant information.
3. **Introducing a new similarity score**: CSA uses a new weighted cosine similarity score to emulate CLIP's similarity score, which supports downstream tasks such as cross-modal retrieval, classification, and mislabeling detection.

With this approach, CSA can match or outperform CLIP using only a very small amount of paired multimodal data (300,000 times fewer pairs than CLIP) and less unimodal data (6 times fewer than CLIP). CSA also supports other modality pairs (such as lidar and text), paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data.
### Key Formulas

- **Weighted Cosine Similarity Score**:

  \[ S(x_1^j, x_2^j; s) = \frac{\sum_{i=1}^{s} \rho_i (A^*\hat{z}_1^j)_i (B^*\hat{z}_2^j)_i}{\|(A^*\hat{z}_1^j)_{1:s}\|_2 \, \|(B^*\hat{z}_2^j)_{1:s}\|_2} \]

  where \((A^*\hat{z}_1^j)_{1:s}\) represents the row vector of the first \(s\) dimensions, and \(\rho_i\) is the \(i\)-th correlation coefficient.

- **Canonical Correlation Analysis (CCA) Optimization Problem**:

  \[ A^*, B^* = \arg\max_{A \in \mathbb{R}^{r \times q_1},\, B \in \mathbb{R}^{r \times q_2}} \text{Tr}\left(A\hat{Z}_1 (B\hat{Z}_2)^\top\right) \]

  subject to

  \[ (A\hat{Z}_1)(A\hat{Z}_1)^\top = (B\hat{Z}_2)(B\hat{Z}_2)^\top = I_r \]

Through these methods, CSA not only reduces the required computational resources and data volume but also demonstrates excellent performance in multiple tasks.
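The two formulas above can be illustrated with a short NumPy sketch. This is not the authors' implementation. The function names `fit_cca` and `csa_similarity`, the whitening-plus-SVD solver, the small ridge term `reg`, and the synthetic data are all assumptions made for illustration. The sketch also assumes that feature matrices store one sample per column and that \(\hat{Z}\) denotes mean-centered features.

```python
import numpy as np


def _inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return (V / np.sqrt(w)) @ V.T  # V diag(w^-1/2) V^T


def fit_cca(Z1, Z2, r, reg=1e-6):
    """Solve the CCA problem for Z1 (q1, N) and Z2 (q2, N), paired by column.

    Returns A (r, q1), B (r, q2), the top-r correlation coefficients rho,
    and the feature means used for centering.
    `reg` is a hypothetical ridge term added only for numerical stability.
    """
    mu1 = Z1.mean(axis=1, keepdims=True)
    mu2 = Z2.mean(axis=1, keepdims=True)
    Z1c, Z2c = Z1 - mu1, Z2 - mu2
    C11 = Z1c @ Z1c.T + reg * np.eye(Z1.shape[0])
    C22 = Z2c @ Z2c.T + reg * np.eye(Z2.shape[0])
    C12 = Z1c @ Z2c.T
    W1, W2 = _inv_sqrt(C11), _inv_sqrt(C22)
    # Singular values of the whitened cross-covariance are the correlations.
    U, rho, Vt = np.linalg.svd(W1 @ C12 @ W2, full_matrices=False)
    A = U[:, :r].T @ W1
    B = Vt[:r, :] @ W2
    return A, B, rho[:r], mu1, mu2


def csa_similarity(z1, z2, A, B, rho, mu1, mu2, s):
    """Weighted cosine similarity over the first s canonical dimensions."""
    p1 = (A @ (z1 - mu1.ravel()))[:s]
    p2 = (B @ (z2 - mu2.ravel()))[:s]
    num = np.sum(rho[:s] * p1 * p2)
    den = np.linalg.norm(p1) * np.linalg.norm(p2)
    return num / den


# Synthetic stand-ins for unimodal encoder outputs: both "modalities"
# are noisy linear views of the same 3-dimensional latent signal.
rng = np.random.default_rng(0)
N, q1, q2, k = 500, 8, 6, 3
latent = rng.normal(size=(k, N))
Z1 = rng.normal(size=(q1, k)) @ latent + 0.1 * rng.normal(size=(q1, N))
Z2 = rng.normal(size=(q2, k)) @ latent + 0.1 * rng.normal(size=(q2, N))

A, B, rho, mu1, mu2 = fit_cca(Z1, Z2, r=4)
matched = csa_similarity(Z1[:, 0], Z2[:, 0], A, B, rho, mu1, mu2, s=3)
mismatched = csa_similarity(Z1[:, 0], Z2[:, 1], A, B, rho, mu1, mu2, s=3)
print("correlations:", rho)
print("matched pair score:", matched)
print("mismatched pair score:", mismatched)
```

In this sketch the only heavy operations are two eigendecompositions and one SVD, which is consistent with the abstract's description of a cubic-complexity matrix decomposition in place of GPU training. Choosing `s` smaller than `r` discards the low-correlation dimensions, which is one way to read the claim that the score retains only the multimodal information.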