Abstract: Multi-modal embeddings form the foundation for vision-language models, such as CLIP embeddings, the most widely used text-image embeddings. However, these embeddings are vulnerable to subtle misalignment of cross-modal features, resulting in decreased model performance and diminished generalization. To address this problem, we design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. ModalChorus primarily offers a two-stage process: 1) embedding probing with Modal Fusion Map (MFM), a novel parametric dimensionality reduction method that integrates both metric and nonmetric objectives to enhance modality fusion; and 2) embedding alignment that allows users to interactively articulate intentions for both point-set and set-set alignments. Quantitative and qualitative comparisons for CLIP embeddings with existing dimensionality reduction (e.g., t-SNE and MDS) and data fusion (e.g., data context map) methods demonstrate the advantages of MFM in showcasing cross-modal features over common vision-language datasets. Case studies reveal that ModalChorus can facilitate intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.

What problem does this paper attempt to address?

This paper addresses subtle misalignment in multi-modal embeddings, which can degrade model performance and weaken generalization. Specifically:

1. **Misalignment problems in multi-modal embeddings**: Multi-modal embeddings encode complex many-to-many mappings between modalities (such as text and image), which makes them prone to concept entanglement. For example, in text-to-image generation, the text prompt "Monet's water lily pond" may become entangled with the "bridge" concept in the image, reducing the diversity of the generated images.
2. **Limitations of existing methods**: Existing methods for evaluating misalignment usually rely on reference-based metrics (such as CIDEr and SPICE), which require a large amount of manually annotated reference data. Reference-free metrics (such as CLIPScore) need no reference data, but they still depend on pre-trained models and struggle to detect misalignment in diverse scenarios. In addition, existing fine-tuning techniques have limited effectiveness against misalignment in specific scenarios.

To solve these problems, the paper proposes an interactive system named **ModalChorus** for visual probing and alignment of multi-modal embeddings. ModalChorus consists of two main stages:

- **Embedding probing stage**: A new parametric dimensionality reduction method, **Modal Fusion Map (MFM)**, combines metric and non-metric objectives to enhance modality fusion. MFM addresses the modality gap by preserving the relative order of intra-modal and cross-modal distances.
- **Embedding alignment stage**: An interactive alignment scheme supports point-set and set-set alignment, allowing users to perform fine-grained alignment operations according to their intentions.

In addition, a concept-axis view provides a linear representation for probing and aligning multi-modal embeddings. Through these two stages, ModalChorus can help users intuitively discover and correct misalignment in multi-modal embeddings, thereby improving the performance and generalization ability of the model.

### Formula presentation

The description of MFM involves the following key formulas:

- The combined distance matrix \( M \):

  \[ M = \begin{pmatrix} I_I & I_T \\ T_I & T_T \end{pmatrix} \]

  where \( I_I \) is the image distance sub-matrix, \( T_T \) is the text distance sub-matrix, and \( I_T \) and \( T_I \) are the cross-modal distance sub-matrices.
- Assume that the high-dimensional embedding space contains a subspace or manifold surface \( S \) such that the embeddings from the two modalities, projected onto this surface, obtain an optimized two-dimensional parametric representation \( S(x, y) \).

Through these formulas, MFM can flexibly combine different objectives to project joint multi-modal embeddings.

### Summary

This paper aims to solve subtle misalignment in multi-modal embeddings and proposes a new interactive system, ModalChorus. By introducing Modal Fusion Map and an interactive alignment scheme, it helps users intuitively discover and correct misalignment in multi-modal embeddings, thereby enhancing model performance.
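To make the block structure of the combined distance matrix \( M \) concrete, here is a minimal Python sketch. It is not the paper's implementation: the choice of cosine distance, the random stand-in embeddings, and the helper names `cosine_distance` and `combined_distance_matrix` are all assumptions made for illustration.

```python
import numpy as np


def cosine_distance(a, b):
    """Pairwise cosine distance between rows of a (n, d) and rows of b (m, d).

    Cosine distance is an assumption here; the summary does not state
    which distance the sub-matrices use.
    """
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def combined_distance_matrix(image_emb, text_emb):
    """Assemble M = [[I_I, I_T], [T_I, T_T]] from two sets of embeddings."""
    i_i = cosine_distance(image_emb, image_emb)  # image-image block
    i_t = cosine_distance(image_emb, text_emb)   # image-text (cross-modal) block
    t_i = i_t.T                                  # text-image block, transpose of I_T
    t_t = cosine_distance(text_emb, text_emb)    # text-text block
    return np.block([[i_i, i_t], [t_i, t_t]])


# Random vectors standing in for real image/text embeddings (hypothetical data).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))  # 4 "images", 8-dimensional
text_emb = rng.normal(size=(3, 8))   # 3 "texts", 8-dimensional

M = combined_distance_matrix(image_emb, text_emb)
print(M.shape)
```

With 4 image vectors and 3 text vectors, `M` is a 7 by 7 symmetric matrix whose off-diagonal blocks hold the cross-modal distances. A projection method can then treat the intra-modal and cross-modal blocks with different objectives, which is the kind of flexibility the summary attributes to MFM.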

ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map

Cross-modality Representation Interactive Learning for Multimodal Sentiment Analysis

Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP

What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?

Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

Deep Multi-Modal Sets

Interpretation on Multi-modal Visual Fusion

From Unimodal to Multimodal: Scaling up Projectors to Align Modalities

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion

Improving Multi-Modal Learning with Uni-Modal Teachers

Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment

SimCMF: A Simple Cross-modal Fine-tuning Strategy from Vision Foundation Models to Any Imaging Modality

Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering

Toward Robust Multimodal Learning using Multimodal Foundational Models

ModaVerse: Efficiently Transforming Modalities with LLMs

Leveraging Intra-modal and Inter-modal Interaction for Multi-Modal Entity Alignment

Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation

Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities

Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval