Abstract: Multi-modal embeddings form the foundation for vision-language models, such as CLIP embeddings, the most widely used text-image embeddings. However, these embeddings are vulnerable to subtle misalignment of cross-modal features, resulting in decreased model performance and diminished generalization. To address this problem, we design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. ModalChorus primarily offers a two-stage process: 1) embedding probing with Modal Fusion Map (MFM), a novel parametric dimensionality reduction method that integrates both metric and nonmetric objectives to enhance modality fusion; and 2) embedding alignment that allows users to interactively articulate intentions for both point-set and set-set alignments. Quantitative and qualitative comparisons for CLIP embeddings with existing dimensionality reduction (e.g., t-SNE and MDS) and data fusion (e.g., data context map) methods demonstrate the advantages of MFM in showcasing cross-modal features over common vision-language datasets. Case studies reveal that ModalChorus can facilitate intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.
Computer Vision and Pattern Recognition, Artificial Intelligence, Human-Computer Interaction, Information Retrieval
What problem does this paper attempt to address?
This paper addresses subtle misalignment in multi-modal embeddings, which can degrade model performance and weaken generalization. Specifically:
1. **Misalignment in multi-modal embeddings**: Multi-modal embeddings involve complex many-to-many mappings between modalities (such as text and image), which makes them prone to concept entanglement. For example, in text-to-image generation, the prompt "Monet's water lily pond" may become entangled with the "bridge" concept in images, which reduces the diversity of the generated images.
2. **Limitations of existing methods**: Existing approaches to evaluating misalignment usually rely on reference-based metrics (such as CIDEr and SPICE), which require a large amount of manually annotated reference data. Reference-free metrics (such as CLIPScore) need no reference data, but they still depend on pre-trained models and struggle to detect misalignment across diverse scenarios. In addition, existing fine-tuning techniques have limited effectiveness against misalignment in specific scenarios.
To solve these problems, the paper proposes an interactive system named **ModalChorus** for the visual probing and alignment of multi-modal embeddings. ModalChorus consists of two stages:
- **Embedding probing stage**: A new parametric dimensionality reduction method, **Modal Fusion Map (MFM)**, combines metric and nonmetric objectives to enhance modality fusion. MFM addresses the modality gap by preserving the relative order of intra-modal and cross-modal distances.
- **Embedding alignment stage**: An interactive alignment scheme supports point-set and set-set alignment, allowing users to perform fine-grained alignment operations according to their intentions. In addition, a concept-axis view provides a linear representation for probing and aligning multi-modal embeddings.
Through these two stages, ModalChorus helps users intuitively discover and correct misalignment in multi-modal embeddings, thereby improving the model's performance and generalization ability.
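The idea of combining a metric and a nonmetric objective can be illustrated with a small loss function. The following is only a minimal sketch under stated assumptions, not the paper's formulation: the function names, the Euclidean distance, the weight `alpha`, and the choice to apply an MDS-style stress term to intra-modal pairs and an order-preserving hinge term to cross-modal pairs are all hypothetical.

```python
import numpy as np

def pairwise_dist(x, y):
    """Euclidean distances between rows of x and rows of y."""
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)

def metric_stress(d_high, d_low):
    """Metric term (MDS-style): penalize differences in distance values."""
    return float(np.mean((d_high - d_low) ** 2))

def rank_violation(d_high, d_low, margin=0.0):
    """Nonmetric term: hinge penalty when the order of two distances
    within a row of d_high is reversed in d_low."""
    # closer[i, j, k] is True when j is strictly closer to i than k (high-dim).
    closer = d_high[:, :, None] < d_high[:, None, :]
    gap = d_low[:, :, None] - d_low[:, None, :] + margin
    return float(np.sum(np.maximum(gap, 0.0) * closer) / max(closer.sum(), 1))

def fusion_loss(img_hi, txt_hi, img_lo, txt_lo, alpha=0.5):
    """Assumed combination: metric stress on intra-modal pairs,
    rank preservation on cross-modal (image-text) pairs."""
    intra = (metric_stress(pairwise_dist(img_hi, img_hi), pairwise_dist(img_lo, img_lo))
             + metric_stress(pairwise_dist(txt_hi, txt_hi), pairwise_dist(txt_lo, txt_lo)))
    cross = rank_violation(pairwise_dist(img_hi, txt_hi), pairwise_dist(img_lo, txt_lo))
    return alpha * intra + (1.0 - alpha) * cross

# Toy data standing in for CLIP embeddings (random, for illustration only).
rng = np.random.default_rng(0)
img_hi, txt_hi = rng.normal(size=(5, 8)), rng.normal(size=(4, 8))
img_lo, txt_lo = rng.normal(size=(5, 2)), rng.normal(size=(4, 2))

loss_random = fusion_loss(img_hi, txt_hi, img_lo, txt_lo)
loss_identity = fusion_loss(img_hi, txt_hi, img_hi, txt_hi)
```

In a parametric method, a loss of this kind would be minimized over the weights of a projection network rather than evaluated on fixed points as done here.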
### Formula presentation
The description of MFM relies on the following key definitions:
- The combined distance matrix \( M \):
\[
M=\begin{pmatrix}
I_I&I_T\\
T_I&T_T
\end{pmatrix}
\]
where \( I_I \) is the image distance sub-matrix, \( T_T \) is the text distance sub-matrix, and \( I_T \) and \( T_I \) are the cross-modal (image-to-text and text-to-image) distance sub-matrices.
- The method assumes that the high-dimensional embedding space contains a subspace or manifold surface \( S \), such that projecting the embeddings of both modalities onto this surface yields an optimized two-dimensional parametric representation \( S(x, y) \).
With these definitions, MFM can flexibly combine different objectives to project multi-modal embeddings jointly.
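The block structure of \( M \) can be made concrete with a few lines of NumPy. This is a minimal sketch: the cosine distance and the helper names are assumptions, since the summary does not state which distance the sub-matrices use, and the embeddings are random placeholders.

```python
import numpy as np

def cosine_distance(a, b):
    """Pairwise cosine distance between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def combined_distance_matrix(image_emb, text_emb):
    """Assemble M = [[I_I, I_T], [T_I, T_T]] from the two modalities."""
    I_I = cosine_distance(image_emb, image_emb)  # image-image block
    I_T = cosine_distance(image_emb, text_emb)   # image-text block
    T_I = I_T.T                                  # text-image block (transpose)
    T_T = cosine_distance(text_emb, text_emb)    # text-text block
    return np.block([[I_I, I_T], [T_I, T_T]])

# Placeholder embeddings: 4 images and 3 texts in an 8-dimensional space.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 8))
text_emb = rng.normal(size=(3, 8))
M = combined_distance_matrix(image_emb, text_emb)
```

With a symmetric distance such as this one, \( T_I \) is the transpose of \( I_T \), so \( M \) is itself symmetric.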
### Summary
This paper addresses subtle misalignment in multi-modal embeddings and proposes a new interactive system, ModalChorus. By introducing Modal Fusion Map and an interactive alignment scheme, it helps users intuitively discover and correct misalignment in multi-modal embeddings, thereby enhancing model performance.