Abstract: In typical multimodal contrastive learning, such as CLIP, encoders produce one point in the latent representation space for each input. However, one-point representation has difficulty in capturing the relationship and the similarity structure of a huge amount of instances in the real world. For richer classes of the similarity, we propose the use of weighted point clouds, namely, sets of pairs of weight and vector, as representations of instances. In this work, we theoretically show the benefit of our proposed method through a new understanding of the contrastive loss of CLIP, which we call symmetric InfoNCE. We clarify that the optimal similarity that minimizes symmetric InfoNCE is the pointwise mutual information, and show an upper bound of excess risk on downstream classification tasks of representations that achieve the optimal similarity. In addition, we show that our proposed similarity based on weighted point clouds consistently achieves the optimal similarity. To verify the effectiveness of our proposed method, we demonstrate pretraining of text-image representation models and classification tasks on common benchmarks.
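The abstract names the CLIP contrastive loss "symmetric InfoNCE" without writing it out. As a rough illustration only, the sketch below computes a CLIP-style symmetric loss from a square similarity matrix in NumPy. The function names, the 0.5 averaging of the two directions, and the temperature value are assumptions made for this example; the paper's exact definition may differ in constants or normalization.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_infonce(sim):
    """CLIP-style symmetric contrastive loss (illustrative sketch).

    sim[i, j] is the similarity between image i and text j.
    Matched image-text pairs are assumed to lie on the diagonal.
    """
    n = sim.shape[0]
    idx = np.arange(n)
    # Image-to-text direction: each row is a softmax over candidate texts.
    loss_i2t = -log_softmax(sim, axis=1)[idx, idx].mean()
    # Text-to-image direction: each column is a softmax over candidate images.
    loss_t2i = -log_softmax(sim, axis=0)[idx, idx].mean()
    # Averaging the two directions is an assumption of this sketch.
    return 0.5 * (loss_i2t + loss_t2i)


# Toy usage with one-point embeddings (hypothetical data).
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.1 * rng.normal(size=(4, 8))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
temperature = 0.07  # assumed value, not taken from the text above
loss = symmetric_infonce(img @ txt.T / temperature)
```

With an uninformative similarity matrix (all entries equal), this loss equals the log of the batch size, and it approaches zero as the diagonal entries dominate their rows and columns.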

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses the limited ability of representations and similarity structures in multimodal contrastive learning to model real-world concepts. Specifically:

1. **Limitations of Existing Methods**:
   - Existing multimodal contrastive learning methods (such as CLIP) typically convert each input (text or image) into a single point embedding in the latent space.
   - This single-point representation struggles to capture the relationships and similarity structures among a large number of instances.
   - Real-world concepts have broad, inclusive relationships. For example, "a photo of a dog" can correspond to many different images, while "a photo of a poodle" corresponds to a subset of these images.

2. **Proposed New Method**:
   - To represent similarity more richly, the authors propose using weighted point clouds (i.e., sets of weight-vector pairs) as the representation of instances, called Weighted Point Cloud Embedding (WPCE).
   - By reinterpreting the CLIP contrastive loss, which they call symmetric InfoNCE, the authors show that the similarity minimizing this loss is the pointwise mutual information.
   - The authors also show that the similarity based on weighted point clouds can consistently achieve this optimal similarity.

3. **Theoretical and Experimental Validation**:
   - The authors theoretically analyze the symmetric InfoNCE loss and provide an upper bound on the excess risk, on downstream classification tasks, of representations that achieve the optimal similarity.
   - The method is evaluated by pretraining text-image representation models and running classification tasks on common benchmarks.

### Summary

This paper aims to improve the representation and similarity structure in multimodal contrastive learning by introducing Weighted Point Cloud Embedding (WPCE) to better model complex real-world concepts. Through theoretical analysis and experimental validation, the authors demonstrate the advantages of WPCE in enhancing similarity representation capabilities and downstream task performance.
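To make the idea of a weighted point cloud representation more concrete, here is a minimal sketch of one way a similarity between two sets of weight-vector pairs could be computed: a weighted sum of kernel values over all cross pairs of points. The text above does not specify the similarity function, so the Gaussian kernel, the `bandwidth` parameter, and the function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def gaussian_kernel(u, v, bandwidth=1.0):
    """Pairwise Gaussian kernel between rows of u (m, d) and v (n, d).

    The kernel choice is an assumption of this sketch.
    """
    sq_dist = ((u[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def wpc_similarity(w_x, v_x, w_y, v_y, kernel=gaussian_kernel):
    """Similarity between two weighted point clouds (illustrative).

    w_x: (m,) weights and v_x: (m, d) vectors for the first instance.
    w_y: (n,) weights and v_y: (n, d) vectors for the second instance.
    Returns sum_i sum_j w_x[i] * w_y[j] * kernel(v_x[i], v_y[j]).
    """
    return float(w_x @ kernel(v_x, v_y) @ w_y)


# Hypothetical example: a two-point cloud compared with a one-point cloud.
w_a = np.array([0.5, 0.5])
v_a = np.array([[0.0, 0.0], [1.0, 0.0]])
w_b = np.array([1.0])
v_b = np.array([[0.0, 0.0]])
score = wpc_similarity(w_a, v_a, w_b, v_b)
```

When each cloud contains a single point with weight 1, this reduces to an ordinary kernel similarity between two vectors, which is one way to see the one-point representation as a special case.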

Weighted Point Cloud Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric

Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos

Contrastive Multimodal Fusion with TupleInfoNCE

Adaptive Multi-head Contrastive Learning

Contrastive Learning Is Spectral Clustering On Similarity Graph

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

On the Importance of Contrastive Loss in Multimodal Learning

Towards Understanding the Mechanism of Contrastive Learning via Similarity Structure: A Theoretical Analysis

SimO Loss: Anchor-Free Contrastive Loss for Fine-Grained Supervised Contrastive Learning

Multi-Similarity Contrastive Learning

Modulated Contrast for Versatile Image Synthesis

Multimodal contrastive learning using point clouds and their rendered images

Topological Perspectives on Optimal Multimodal Embedding Spaces

Linking Representations with Multimodal Contrastive Learning

Similarity-Dissimilarity Loss with Supervised Contrastive Learning for Multi-label Classification

Hyperbolic Image-and-Pointcloud Contrastive Learning for 3D Classification

$\mathbb{X}$-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs

On the Generalization of Multi-modal Contrastive Learning

Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning