Abstract: Advances in technology have made multimodal data abundant, and such data can be used in many representation learning tasks. However, most methods ignore the rich modality correlation information stored in each multimodal object and fail to fully exploit the potential of multimodal data. To address this issue, cross-modal contrastive learning methods have been proposed to learn the similarity score of each modality pair in a self- or weakly-supervised manner and to improve model robustness. Though effective, contrastive learning based on unimodal representations can be inaccurate in some cases, because unimodal representations fail to reveal the global information of multimodal objects. To this end, we propose a contrastive learning pipeline based on multimodal representations to learn from the global view, and devise multiple techniques to generate negative and positive samples for each anchor. To generate positive samples, we apply the mix-up operation to mix the multimodal representations of two different objects that have the maximal label similarity. Moreover, we devise a permutation-invariant fusion mechanism that defines positive samples by permuting the input order of modalities for fusion and by sampling various contrastive fusion networks. In this way, we force the multimodal representation to be invariant to the order of modalities and to the structure of the fusion network, so that the model can capture high-level semantic information of multimodal objects. To define negative samples, for each modality, we randomly replace the unimodal representation with that from another, dissimilar object when synthesizing the multimodal representation. In this way, the model is led to capture the high-level co-occurrence information and the correspondence between modalities within each object. We also directly define the multimodal representation of another object as a negative sample, where the chosen object shares the minimal label similarity with the anchor. The label information is leveraged in the proposed framework to learn a more discriminative multimodal embedding space for downstream tasks. Extensive experiments demonstrate that our method outperforms previous state-of-the-art baselines on the tasks of multimodal sentiment analysis and humor detection.
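The sample construction described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the position-weighted `fuse` function stands in for a learned fusion network, `label_similarity` assumes scalar labels, the modality swap picks the least label-similar object instead of sampling randomly, and the loss is a generic InfoNCE-style objective rather than the authors' exact formulation.

```python
import math

# Hypothetical weights that make the toy fusion sensitive to modality order.
POSITION_WEIGHTS = [0.5, 0.3, 0.2]

def fuse(modalities):
    """Toy order-sensitive fusion: position-weighted sum of unimodal vectors."""
    dim = len(modalities[0])
    return [sum(w * m[i] for w, m in zip(POSITION_WEIGHTS, modalities))
            for i in range(dim)]

def label_similarity(a, b):
    """Assumed similarity for scalar labels: negative absolute difference."""
    return -abs(a - b)

def most_and_least_similar(anchor, labels):
    """Indices of the objects with maximal and minimal label similarity."""
    others = [i for i in range(len(labels)) if i != anchor]
    key = lambda i: label_similarity(labels[anchor], labels[i])
    return max(others, key=key), min(others, key=key)

def mixup(u, v, lam=0.5):
    """Convex combination of two multimodal representations."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def build_samples(anchor, unimodal, labels, swap_modality=0):
    """Return the anchor representation with its positives and negatives."""
    fused = [fuse(obj) for obj in unimodal]
    near, far = most_and_least_similar(anchor, labels)
    positives = [
        mixup(fused[anchor], fused[near]),       # mix-up with the most label-similar object
        fuse(list(reversed(unimodal[anchor]))),  # same object, permuted modality order
    ]
    swapped = list(unimodal[anchor])
    swapped[swap_modality] = unimodal[far][swap_modality]  # break cross-modal correspondence
    negatives = [
        fuse(swapped),  # one modality replaced by that of a dissimilar object
        fused[far],     # whole representation of the least label-similar object
    ]
    return fused[anchor], positives, negatives

def contrastive_loss(anchor, positives, negatives, temperature=0.5):
    """Generic InfoNCE-style loss over cosine similarities."""
    pos = [math.exp(cosine(anchor, p) / temperature) for p in positives]
    neg = [math.exp(cosine(anchor, n) / temperature) for n in negatives]
    return -math.log(sum(pos) / (sum(pos) + sum(neg)))

# Three toy objects, each with (text, audio, vision) vectors and a sentiment label.
unimodal = [
    [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]],
    [[0.9, 0.1], [1.0, 0.0], [0.7, 0.3]],
    [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]],
]
labels = [1.0, 0.8, -1.0]

anchor_vec, positives, negatives = build_samples(0, unimodal, labels)
loss = contrastive_loss(anchor_vec, positives, negatives)
```

In a trained model the fusion network would be learned, and minimizing such a loss would pull the permuted and mixed representations toward the anchor while pushing the swapped and dissimilar ones away.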

Contextual Augmented Global Contrast for Multimodal Intent Recognition

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Learning from the Global View: Supervised Contrastive Learning of Multimodal Representation

Cross-modal contrastive learning for multimodal sentiment recognition

A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues

Tri-CLT: Learning Tri-Modal Representations with Contrastive Learning and Transformer for Multimodal Sentiment Recognition

Video Contrastive Learning with Global Context

Text-Centric Multimodal Contrastive Learning for Sentiment Analysis

Improving the Modality Representation with Multi-View Contrastive Learning for Multimodal Sentiment Analysis

Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal Prediction for Multimodal Sentiment Analysis

Multiple Contrastive Learning for Multimodal Sentiment Analysis

Connecting Multi-modal Contrastive Representations

Multimodal Sentiment Analysis Based on Disentangled Representation Learning and Cross-Modal-context Association Mining

Enhancing Micro Gesture Recognition for Emotion Understanding via Context-aware Visual-Text Contrastive Learning

Self-adaptive Context and Modal-interaction Modeling For Multimodal Emotion Recognition

Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition

Cross-modality Representation Interactive Learning for Multimodal Sentiment Analysis

Hybrid Contrastive Learning of Tri-Modal Representation for Multimodal Sentiment Analysis

Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition

Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning