CLAP: Contrastive Language-Audio Pre-training Model for Multi-modal Sentiment Analysis.

Tianqi Zhao, Ming Kong, Tian Liang, Qiang Zhu, Kun Kuang, Fei Wu
DOI: https://doi.org/10.1145/3591106.3592296
2023-01-01
Abstract: Multi-modal Sentiment Analysis (MSA) is an active research topic in multi-modal fusion. To make full use of the correlation and complementarity between modalities when fusing multi-modal data, we propose a two-stage framework of Contrastive Language-Audio Pre-training (CLAP) for the MSA task: 1) performing contrastive pre-training on large-scale unlabeled external data to yield better single-modal representations; 2) adopting a Transformer-based multi-modal fusion module to further optimize the single-modal features and predict sentiment through task-driven training. Our work demonstrates the importance of core elements such as pre-training, contrastive learning, and representation learning for the MSA task, and significantly outperforms existing methods on two well-recognized MSA benchmarks.
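
The abstract does not state which contrastive objective is used in stage 1. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss over paired text and audio embeddings. The function names, the temperature value of 0.07, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Assumed CLIP-style loss: text_emb[i] and audio_emb[i] form a positive pair.

    All other pairings in the batch act as negatives. The temperature
    default is an illustrative choice, not a value from the paper.
    """
    text = [l2_normalize(t) for t in text_emb]
    audio = [l2_normalize(a) for a in audio_emb]
    # Cosine similarity of every text with every audio clip, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(t, a)) / temperature for a in audio]
        for t in text
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    text_to_audio = cross_entropy_diagonal(logits)
    audio_to_text = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (text_to_audio + audio_to_text)


# Toy batch of two pairs: matched pairs should give a lower loss than swapped ones.
text_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[1.0, 0.0], [0.0, 1.0]]
swapped_audio = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(text_batch, aligned_audio)
loss_swapped = symmetric_contrastive_loss(text_batch, swapped_audio)
```

In this toy batch the loss is lower when each text embedding is paired with its matching audio embedding. That is the property a contrastive pre-training stage relies on to pull the two single-modal representations together before fusion.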