SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding

Chao Sun, Min Chen, Jialiang Cheng, Han Liang, Chuanbo Zhu, Jincai Chen
DOI: https://doi.org/10.1145/3581783.3613805
2023-01-01
Abstract: Audio and vision are important senses for high-level cognition, and their strong correlation makes audio-visual coding a crucial factor in many multimodal tasks. However, audio-visual coding faces two challenges. First, the heterogeneity of multimodal data often leads to misalignment of cross-modal features within the same sample, which reduces their representation quality. Second, most self-supervised learning frameworks are built on instance semantics, and the pseudo labels they generate introduce additional classification noise. To address these challenges, we propose a Supervised Cross-modal Contrastive Learning Framework for Audio-Visual Coding (SCLAV). Our framework includes an audio-visual coding network composed of an inter-modal attention interaction module and an intra-modal self-integration module, which leverage complementary and hidden multimodal information for better representation. Additionally, we introduce a supervised cross-modal contrastive loss to minimize the distance between audio and vision features of the same instance, and use weak labels of multimodal data to eliminate the feature-oriented classification noise. Extensive experiments on the AVE and XD-Violence datasets demonstrate that SCLAV outperforms state-of-the-art methods, even with limited computational resources.
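The abstract does not give the form of the supervised cross-modal contrastive loss, so the following is only a minimal sketch of one common formulation, not the authors' definition. It assumes L2-normalized audio and visual features, a temperature parameter, and that every cross-modal pair sharing a weak label (including the pair from the same instance) counts as a positive. The function names, the temperature value, and the symmetric averaging over both directions are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def directed_loss(anchors, targets, labels, temperature):
    """Contrast each anchor in one modality against all targets in the other.

    Assumption: targets whose weak label matches the anchor's label are
    positives; this always includes the target from the same instance.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all cross-modal candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        positives = [j for j, y in enumerate(labels) if y == labels[i]]
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / len(anchors)


def supervised_cross_modal_contrastive_loss(audio, visual, labels, temperature=0.1):
    """Average of the audio-to-visual and visual-to-audio losses (hypothetical helper)."""
    a = [l2_normalize(v) for v in audio]
    v = [l2_normalize(x) for x in visual]
    return 0.5 * (
        directed_loss(a, v, labels, temperature)
        + directed_loss(v, a, labels, temperature)
    )


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    labels = [0, 1]
    aligned = supervised_cross_modal_contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]], labels)
    swapped = supervised_cross_modal_contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]], labels)
    print(aligned < swapped)  # aligned features should give the smaller loss
```

In this sketch, features of the same instance (and of other samples with the same weak label) are pulled together across modalities, while features with different labels are pushed apart; using labels rather than instance identity is what distinguishes it from a purely self-supervised contrastive objective.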