Abstract: By observing a scene and listening to the corresponding audio cues, humans can easily recognize where a sound comes from. To achieve such cross-modal perception on machines, existing methods localize the sound source using maps obtained by interpolation operations. Because semantic object-level localization is more attractive for prospective practical applications, we argue that these map-based methods offer only a coarse-grained and indirect description of the sound source. Additionally, these methods use a single audio-visual tuple at a time during self-supervised learning, so the model loses the crucial chance to reason about the data distribution of large-scale audio-visual samples. Although Audio-Visual Contrastive Learning (AVCL) can effectively alleviate this issue, a contrastive set constructed by random sampling rests on the assumption that the audio and visual segments from all other videos are not semantically related. Since the resulting contrastive set contains a large number of faulty negatives, we believe that this assumption is too coarse. In this paper, we advocate a novel proposal-based solution that directly localizes the semantic object-level sound source without any manual annotations. The Global Response Map (GRM) is incorporated as an unsupervised spatial constraint to filter out the instances corresponding to a large number of sound-unrelated regions. As a result, our proposal-based Sound Source Localization (SSL) can be cast as a simpler Multiple Instance Learning (MIL) problem. To overcome the limitation of random sampling in AVCL, we propose a novel Active Contrastive Set Mining (ACSM) strategy that mines contrastive sets with informative and diverse negatives for robust AVCL. Our approaches achieve state-of-the-art (SOTA) performance compared with several baselines on multiple SSL datasets with diverse scenarios.
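
To make the faulty-negative problem concrete, the following is a minimal sketch of a generic InfoNCE-style audio-visual contrastive loss, written in plain Python. It is not the ACSM procedure, whose details the abstract does not give. The function name `info_nce`, the `temperature` value, and the `ignore` argument are hypothetical and chosen for illustration only. Each audio embedding is contrasted against the visual embeddings of all videos in the batch, and `ignore` lets a caller drop suspected faulty negatives from the contrastive set of a given anchor.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so that dot() gives cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(audio, visual, temperature=0.1, ignore=None):
    """Mean audio-to-visual InfoNCE loss over a batch (illustrative sketch).

    audio[i] and visual[i] are embeddings from the same video (the positive pair).
    ignore is an optional set of (i, j) pairs with i != j; visual[j] is then
    left out of the negatives of anchor i (a hypothetical faulty-negative mask).
    """
    ignore = ignore or set()
    a = [normalize(x) for x in audio]
    v = [normalize(x) for x in visual]
    total = 0.0
    for i in range(len(a)):
        # Keep the positive (j == i) and every negative that is not masked out.
        logits = {
            j: dot(a[i], v[j]) / temperature
            for j in range(len(v))
            if j == i or (i, j) not in ignore
        }
        # Numerically stable log-sum-exp over the contrastive set.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += log_denom - logits[i]
    return total / len(a)


# Toy batch: videos 0 and 1 are semantically similar, video 2 is different.
audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
visual = [[1.0, 0.1], [1.0, 0.0], [0.1, 1.0]]

plain = info_nce(audio, visual)                             # random-sampling assumption
masked = info_nce(audio, visual, ignore={(0, 1), (1, 0)})   # faulty negatives removed
print(plain, masked)
```

Under the random-sampling assumption, videos 0 and 1 penalize each other even though they are semantically related. Removing those pairs from the contrastive set lowers the loss for this batch, which is the kind of effect a mined contrastive set is meant to achieve. How such pairs are identified in practice is left open here.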

SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Self-supervised Contrastive Learning for Audio-Visual Action Recognition

Sequential Contrastive Audio-Visual Learning

Enhancing Audio-Visual Association with Self-Supervised Curriculum Learning

Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval

Audio-Visual Contrastive Learning with Temporal Self-Supervision

Enhancing Sound Source Localization via False Negative Elimination

Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning

Cross-Modal Contrastive Representation Learning for Audio-to-Image Generation

Robust Audio-Visual Contrastive Learning for Proposal-based Self-supervised Sound Source Localization in Videos

Audio-Visual Contrastive and Consistency Learning for Semi-Supervised Action Recognition

MCL: A Contrastive Learning Method for Multimodal Data Fusion in Violence Detection

Accommodating Audio Modality in CLIP for Multimodal Processing

Audio-Visual Class-Incremental Learning

Improving Spoken Language Understanding with Cross-Modal Contrastive Learning

Audio-visual scene classification via contrastive event-object alignment and semantic-based fusion

Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Contrastive Audio-Visual Masked Autoencoder