Abstract: By observing a scene and listening to the corresponding audio cues, humans can easily recognize where a sound originates. To achieve such cross-modal perception on machines, existing methods localize the sound source using maps obtained by interpolation operations. Because semantic object-level localization is more attractive for prospective practical applications, we argue that these map-based methods offer only a coarse-grained and indirect description of the sound source. Additionally, these methods use a single audio-visual tuple at a time during self-supervised learning, so the model loses the crucial chance to reason about the data distribution of large-scale audio-visual samples. Although Audio-Visual Contrastive Learning (AVCL) can effectively alleviate this issue, a contrastive set constructed by random sampling rests on the assumption that the audio and visual segments from all other videos are not semantically related. Since the resulting contrastive set contains a large number of faulty negatives, we consider this assumption too coarse. In this paper, we advocate a novel proposal-based solution that directly localizes the semantic object-level sound source without any manual annotations. The Global Response Map (GRM) is incorporated as an unsupervised spatial constraint to filter out the instances that correspond to a large number of sound-unrelated regions. As a result, our proposal-based Sound Source Localization (SSL) can be cast as a simpler Multiple Instance Learning (MIL) problem. To overcome the limitation of random sampling in AVCL, we propose Active Contrastive Set Mining (ACSM), which mines contrastive sets with informative and diverse negatives for robust AVCL. Our approaches achieve state-of-the-art (SOTA) performance when compared to several baselines on multiple SSL datasets with diverse scenarios.
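To make the faulty-negative issue concrete, the following is a minimal NumPy sketch of a generic InfoNCE-style audio-visual contrastive loss over a batch. It is not the ACSM procedure, whose details the abstract does not give. The function names `l2_normalize` and `avcl_info_nce`, the `negative_mask` argument, the temperature of 0.07, and the toy embeddings are all assumptions made for illustration. With in-batch (random) negatives, every other video is treated as a negative; the optional mask shows how dropping a known semantically related pair changes the loss.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def avcl_info_nce(audio, visual, temperature=0.07, negative_mask=None):
    """InfoNCE loss where audio[i] and visual[i] form the positive pair.

    negative_mask[i, j] == False removes visual[j] from the negatives of
    audio[i]. The mask is a hypothetical stand-in for any mining strategy;
    with negative_mask=None all other videos in the batch act as negatives.
    """
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    logits = a @ v.T / temperature  # (B, B) similarity matrix
    n = logits.shape[0]
    if negative_mask is None:
        keep = np.ones((n, n), dtype=bool)
    else:
        keep = negative_mask.copy()
    np.fill_diagonal(keep, True)  # the positive pair is always kept
    logits = np.where(keep, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy batch: videos 0 and 1 are given identical embeddings, so each is a
# faulty negative for the other under random sampling.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
visual = audio + 0.1 * rng.normal(size=(4, 8))
audio[1] = audio[0]
visual[1] = visual[0]

loss_random = avcl_info_nce(audio, visual)

mask = np.ones((4, 4), dtype=bool)
mask[0, 1] = mask[1, 0] = False  # drop the known faulty negatives
loss_masked = avcl_info_nce(audio, visual, negative_mask=mask)
```

In this toy batch, removing the faulty negatives lowers the loss, because the positive pair no longer competes with a semantically identical sample in the denominator. In practice the related pairs are unknown without labels, which is why a mining strategy is needed in the first place.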

Robust Audio-Visual Contrastive Learning for Proposal-based Self-supervised Sound Source Localization in Videos
