Abstract: By observing a scene and listening to the corresponding audio cues, humans can easily recognize where a sound originates. To achieve such cross-modal perception on machines, existing methods localize the sound source using maps obtained by interpolation operations. Because semantic object-level localization is more attractive for prospective practical applications, we argue that these map-based methods offer only a coarse-grained and indirect description of the sound source. Additionally, these methods use a single audio-visual tuple at a time during self-supervised learning, so the model loses the crucial chance to reason about the data distribution of large-scale audio-visual samples. Although Audio-Visual Contrastive Learning (AVCL) can effectively alleviate this issue, a contrastive set constructed by random sampling rests on the assumption that the audio and visual segments from all other videos are not semantically related. Since the resulting contrastive set contains a large number of faulty negatives, we believe this assumption is too coarse. In this paper, we advocate a novel proposal-based solution that directly localizes the semantic object-level sound source without any manual annotations. The Global Response Map (GRM) is incorporated as an unsupervised spatial constraint to filter out instances that correspond to a large number of sound-unrelated regions. As a result, our proposal-based Sound Source Localization (SSL) can be cast as a simpler Multiple Instance Learning (MIL) problem. To overcome the limitation of random sampling in AVCL, we propose a novel Active Contrastive Set Mining (ACSM) method that mines contrastive sets with informative and diverse negatives for robust AVCL. Our approaches achieve state-of-the-art (SOTA) performance compared with several baselines on multiple SSL datasets covering diverse scenarios.
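
The concern about faulty negatives can be illustrated with a minimal sketch. It assumes a standard InfoNCE-style contrastive loss over cosine similarities, which the abstract does not specify; the function names, the toy 2-D embeddings, and the temperature value are all hypothetical and are not taken from the paper. The sketch only shows that a randomly sampled "negative" that is in fact semantically close to the positive inflates the loss for an otherwise well-aligned audio-visual pair.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(audio, positive, negatives, temperature=0.1):
    """InfoNCE loss for one audio anchor (assumed loss form, not the paper's).

    `positive` is the visual embedding paired with `audio`;
    `negatives` is the contrastive set of visual embeddings.
    """
    logits = [cosine(audio, positive) / temperature]
    logits += [cosine(audio, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the positive pair.
    return log_sum - logits[0]


# Toy embeddings (hypothetical values chosen for illustration only).
audio = [1.0, 0.0]
positive = [0.9, 0.1]

# Negatives that really are unrelated to the audio anchor.
clean_negatives = [[0.0, 1.0], [-1.0, 0.2]]
# Same set size, but one "negative" shares the positive's semantics.
faulty_negatives = [[0.0, 1.0], [0.95, 0.05]]

loss_clean = info_nce(audio, positive, clean_negatives)
loss_faulty = info_nce(audio, positive, faulty_negatives)
print(loss_clean, loss_faulty)
```

In this toy setting the loss with the faulty set is much larger than with the clean set, even though the audio-visual pair is equally well aligned in both cases. This is the kind of misleading training signal that a mined contrastive set is meant to avoid.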

Robust Audio-Visual Contrastive Learning for Proposal-based Self-supervised Sound Source Localization in Videos
