Abstract: By observing a scene and listening to the corresponding audio cues, humans can easily recognize where a sound comes from. To achieve such cross-modal perception on machines, existing methods localize the sound source using maps obtained by interpolation operations. Because semantic object-level localization is more attractive for prospective practical applications, we argue that these map-based methods offer only a coarse-grained and indirect description of the sound source. Additionally, these methods use a single audio-visual tuple at a time during self-supervised learning, so the model loses the crucial chance to reason about the data distribution of large-scale audio-visual samples. Although Audio-Visual Contrastive Learning (AVCL) can effectively alleviate this issue, a contrastive set constructed by random sampling rests on the assumption that the audio and visual segments from all other videos are not semantically related. Since the resulting contrastive set contains a large number of faulty negatives, we believe that this assumption is too coarse. In this paper, we advocate a novel proposal-based solution that directly localizes the semantic object-level sound source without any manual annotations. The Global Response Map (GRM) is incorporated as an unsupervised spatial constraint to filter out instances corresponding to a large number of sound-unrelated regions. As a result, our proposal-based Sound Source Localization (SSL) can be cast as a simpler Multiple Instance Learning (MIL) problem. To overcome the limitation of random sampling in AVCL, we propose a novel Active Contrastive Set Mining (ACSM) method that mines contrastive sets with informative and diverse negatives for robust AVCL. Our approaches achieve state-of-the-art (SOTA) performance compared with several baselines on multiple SSL datasets with diverse scenarios.
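
To make the faulty-negative issue concrete, the following is a minimal NumPy sketch of an InfoNCE-style audio-visual contrastive loss in which every other clip in the batch serves as a negative, plus an optional mask that drops suspected faulty negatives. It is not the ACSM procedure, whose details the abstract does not give. The function names, the temperature value, and the similarity-threshold heuristic are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def av_contrastive_loss(audio, visual, temperature=0.1, negative_mask=None):
    """InfoNCE-style loss in the audio -> visual direction (illustrative only).

    audio, visual: (N, D) embeddings; row i of each comes from the same video,
    so (audio[i], visual[i]) is the positive pair and all other rows are
    candidate negatives.
    negative_mask: optional (N, N) boolean array; True keeps visual[j] as a
    negative for audio[i]. The diagonal is always kept (the positive).
    """
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    logits = a @ v.T / temperature
    n = logits.shape[0]
    if negative_mask is None:
        keep = np.ones((n, n), dtype=bool)   # random-sampling assumption
    else:
        keep = negative_mask.copy()
    np.fill_diagonal(keep, True)
    masked = np.where(keep, logits, -np.inf)  # dropped pairs contribute zero
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(masked - row_max).sum(axis=1))
    return float(np.mean(log_denom - np.diag(logits)))

def drop_suspected_false_negatives(audio, threshold=0.9):
    """Hypothetical heuristic (an assumption, not the paper's method): clips
    whose audio embeddings are nearly identical are treated as semantically
    related and are excluded from each other's negative sets."""
    a = l2_normalize(audio)
    return (a @ a.T) < threshold

# Toy batch: clips 0 and 1 share the same sound, so each is a faulty negative
# for the other under plain random sampling.
rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 8))
audio[1] = audio[0]
visual = audio + 0.01 * rng.normal(size=(6, 8))

loss_random = av_contrastive_loss(audio, visual)
mask = drop_suspected_false_negatives(audio)
loss_filtered = av_contrastive_loss(audio, visual, negative_mask=mask)
```

In this toy batch the filtered loss is lower than the unfiltered one, because the duplicated clip no longer penalizes its semantically identical partner. A mining strategy such as the one the abstract proposes would additionally have to select negatives that are informative and diverse, which this sketch does not attempt.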

Sound Localization by Self-Supervised Time Delay Estimation

Real-Time Space 3D Acoustic Location Based on Monte Carlo Algorithm

Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

Unsupervised Sound Localization via Iterative Contrastive Learning

Self-supervised Moving Vehicle Tracking with Stereo Sound

Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer

Mix and Localize: Localizing Sound Sources in Mixtures

Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation

Multitask learning of time-frequency CNN for sound source localization

Multiple Sound Sources Localization from Coarse to Fine

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

Self-supervised object detection from audio-visual correspondence

A Time-domain Unsupervised Learning Based Sound Source Localization Method

Joint Spatio-Temporal-Frequency Representation Learning for Improved Sound Event Localization and Detection

Robust Acoustic Localization Via Time-Delay Compensation and Interaural Matching Filter

Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Robust Audio-Visual Contrastive Learning for Proposal-based Self-supervised Sound Source Localization in Videos

Self-supervised Audio Spatialization with Correspondence Classifier

Learning to Localize Sound Source in Visual Scenes