Abstract: Humans usually perceive the world through information in different modalities, e.g., vision and hearing. By leveraging the relevance and complementarity between audio and vision, humans can clearly distinguish different sound sources and infer which object is making a sound. Machines, in contrast, have been shown capable of processing audio and visual information separately using deep neural networks. But can they benefit from joint audiovisual learning?

Recent works mainly focus on establishing multi-modal relationships based on temporally synchronized audio and visual signals [1, 3, 8]. This synchronization works effectively for simple scenes [2, 9], i.e., single-source conditions. However, in unconstrained videos, various sounds are usually mixed, and scene-level supervision is too coarse to provide precise alignment between each sound and its visual source. To tackle this problem, [6, 7] establish audiovisual clusters to associate sound-object pairs, but require a pre-determined number of clusters, which is difficult to set in unconstrained scenarios and thus greatly affects alignment performance. [2, 9, 11] further apply audiovisual learning to sound localization, but mainly focus on simple scenes and are usually unable to find source-specific objects from mixed audio. [13] constructs a pretext task and then localizes sound through the energy of each pixel. To sum up, existing dominant methods mostly lack the ability to analyze complex audiovisual scenes and fail to effectively utilize the latent alignment between sound and visual source pairs in unconstrained videos. This is because complex audiovisual scene analysis poses two major challenges: one is how to distinguish different sound sources; the other is how to ensure that the established sound-object alignment is satisfactory without one-to-one annotations. To address these challenges, we develop a two-stage audiovisual learning framework.
At the first stage, we employ a multi-task framework consisting of classification and audiovisual correspondence to provide a reference of audiovisual content for the second stage. At the second stage, based on the classification predictions, we use Class Activation Mapping (CAM) [14, 10, 4] to extract class-specific feature representations as the potential sound-object pairs (Fig. 1), then perform alignment in
[Fig. 1: audiovisual pair examples with labels Shouting, Boating, Stream]
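
The CAM step mentioned above can be illustrated with a minimal sketch. The map itself follows the standard CAM definition (a sum of the last convolutional feature maps, weighted by one class's classifier weights). The pooling that turns a map into a class-specific feature vector is only an assumption made for illustration, and the function names `class_activation_map` and `class_specific_feature` are hypothetical. This is not the authors' implementation, and plain Python lists are used only to keep the example dependency-free.

```python
from typing import List

Grid = List[List[float]]   # one H x W spatial map
FeatureMaps = List[Grid]   # C channels from the last convolutional layer


def class_activation_map(features: FeatureMaps, class_weights: List[float]) -> Grid:
    """Standard CAM: sum the feature maps, weighted by one class's classifier weights."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [sum(w * fmap[i][j] for w, fmap in zip(class_weights, features))
         for j in range(width)]
        for i in range(height)
    ]


def class_specific_feature(features: FeatureMaps, cam: Grid) -> List[float]:
    """Assumed pooling step: average each channel using the CAM as spatial weights."""
    # Keep only positive evidence, then normalise so the weights sum to 1.
    positive = [[max(v, 0.0) for v in row] for row in cam]
    total = sum(sum(row) for row in positive)
    if total == 0.0:
        # No positive activation: fall back to uniform (plain average) pooling.
        cells = len(cam) * len(cam[0])
        spatial = [[1.0 / cells for _ in row] for row in cam]
    else:
        spatial = [[v / total for v in row] for row in positive]
    return [
        sum(s * x for s_row, x_row in zip(spatial, fmap) for s, x in zip(s_row, x_row))
        for fmap in features
    ]


# Toy input: 2 channels on a 2 x 2 grid, and a 2-class linear classifier.
features = [
    [[2.0, 0.0], [0.0, 0.0]],   # channel 0 fires in the top-left cell
    [[0.0, 0.0], [0.0, 4.0]],   # channel 1 fires in the bottom-right cell
]
classifier_weights = [
    [1.0, 0.0],   # class 0 relies on channel 0
    [0.0, 1.0],   # class 1 relies on channel 1
]

cams = [class_activation_map(features, w) for w in classifier_weights]
class_features = [class_specific_feature(features, cam) for cam in cams]
```

In this toy case each class yields a map concentrated on a different cell, so the pooled vectors differ per class. Under the assumption made here, such per-class vectors are the kind of candidates that could be paired with audio features for alignment.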

Exploiting Visual Context Semantics for Sound Source Localization

Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation

Learning to Localize Sound Source in Visual Scenes

Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Robust Audio-Visual Contrastive Learning for Proposal-based Self-supervised Sound Source Localization in Videos

Enhancing Sound Source Localization via False Negative Elimination

Multiple Sound Sources Localization from Coarse to Fine

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Localizing Visual Sounds the Easy Way

Self-Supervised Learning for Heterogeneous Audiovisual Scene Analysis

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer

A Two-Stage Framework for Multiple Sound-Source Localization

Sound Source Localization is All about Cross-Modal Alignment

Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling

How to Listen? Rethinking Visual Sound Localization

T-VSL: Text-Guided Visual Sound Source Localization in Mixtures

Unsupervised Sound Localization via Iterative Contrastive Learning

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

Audio-Visual Event Localization by Learning Spatial and Semantic Co-attention

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound