Abstract: Humans usually perceive the world through information in different modalities, e.g., vision and hearing. By leveraging the relevance and complementarity between audio and vision, humans can clearly distinguish different sound sources and infer which object is making a sound. Machines, for their part, have been shown capable of processing audio and visual information separately using deep neural networks. But can they benefit from joint audiovisual learning? Recent works mainly focus on establishing multi-modal relationships based on temporally synchronized audio and visual signals [1, 3, 8]. This synchronization works effectively for simple scenes [2, 9], i.e., single-source conditions. However, in unconstrained videos, various sounds are usually mixed, and scene-level supervision is too coarse to provide precise alignment between each sound and its visual source. To tackle this problem, [6, 7] establish audiovisual clusters to associate sound-object pairs, but require a pre-determined number of clusters, which is difficult to set in unconstrained scenarios and thus greatly affects alignment performance. [2, 9, 11] further apply audiovisual learning to sound localization, but mainly focus on simple scenes and are usually unable to find source-specific objects from mixed audio. [13] constructs a pretext task and then localizes sound through the energy of each pixel. To sum up, existing dominant methods mostly lack the ability to analyze complex audiovisual scenes, and fail to effectively utilize the latent alignment between sound and visual source pairs in unconstrained videos. This is because complex audiovisual scene analysis poses two major challenges: one is how to distinguish different sound sources; the other is how to ensure that the established sound-object alignment is satisfactory without one-to-one annotations. To address these challenges, we develop a two-stage audiovisual learning framework.
In the first stage, we employ a multi-task framework consisting of classification and audiovisual correspondence to provide a reference of audiovisual content for the second stage. In the second stage, based on the classification predictions, we use Class Activation Mapping (CAM) [14, 10, 4] to extract class-specific feature representations as the potential sound-object pairs (Fig. 1), then perform alignment in
[Fig. 1 panel labels: Shouting, Boating, Stream, Audiovisual Pair]
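The CAM-based extraction step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a backbone that ends in global average pooling followed by a bias-free linear classifier, and the names `class_activation_maps`, `class_specific_features`, and the top-2 class selection are hypothetical choices made only for illustration.

```python
import numpy as np

def class_activation_maps(features, fc_weights):
    """Compute one activation map per class.

    features:   (C, H, W) feature maps from the last conv layer.
    fc_weights: (K, C) weights of the linear classifier applied after
                global average pooling (bias-free, by assumption).
    Returns an array of shape (K, H, W).
    """
    return np.einsum("kc,chw->khw", fc_weights, features)

def class_specific_features(features, cams, class_ids):
    """Pool the feature maps under each selected class's CAM.

    Each CAM is clipped to its positive part and normalized to sum to 1,
    then used as spatial weights. Returns (len(class_ids), C).
    """
    reps = []
    for k in class_ids:
        cam = np.maximum(cams[k], 0.0)            # keep positive evidence only
        total = cam.sum()
        if total > 0:
            mask = cam / total
        else:                                     # fall back to uniform pooling
            mask = np.full_like(cam, 1.0 / cam.size)
        reps.append(np.einsum("chw,hw->c", features, mask))
    return np.stack(reps)

# Toy example with random data (shapes are illustrative assumptions).
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 4))         # C=8 channels, 4x4 grid
fc_weights = rng.standard_normal((5, 8))          # K=5 classes

cams = class_activation_maps(features, fc_weights)
logits = fc_weights @ features.mean(axis=(1, 2))  # classifier output after GAP
predicted = np.argsort(logits)[-2:]               # hypothetical: keep top-2 classes
reps = class_specific_features(features, cams, predicted)
```

Under these assumptions, each row of `reps` is a feature vector focused on the image regions that support one predicted class, which is the kind of class-specific representation that could then be paired with a corresponding audio representation for alignment.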
