Abstract: Humans usually perceive the world through information in different modalities, e.g., vision and hearing. By leveraging the relevance and complementarity between audio and vision, humans can clearly distinguish different sound sources and infer which object is making a sound. Machines, for their part, have been shown capable of processing audio and visual information separately using deep neural networks. But can they benefit from joint audiovisual learning?

Recent works mainly focus on establishing multi-modal relationships based on temporally synchronized audio and visual signals [1, 3, 8]. This synchronization works effectively for simple scenes [2, 9], i.e., single-source conditions. However, in unconstrained videos, various sounds are usually mixed, and scene-level supervision is too coarse to provide precise alignment between each sound and its visual source. To tackle this problem, [6, 7] establish audiovisual clusters to associate sound-object pairs, but they require a pre-determined number of clusters, which is difficult to set in unconstrained scenarios and thus greatly affects alignment performance. [2, 9, 11] further apply audiovisual learning to sound localization, but mainly focus on simple scenes and are usually unable to find source-specific objects from mixed audio. [13] constructs a pretext task and then localizes sound through the energy of each pixel.

To sum up, existing dominant methods mostly lack the ability to analyze complex audiovisual scenes and fail to effectively utilize the latent alignment between sound and visual source pairs in unconstrained videos. This is because complex audiovisual scene analysis poses two major challenges: one is how to distinguish different sound sources; the other is how to ensure that the established sound-object alignment is satisfactory without one-to-one annotations. To address these challenges, we develop a two-stage audiovisual learning framework.
At the first stage, we employ a multi-task framework consisting of classification and audiovisual correspondence to provide a reference of the audiovisual content for the second stage. At the second stage, based on the classification predictions, we use Class Activation Mapping (CAM) [14, 10, 4] to extract class-specific feature representations as the potential sound-object pairs (Fig. 1), then perform alignment in

[Figure 1: audiovisual pairs; panel labels: Shouting, Boating, Stream]
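
To make the CAM step concrete, the following is a minimal sketch of standard Class Activation Mapping on a single feature map, assuming a network whose final convolutional features are globally average-pooled and fed to a bias-free linear classifier. The function names, the array shapes, and the attention-weighted pooling used to obtain a class-specific vector are illustrative assumptions, not the exact operations of the framework described here.

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Standard CAM: weight each channel by the classifier weight of one class.

    features:   (C, H, W) activations of the last convolutional layer
    fc_weights: (K, C) weights of the linear classifier applied after
                global average pooling (no bias, by assumption)
    returns:    (H, W) activation map for class `class_idx`
    """
    return np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))

def class_specific_feature(features, cam, eps=1e-8):
    """Hypothetical pooling step: use the positive part of the CAM as a
    spatial attention map and average the features under it."""
    attn = np.maximum(cam, 0.0)
    attn = attn / (attn.sum() + eps)
    return (features * attn[None, :, :]).sum(axis=(1, 2))

# Toy inputs standing in for one modality (e.g. a visual feature map).
rng = np.random.default_rng(0)
features = rng.random((8, 4, 4))          # C=8 channels on a 4x4 grid
fc_weights = rng.standard_normal((3, 8))  # K=3 classes

# Classification prediction from global average pooling.
logits = fc_weights @ features.mean(axis=(1, 2))
predicted = int(np.argmax(logits))

# Class-specific map and feature vector for the predicted class.
cam = class_activation_map(features, fc_weights, predicted)
vec = class_specific_feature(features, cam)
```

Under these assumptions, the spatial mean of the CAM equals the logit of the same class, which is a quick sanity check that the map and the classifier agree. The same two functions could be applied to an audio spectrogram feature map to obtain the sound-side representation of a candidate sound-object pair.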
