Abstract: Active speaker detection (ASD) refers to detecting the speaking person among the visible human instances in a video. Existing methods widely employ the same audiovisual fusion approach, concatenation. Although this fusion approach is often argued to enhance performance, it must be noted that the two feature modalities do not play an equal role. Concatenation forces the backend network to focus on learning intramodal rather than intermodal features. Another concern is that, because concatenation doubles the dimension of the fused feature fed from the audio and video modules, it creates a higher computational overhead for the backend network. To address these problems, this work hypothesizes that replacing the deterministic fusion operation with an efficient fusion technique may help the network learn efficiently and improve detection accuracy. This work proposes an efficient audiovisual fusion (AVF) with fewer feature dimensions that captures the correlations between facial regions and sound signals, focusing more on the discriminative facial features and associating them with the corresponding audio features. Furthermore, previous ASD works improve performance only at the cost of a large computational overhead, using complex techniques such as adding sophisticated postprocessing, applying smoothing techniques on the classifier to refine the network outputs at multiple stages, or ensembling multiple network outputs. This work proposes a simple yet effective end-to-end ASD framework built on the newly proposed feature fusion approach, the AVF. The proposed framework attains a mAP of 84.384% on the validation set of the most challenging audiovisual speaker detection benchmark, AVA-ActiveSpeaker. It thereby outperforms previous works that did not apply postprocessing and attains competitive detection accuracy compared to works that employed various postprocessing tasks. The proposed model also learns better on the unsynchronized raw AVA-ActiveSpeaker dataset. The ablation experiments under different image scale settings and noisy signals show that the AVF is more effective and more robust than the concatenation operation.
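
The dimension argument above can be illustrated with a minimal sketch. It assumes per-frame audio and visual feature vectors of equal length, and it compares plain concatenation with a simple element-wise gated fusion. The gated fusion is a hypothetical stand-in chosen only because it keeps the feature dimension unchanged; it is not the AVF proposed in this work, whose details are not given in the abstract. All function names and the sizes (128 features, 512 hidden units) are illustrative assumptions.

```python
import math


def concat_fusion(audio, visual):
    """Fuse by concatenation: the output length is len(audio) + len(visual)."""
    return list(audio) + list(visual)


def gated_fusion(audio, visual):
    """Hypothetical stand-in for a dimension-preserving fusion (NOT the paper's AVF).

    Each output feature mixes the matching audio and visual features with a
    sigmoid gate computed from their product, so the output keeps length D.
    """
    if len(audio) != len(visual):
        raise ValueError("audio and visual features must have the same length")
    fused = []
    for a, v in zip(audio, visual):
        gate = 1.0 / (1.0 + math.exp(-a * v))  # value in (0, 1)
        fused.append(gate * v + (1.0 - gate) * a)
    return fused


def dense_weight_count(in_dim, hidden_dim):
    """Number of weights in the first fully connected backend layer (no bias)."""
    return in_dim * hidden_dim


if __name__ == "__main__":
    feature_dim = 128   # assumed size of each modality's feature vector
    hidden_dim = 512    # assumed width of the first backend layer
    audio = [0.1] * feature_dim
    visual = [0.2] * feature_dim

    for name, fuse in (("concat", concat_fusion), ("gated", gated_fusion)):
        fused = fuse(audio, visual)
        weights = dense_weight_count(len(fused), hidden_dim)
        print(name, "fused dim:", len(fused), "first-layer weights:", weights)
```

Under these assumptions, concatenation doubles both the fused dimension and the weight count of the first backend layer, while a dimension-preserving fusion leaves both unchanged.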

Multi-Attention Audio-Visual Fusion Network for Audio Spatialization

Efficient Audiovisual Fusion for Active Speaker Detection.

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Specialty may be better: A decoupling multi-modal fusion network for Audio-visual event localization

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network for Audio-visual Speech Enhancement

Information Fusion in Attention Networks Using Adaptive and Multi-level Factorized Bilinear Pooling for Audio-visual Emotion Recognition

A Multi-level Attention Fusion Network for Weakly Supervised Audio Classification

Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

MA-Stereo: Real-Time Stereo Matching Via Multi-Scale Attention Fusion and Spatial Error-Aware Refinement

Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition

Attention-based Visual-Audio Fusion for Video Caption Generation.

Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition

Deep Audio-Visual Fusion Neural Network for Saliency Estimation.

SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting

Multi-Scale Spatiotemporal Feature Fusion Network for Video Saliency Prediction

SepFusion: Finding Optimal Fusion Structures for Visual Sound Separation

Time-Domain Audio-Visual Speech Separation on Low Quality Videos

Audio-visual speech separation based on joint feature representation with cross-modal attention

Relevance-guided Audio Visual Fusion for Video Saliency Prediction

Multi-Level Signal Fusion for Enhanced Weakly-Supervised Audio-Visual Video Parsing