Abstract: Active speaker detection (ASD) is the task of detecting which of the visible human instances in a video is speaking. Existing methods widely employ the same audiovisual fusion approach, concatenation. Although this fusion approach is often argued to enhance performance, the two feature modalities do not play an equal role, and concatenation forces the backend network to focus on learning intramodal rather than intermodal features. Another concern is that, because concatenation doubles the dimension of the fused feature fed from the audio and video modules, it creates a higher computational overhead for the backend network. To address these problems, this work hypothesizes that, instead of relying on a deterministic fusion operation, employing an efficient fusion technique may help the network learn efficiently and improve detection accuracy. This work proposes an efficient audiovisual fusion (AVF) with fewer feature dimensions that captures the correlations between facial regions and sound signals, focusing more on the discriminative facial features and associating them with the corresponding audio features. Furthermore, previous ASD works focus only on improving ASD performance, at the cost of a large computational overhead from complex techniques such as adding sophisticated postprocessing, applying smoothing techniques on the classifier to refine the network outputs at multiple stages, or assembling multiple network outputs. This work proposes a simple yet effective end-to-end ASD framework built on the newly proposed feature fusion approach, the AVF. The proposed framework attains a mAP of 84.384% on the validation set of the most challenging audiovisual speaker detection benchmark, AVA-ActiveSpeaker. With this result, it outperforms previous works that do not apply postprocessing and attains competitive detection accuracy compared with works that employ various postprocessing steps. The proposed model also learns better on the unsynchronized raw AVA-ActiveSpeaker dataset. Ablation experiments under different image scale settings and noisy signals show that the AVF is more effective and more robust than the concatenation operation.
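
The dimension argument in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the AVF is computed, so the `gated_fusion` function below is a hypothetical stand-in (a simple element-wise sigmoid gate), not the paper's method; the feature width `D`, the frame count `T`, and all function names are assumptions chosen for illustration. The sketch only shows that concatenation hands the backend a feature of width 2 * D, whereas an element-wise fusion keeps the width at D.

```python
import math
import random


def concat_fusion(audio, video):
    """Concatenate per-frame audio and video features (output width 2 * D)."""
    return [a_frame + v_frame for a_frame, v_frame in zip(audio, video)]


def gated_fusion(audio, video):
    """Hypothetical stand-in for a compact fusion (NOT the paper's AVF).

    Each output element is a convex mix of the audio and video elements,
    weighted by a sigmoid gate of their product, so the width stays D.
    """
    fused = []
    for a_frame, v_frame in zip(audio, video):
        frame = []
        for a, v in zip(a_frame, v_frame):
            gate = 1.0 / (1.0 + math.exp(-(a * v)))
            frame.append(gate * v + (1.0 - gate) * a)
        fused.append(frame)
    return fused


# Assumed sizes for illustration only: T frames, D features per modality.
T, D = 25, 128
rng = random.Random(0)
audio = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
video = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]

fused_cat = concat_fusion(audio, video)
fused_gate = gated_fusion(audio, video)

print("concatenation width:", len(fused_cat[0]))
print("gated fusion width:", len(fused_gate[0]))
```

Any layer in the backend that consumes the fused feature scales with its input width, which is why a fusion that keeps the width at D reduces the overhead the abstract attributes to concatenation.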

Dynamic Ensemble Teacher-Student Distillation Framework for Light-weight Fake Audio Detection

Efficient Audiovisual Fusion for Active Speaker Detection.

End-to-end Spoofing Speech Detection and Knowledge Distillation under Noisy Conditions

Learning From Yourself: A Self-Distillation Method for Fake Speech Detection

FTDKD: Frequency-Time Domain Knowledge Distillation for Low-Quality Compressed Audio Deepfake Detection

Fully Automated End-to-End Fake Audio Detection.

Generalized Fake Audio Detection via Deep Stable Learning

Lightweight Voice Spoofing Detection Using Improved One-Class Learning and Knowledge Distillation

Frequency-mix Knowledge Distillation for Fake Speech Detection

Genuine-Focused Learning using Mask AutoEncoder for Generalized Fake Audio Detection

A lightweight feature extraction technique for deepfake audio detection

Continual Learning for Fake Audio Detection

Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

Advancing Continual Learning for Robust Deepfake Audio Classification

Speaker Recognition-Assisted Robust Audio Deepfake Detection

Towards Robust Audio Deepfake Detection: A Evolving Benchmark for Continual Learning

Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0

DDFAD: Dataset Distillation Framework for Audio Data

Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection

Adaptive Fake Audio Detection with Low-Rank Model Squeezing