Multi-Stage Face-Voice Association Learning with Keynote Speaker Diarization

Ruijie Tao, Zhan Shi, Yidi Jiang, Duc-Tuan Truong, Eng-Siong Chng, Massimo Alioto, Haizhou Li

2024-07-25

Abstract: The human brain has the capability to associate the unknown person's voice and face by leveraging their general relationship, referred to as "cross-modal speaker verification". This task poses significant challenges due to the complex relationship between the modalities. In this paper, we propose a "Multi-stage Face-voice Association Learning with Keynote Speaker Diarization" (MFV-KSD) framework. MFV-KSD contains a keynote speaker diarization front-end to effectively address the noisy speech inputs issue. To balance and enhance the intra-modal feature learning and inter-modal correlation understanding, MFV-KSD utilizes a novel three-stage training strategy. Our experimental results demonstrated robust performance, achieving the first rank in the 2024 Face-voice Association in Multilingual Environments (FAME) challenge with an overall Equal Error Rate (EER) of 19.9%. Details can be found at https://github.com/TaoRuijie/MFV-KSD.
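One way to picture the keynote speaker diarization front-end mentioned in the abstract is as a filter over the output of an ordinary diarization system. The sketch below is only an illustration, not the authors' implementation: it assumes the keynote speaker is the label with the longest total speaking time, and the function name `select_keynote_segments` and the `(speaker, start, end)` segment format are hypothetical.

```python
from collections import defaultdict


def select_keynote_segments(segments):
    """Keep only the segments of the most-talkative speaker.

    `segments` is a list of (speaker_label, start_sec, end_sec) tuples, as
    might be produced by any diarization system (hypothetical format).
    ASSUMPTION: the keynote speaker is the one with the longest total
    speaking time; the source page does not confirm this rule.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        # Ignore malformed segments with negative duration.
        totals[speaker] += max(0.0, end - start)

    if not totals:
        return None, []

    keynote = max(totals, key=totals.get)
    kept = sorted((s, e) for spk, s, e in segments if spk == keynote)
    return keynote, kept


if __name__ == "__main__":
    demo = [("A", 0.0, 4.0), ("B", 4.0, 5.5), ("A", 6.0, 9.0), ("C", 9.0, 9.5)]
    speaker, kept = select_keynote_segments(demo)
    print(speaker, kept)
```

Only the retained segments would then be passed to the speaker encoder, so that the voice embedding is not contaminated by other talkers or non-speech audio.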

Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses cross-modal speaker verification: deciding whether a static facial image and a segment of speech belong to the same person. The task faces two main challenges:

1. **Data Issues**: In real-world scenarios, raw audio typically contains conversations or non-speech segments, which interfere with the speaker model's ability to extract accurate embeddings from single-modal inputs. Additionally, traditional speaker diarization methods struggle to distinguish the target speaker's speech segments in the audio, especially when the target speaker's voice is dominant.
2. **Learning Issues**: Effectively modeling the association between facial appearance and voice characteristics is crucial. The model must generate accurate and robust single-modal (face or voice) embeddings and also establish cross-modal alignment. Simple joint modeling methods are prone to overfitting and make it harder to learn single-modal information and cross-modal correlations at the same time.

To address these challenges, the paper proposes the "Multi-stage Face-voice Association Learning with Keynote Speaker Diarization" (MFV-KSD) framework. It consists of a keynote speaker diarization front-end and a novel three-stage training strategy. The front-end filters out interfering speech, making the target speaker's embeddings more accurate and robust. The three training stages are single-modal recognition training, cross-modal correlation training, and adaptation training for the specific competition (the FAME 2024 Challenge); together they balance and enhance single-modal learning and cross-modal alignment.

The proposed method achieved an overall Equal Error Rate (EER) of 19.9% in the FAME 2024 Challenge, where it ranked first.
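For readers unfamiliar with the metric, the Equal Error Rate is the operating point at which the false acceptance rate (different-person pairs accepted) equals the false rejection rate (same-person pairs rejected). The following is a minimal NumPy sketch of one common way to estimate it from a list of face-voice similarity scores. The function name `compute_eer` and the toy scores are illustrative assumptions, not taken from the paper or the challenge's official scoring tool.

```python
import numpy as np


def compute_eer(scores, labels):
    """Estimate the EER from similarity scores.

    scores: higher means "more likely the same person".
    labels: 1 for same-person pairs, 0 for different-person pairs.
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]

    # Use every observed score as a candidate decision threshold.
    thresholds = np.unique(scores)
    # FAR: fraction of different-person pairs accepted (score >= t).
    far = np.array([(neg >= t).mean() for t in thresholds])
    # FRR: fraction of same-person pairs rejected (score < t).
    frr = np.array([(pos < t).mean() for t in thresholds])

    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
    eer, thr = compute_eer(toy_scores, toy_labels)
    print(f"EER = {eer:.2%} at threshold {thr}")  # EER = 25.00% at threshold 0.6
```

In a face-voice verification setting, each score would typically be a similarity (for example, cosine similarity) between a face embedding and a voice embedding. A lower EER is better.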

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

VarASV: Enabling Pitch-variable Automatic Speaker Verification Via Multi-task Learning

Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Audio-visual Speaker Recognition with a Cross-modal Discriminative Network

DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D Facial Animation

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Exploring Robust Face-Voice Matching in Multilingual Environments

Neural Speaker Diarization Using Memory-Aware Multi-Speaker Embedding with Sequence-to-Sequence Architecture

Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection

Multi-Input Multi-Output Target-Speaker Voice Activity Detection For Unified, Flexible, and Robust Audio-Visual Speaker Diarization

Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

Uncertainty-Guided End-to-End Audio-Visual Speaker Diarization for Far-Field Recordings

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments

Integration of audio-visual information for multi-speaker multimedia speaker recognition

Incorporating Spatial Cues in Modular Speaker Diarization for Multi-channel Multi-party Meetings

One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition