Multi-Stage Face-Voice Association Learning with Keynote Speaker Diarization

Ruijie Tao, Zhan Shi, Yidi Jiang, Duc-Tuan Truong, Eng-Siong Chng, Massimo Alioto, Haizhou Li
2024-07-25
Abstract: The human brain has the capability to associate an unknown person's voice and face by leveraging their general relationship, a task referred to as "cross-modal speaker verification". This task poses significant challenges due to the complex relationship between the modalities. In this paper, we propose a "Multi-stage Face-voice Association Learning with Keynote Speaker Diarization" (MFV-KSD) framework. MFV-KSD contains a keynote speaker diarization front-end to effectively address the issue of noisy speech inputs. To balance and enhance intra-modal feature learning and inter-modal correlation understanding, MFV-KSD utilizes a novel three-stage training strategy. Our experimental results demonstrated robust performance, achieving the first rank in the 2024 Face-voice Association in Multilingual Environments (FAME) challenge with an overall Equal Error Rate (EER) of 19.9%. Details can be found at https://github.com/TaoRuijie/MFV-KSD.
Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses cross-modal speaker verification: deciding whether a static facial image and a segment of speech belong to the same person. The task poses two main challenges:

1. **Data Issues**: In real-world scenarios, raw audio typically contains conversations or non-speech segments, which interfere with the speaker model's ability to extract accurate embeddings from single-modal inputs. Additionally, traditional speaker diarization methods struggle to distinguish the target speaker's speech segments in the audio, especially when the target speaker's voice is dominant.
2. **Learning Issues**: Effectively modeling the association between facial appearance and voice characteristics is crucial. The model must generate accurate and robust single-modal (face or voice) embeddings and also establish cross-modal alignment. Simple joint modeling tends to overfit and makes it harder to learn both single-modal information and cross-modal correlations.

To address these challenges, the paper proposes the "Multi-stage Face-voice Association Learning with Keynote Speaker Diarization" (MFV-KSD) framework, which combines a keynote speaker diarization front-end with a novel three-stage training strategy. The front-end filters out interfering speech so that the target speaker's embeddings are more accurate and robust. The three training stages are single-modal recognition training, cross-modal correlation training, and adaptive training for the specific competition (the FAME 2024 Challenge); together they balance and enhance single-modal learning and cross-modal alignment.

The proposed method achieved an overall Equal Error Rate (EER) of 19.9% in the FAME 2024 Challenge, ranking first.
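To make the reported metric concrete, the following is a minimal sketch of how an EER can be computed for face-voice verification trials. It is not the authors' implementation: the function names `cosine_similarity` and `equal_error_rate` are hypothetical, the use of cosine similarity as the trial score is an assumption (this summary does not state the scoring function), and the scores in the usage example are toy values.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (assumed scoring function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold over the scores.

    scores: similarity score per face-voice trial (higher means more likely a match)
    labels: 1 if the face and voice belong to the same person, else 0
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one matching and one non-matching trial")

    best_gap, eer = None, None
    # Candidate thresholds: every observed score, plus one above the maximum.
    for threshold in sorted(set(scores)) + [max(scores) + 1.0]:
        # False acceptance rate: non-matching trials scored at or above the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False rejection rate: matching trials scored below the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trials: three matching pairs and three non-matching pairs.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
print(f"EER: {equal_error_rate(toy_scores, toy_labels):.3f}")
```

In practice each score would come from comparing a face embedding with a voice embedding, for example `cosine_similarity(face_embedding, voice_embedding)`. A lower EER is better, and with finitely many trials the value is an approximation taken at the threshold where the false acceptance and false rejection rates are closest.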