Multimodal Pre-Train then Transfer Learning Approach for Speaker Recognition

Summaira Jabeen, Muhammad Shoib Amin, Xi Li
DOI: https://doi.org/10.1007/s11042-024-18575-4
IF: 2.577
2024-01-01
Multimedia Tools and Applications
Abstract: Cognitive science has firmly established a correlation between faces and voices, because the neuro-cognitive pathways that process the two kinds of information share the same structure. Recently, face-voice association has come to the attention of the computer vision community with the introduction of large-scale face-voice data. Our work leverages this shared structure, together with the availability of large-scale face-voice data, to improve speaker recognition tasks, including identification and verification. To this end, we propose two novel multimodal systems, one with weight sharing and one without, that learn joint representations of the two modalities and thereby establish the face-voice association. Features are then extracted from the trained multimodal networks, which capture the face-voice association, and used to perform speaker recognition. We evaluate the proposed networks on speaker recognition and face-voice association tasks using challenging benchmark datasets, including VoxCeleb1 and MAV-Celeb. Our results show that adding facial information improves speaker recognition performance.