Audio-visual Speaker Recognition with a Cross-modal Discriminative Network

Ruijie Tao, Rohan Kumar Das, Haizhou Li
DOI: https://doi.org/10.48550/arXiv.2008.03894
2020-08-10
Abstract: Audio-visual speaker recognition is one of the tasks in the recent 2019 NIST speaker recognition evaluation (SRE). Studies in neuroscience and computer science all point to the fact that vision and auditory neural signals interact in the cognitive process. This motivated us to study a cross-modal network, namely voice-face discriminative network (VFNet) that establishes the general relation between human voice and face. Experiments show that VFNet provides additional speaker discriminative information. With VFNet, we achieve 16.54% equal error rate relative reduction over the score level fusion audio-visual baseline on evaluation set of 2019 NIST SRE.
Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to improve the accuracy of speaker recognition when both audio and video are available. Specifically, it targets the audio-visual task of the 2019 Speaker Recognition Evaluation (SRE) held by the National Institute of Standards and Technology (NIST). The task is to verify whether the speakers in a given pair of videos (an enrollment video and a test video) are the same person. To this end, the paper proposes a cross-modal discriminative network, the voice-face discriminative network (VFNet), which models the relationship between human voices and faces and thereby provides additional speaker-discriminative information.

### Background of the Paper

Traditional speaker recognition systems usually rely on audio alone, but in practice adding visual information can improve recognition performance. The 2019 NIST SRE introduced an audio-visual speaker recognition task for the first time, requiring systems to use both audio and visual cues to verify a speaker's identity. However, simply fusing the outputs of separate audio and visual systems does not fully exploit the correlation between the two modalities.

### Method of the Paper

The paper proposes a cross-modal discriminative network (VFNet) that improves speaker recognition by learning the association between voices and faces. The steps are as follows:

1. **Feature Extraction**:
   - Use an x-vector system to extract speaker embeddings.
   - Use the InsightFace system to extract face embeddings.
2. **Network Structure**:
   - **Input Layer**: The network takes two inputs, a speech waveform and a face image, which enter the network as the embeddings extracted in step 1.
   - **Fully-Connected Layers**: The speech and face embeddings are processed separately, each passing through a 256-dimensional fully-connected layer (FC1) followed by a 128-dimensional one (FC2).
   - **Similarity Calculation**: The cosine similarity between the transformed speech embedding and the transformed face embedding is computed.
   - **Output Layer**: A softmax function converts the similarity into a confidence score indicating whether the voice and the face belong to the same person.
3. **Loss Function**:
   - The network parameters are optimized with a cross-entropy loss.

### Experimental Results

The paper reports experiments on the 2019 NIST SRE data set and compares against a baseline that fuses the audio and visual systems at the score level. The system with VFNet improves on the baseline in all three reported metrics:

- **Equal Error Rate (EER)**: relative reduction of 16.54%.
- **Minimum Detection Cost Function (minDCF)**: relative improvement of 2.00%.
- **Actual Detection Cost Function (actDCF)**: relative improvement of 8.83%.

### Conclusion

The paper proposes a cross-modal discriminative network (VFNet) that improves audio-visual speaker recognition by learning the association between voices and faces. The experimental results show that adding VFNet outperforms the score-level fusion baseline on all three evaluation metrics, indicating that the learned voice-face association carries speaker information the baseline does not capture.
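
The network structure and loss described under "Method of the Paper" can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the 512-dimensional input sizes, the ReLU activation, the random weights, and the way the cosine similarity `s` is turned into two softmax logits (`s` and `1 - s`) are all assumptions made here for the example. Only the 256- and 128-dimensional layers, the cosine similarity, the softmax output, and the cross-entropy loss come from the summary above.

```python
import numpy as np

VOICE_DIM = 512  # assumed size of the x-vector speaker embedding
FACE_DIM = 512   # assumed size of the InsightFace face embedding


def init_branch(rng, in_dim, hidden_dim=256, out_dim=128):
    """Random weights for one branch: FC1 (in_dim -> 256), FC2 (256 -> 128)."""
    return {
        "w1": rng.normal(0.0, 0.05, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.05, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def branch_forward(params, x):
    """Apply FC1 with a ReLU (assumed activation), then FC2."""
    hidden = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return hidden @ params["w2"] + params["b2"]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity along the last axis."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den


def match_probability(voice_params, face_params, voice_emb, face_emb):
    """Probability that each voice/face pair belongs to the same person."""
    v = branch_forward(voice_params, voice_emb)
    f = branch_forward(face_params, face_emb)
    s = cosine_similarity(v, f)
    # Assumed mapping from similarity to two logits: [match, non-match].
    logits = np.stack([s, 1.0 - s], axis=-1)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs[..., 0]


def cross_entropy(p_match, labels, eps=1e-12):
    """Mean two-class cross-entropy; labels are 1 (same person) or 0."""
    p = np.clip(p_match, eps, 1.0 - eps)
    return float(np.mean(-(labels * np.log(p) + (1 - labels) * np.log(1.0 - p))))


# Toy batch of four voice/face pairs with made-up embeddings and labels.
rng = np.random.default_rng(0)
voice_params = init_branch(rng, VOICE_DIM)
face_params = init_branch(rng, FACE_DIM)
voice_batch = rng.normal(size=(4, VOICE_DIM))
face_batch = rng.normal(size=(4, FACE_DIM))
labels = np.array([1, 0, 1, 0])

p_match = match_probability(voice_params, face_params, voice_batch, face_batch)
loss = cross_entropy(p_match, labels)
print("match probabilities:", p_match)
print("cross-entropy loss:", loss)
```

In a trained system the weights would be learned by minimizing this loss over matched and mismatched voice/face pairs, and `p_match` would serve as the extra cross-modal score combined with the audio and visual scores.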