Abstract: Audio-visual speaker recognition is one of the tasks in the recent 2019 NIST speaker recognition evaluation (SRE). Studies in neuroscience and computer science all point to the fact that visual and auditory neural signals interact in the cognitive process. This motivated us to study a cross-modal network, namely the voice-face discriminative network (VFNet), that establishes the general relation between human voice and face. Experiments show that VFNet provides additional speaker discriminative information. With VFNet, we achieve a 16.54% relative reduction in equal error rate over the score-level fusion audio-visual baseline on the evaluation set of the 2019 NIST SRE.

What problem does this paper attempt to address?

The problem this paper addresses is how to improve the accuracy of speaker recognition in a multimedia setting. Specifically, it targets the audio-visual track of the 2019 Speaker Recognition Evaluation (SRE) held by the National Institute of Standards and Technology (NIST). The task is to verify whether the speakers in a given pair of videos (an enrollment video and a test video) are the same person. To this end, the paper proposes a cross-modal discriminative network, the voice-face discriminative network (VFNet), which models the relationship between human voices and faces and thereby provides additional speaker-discriminative information.

### Background of the Paper

Traditional speaker recognition systems usually rely on audio alone, but in practical applications, adding visual information can significantly improve recognition performance. The 2019 NIST SRE introduced the audio-visual speaker recognition task for the first time, requiring systems to use both audio and visual cues to verify a speaker's identity. However, simply fusing the outputs of separate audio and visual systems does not fully exploit the correlation between the two modalities.

### Method of the Paper

The paper proposes a cross-modal discriminative network (VFNet), which improves speaker recognition by learning the association between voices and faces. The steps are as follows:

1. **Feature Extraction**:
   - Use the x-vector system to extract speaker embeddings.
   - Use the InsightFace system to extract face embeddings.
2. **Network Structure**:
   - **Input Layer**: The system takes two inputs, a speech waveform and a face image, which are represented by the embeddings extracted in step 1.
   - **Fully-Connected Layers**: The speech and face embeddings are processed separately, each passing through a 256-dimensional and a 128-dimensional fully-connected layer (FC1 and FC2).
   - **Similarity Calculation**: The cosine similarity between the transformed speech embedding and the transformed face embedding is computed.
   - **Output Layer**: A softmax function produces the final confidence score, indicating whether the voice and the face belong to the same person.
3. **Loss Function**:
   - The network parameters are optimized with the cross-entropy loss.

### Experimental Results

The paper reports experiments on the 2019 NIST SRE data set and compares against the baseline system. The system with VFNet improves on the baseline across several metrics, all of which are error measures where lower is better:

- **Equal Error Rate (EER)**: relative reduction of 16.54%.
- **Minimum Detection Cost Function (minDCF)**: relative improvement of 2.00%.
- **Actual Detection Cost Function (actDCF)**: relative improvement of 8.83%.

### Conclusion

The paper proposes a new cross-modal discriminative network (VFNet) that improves audio-visual speaker recognition by learning the association between voices and faces. The experimental results show that VFNet outperforms the baseline system on several evaluation metrics, which supports its effectiveness.
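
The network structure described in the method summary above (two fully-connected layers per modality, cosine similarity, softmax, cross-entropy) can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The summary gives only the layer widths (256 and 128), so the input embedding sizes (`VOICE_DIM`, `FACE_DIM`), the random weight initialization, the placement of ReLU after FC1 only, and the way the cosine score is turned into two softmax logits (`[score, 1 - score]`) are all assumptions made for illustration, as are the function names.

```python
import math
import random

# Assumed input sizes; the summary only specifies the FC widths (256 and 128).
VOICE_DIM = 512  # assumed size of the speaker (x-vector) embedding
FACE_DIM = 512   # assumed size of the face embedding


def make_fc(in_dim, out_dim, rng):
    """Return (weights, bias) for one fully connected layer with random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def fc(x, layer, relu):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, bias = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def init_params(seed=0):
    """Create separate FC1 (256-d) and FC2 (128-d) layers for each modality."""
    rng = random.Random(seed)
    return {
        "voice_fc1": make_fc(VOICE_DIM, 256, rng),
        "voice_fc2": make_fc(256, 128, rng),
        "face_fc1": make_fc(FACE_DIM, 256, rng),
        "face_fc2": make_fc(256, 128, rng),
    }


def forward(voice_emb, face_emb, params):
    """Return (cosine score, [p_same, p_different]) for one voice-face pair."""
    # Assumption: ReLU after FC1 only; the summary does not name the activation.
    v = fc(fc(voice_emb, params["voice_fc1"], relu=True),
           params["voice_fc2"], relu=False)
    f = fc(fc(face_emb, params["face_fc1"], relu=True),
           params["face_fc2"], relu=False)
    score = cosine(v, f)
    # Assumption: two-class softmax over the logits [score, 1 - score].
    logits = [score, 1.0 - score]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return score, [e / total for e in exps]


def cross_entropy(probs, same_person):
    """Cross-entropy loss for one pair; class 0 means 'same person'."""
    target = 0 if same_person else 1
    return -math.log(probs[target] + 1e-12)


# Usage with random stand-ins for real x-vector and InsightFace embeddings.
rng = random.Random(1)
params = init_params()
voice = [rng.gauss(0.0, 1.0) for _ in range(VOICE_DIM)]
face = [rng.gauss(0.0, 1.0) for _ in range(FACE_DIM)]
score, probs = forward(voice, face, params)
print(score, probs, cross_entropy(probs, same_person=True))
```

With untrained random weights the printed score carries no meaning. The sketch only shows how data flows through the layers; in a trained system, the resulting voice-face confidence would be combined with the audio and visual system scores, which this example does not cover.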

Audio-visual Speaker Recognition with a Cross-modal Discriminative Network

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

AudioVSR: Enhancing Video Speech Recognition with Audio Data

HLT-NUS Submission for NIST 2019 Multimedia Speaker Recognition Evaluation

Integration of audio-visual information for multi-speaker multimedia speaker recognition

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Audio-Visual Speaker Verification via Joint Cross-Attention

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Robust end-to-end deep audiovisual speech recognition

The 2021 NIST Speaker Recognition Evaluation

Audio-visual multi-channel speech separation, dereverberation and recognition

Multi-Stage Face-Voice Association Learning with Keynote Speaker Diarization

Visual-Audio Emotion Recognition Based on Multi-Task and Ensemble Learning with Multiple Features

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

STC Speaker Recognition Systems for the VOiCES From a Distance Challenge

CATNet: Cross-modal fusion for audio-visual speech recognition

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

HNet: A deep learning based hybrid network for speaker dependent visual speech recognition

Resource aware design of a deep convolutional-recurrent neural network for speech recognition through audio-visual sensor fusion

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder