Abstract: This paper presents Team Xaiofei's innovative approach to exploring Face-Voice Association in Multilingual Environments (FAME) at ACM Multimedia 2024. We focus on the impact of different languages in face-voice matching by building upon Fusion and Orthogonal Projection (FOP), introducing four key components: a dual-branch structure, dynamic sample pair weighting, robust data augmentation, and a score polarization strategy. Our dual-branch structure serves as an auxiliary mechanism to better integrate and provide more comprehensive information. We also introduce a dynamic weighting mechanism for various sample pairs to optimize learning. Data augmentation techniques are employed to enhance the model's generalization across diverse conditions. Additionally, a score polarization strategy based on age and gender matching confidence clarifies and accentuates the final results. Our methods demonstrate significant effectiveness, achieving an equal error rate (EER) of 20.07 on the V2-EH dataset and 21.76 on the V1-EU dataset.
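
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch, not the challenge's official scoring script. It assumes that higher scores mean "same identity" (a distance-based system would need its scores negated first), and the function name `equal_error_rate` is hypothetical.

```python
from typing import Sequence, Tuple


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for similarity scores; label 1 = same identity, 0 = different."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")

    best_gap, best_eer, best_thr = float("inf"), 1.0, 0.0
    # Sweep every observed score as a candidate decision threshold.
    for thr in sorted(set(scores)):
        # False acceptance: a different-identity pair scored at or above the threshold.
        far = sum(s >= thr for s in negatives) / len(negatives)
        # False rejection: a same-identity pair scored below the threshold.
        frr = sum(s < thr for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
    toy_labels = [1, 1, 1, 0, 0, 0]
    eer, thr = equal_error_rate(toy_scores, toy_labels)
    print(f"EER = {eer:.4f} at threshold {thr}")
```

The function returns a fraction, so a value of 0.2007 would correspond to the 20.07 reported in the abstract.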

What problem does this paper attempt to address?

The problem that this paper attempts to solve is to conduct robust face-voice matching in a multilingual environment. Specifically, researchers are concerned with the influence of different languages on the association of facial and voice features, and propose innovative methods to improve the accuracy of cross-modal verification. The following are the specific problems and solutions proposed in the paper:

### 1. Research Background and Problems

In the field of pattern recognition, the face and voice, as important biometric features, carry rich identity information. However, in a multilingual environment, face-voice matching faces many challenges, such as language differences, environmental noise, etc. Therefore, researchers hope to improve the robustness and accuracy of face-voice matching by improving existing methods.

### 2. Main Problems

- **Multilingual Influence**: Voice features of different languages may affect the effect of face-voice matching.
- **Cross-modal Verification**: It is necessary to determine whether the face and voice in a given sample belong to the same person.
- **Generalization Ability in Complex Scenarios**: The model needs to maintain good performance in different languages and environments.

### 3. Solutions

To solve the above problems, the paper proposes the following four key technical components:

1. **Dual-branch Structure**:
   - By introducing a dual-branch structure, the model can better integrate and provide more comprehensive information. One branch is based on the pre-trained FOP (Fusion and Orthogonal Projection), and the other branch is an updated FOP.
   - The dual-branch structure fuses the output results through a learnable attention layer, enhancing the model's ability to process multi-modal data.
2. **Dynamic Sample Pair Weighting**:
   - A dynamic weighting mechanism is introduced to dynamically adjust the weights according to the similarity of sample pairs, making the model pay more attention to challenging sample pairs.
   - By adjusting the weights of positive and negative sample pairs, the loss function is optimized to improve the accuracy of classification tasks.
3. **Robust Data Augmentation**:
   - Use data augmentation techniques to generate more diverse training samples, break the original pairing relationships, and simulate more actual scenarios.
   - By randomly generating additional training sample pairs, the generalization ability of the model is enhanced.
4. **Score Polarization Strategy**:
   - Adjust the final score based on the matching confidence of age and gender to clarify and highlight the final result.
   - By setting thresholds and polarization factors, the prediction performance of the model is optimized.

### 4. Experimental Results

Through experiments on the MAV-Celeb dataset, the paper demonstrates the effectiveness of the proposed method. In particular, on the V2-EH and V1-EU datasets, the model achieved equal error rates (EER) of 20.07% and 21.76% respectively, significantly outperforming the baseline model.

### Summary

By introducing the dual-branch structure, dynamic sample pair weighting, robust data augmentation, and score polarization strategy, the paper effectively improves the robustness and accuracy of face-voice matching in a multilingual environment. These methods not only improve the performance of the model but also enhance its generalization ability in complex environments.
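
The score polarization step listed among the solutions can be illustrated with a small sketch. The summary does not give the paper's exact formula, so everything here is an assumption made for illustration: the function name `polarize_score`, the default thresholds `high` and `low`, the default `factor`, the convention that the score lies in [0, 1] with higher meaning "same person", and the linear push toward 0 or 1.

```python
def polarize_score(score: float, attr_match_conf: float,
                   high: float = 0.9, low: float = 0.1, factor: float = 0.5) -> float:
    """Push a match score toward 0 or 1 when age/gender agreement is confident.

    score:            face-voice match score in [0, 1], higher = same person (assumed convention).
    attr_match_conf:  confidence in [0, 1] that the age and gender predicted from the
                      face agree with those predicted from the voice (assumed input).
    high, low:        hypothetical confidence thresholds for applying polarization.
    factor:           hypothetical polarization factor in [0, 1]; 0 leaves the score unchanged.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if not 0.0 <= attr_match_conf <= 1.0:
        raise ValueError("attr_match_conf must lie in [0, 1]")

    if attr_match_conf >= high:
        # Confident attribute agreement: move a fraction `factor` of the way toward 1.
        return score + factor * (1.0 - score)
    if attr_match_conf <= low:
        # Confident attribute disagreement: move a fraction `factor` of the way toward 0.
        return score * (1.0 - factor)
    # Uncertain attribute evidence: leave the original score untouched.
    return score


if __name__ == "__main__":
    for conf in (0.95, 0.50, 0.05):
        print(conf, polarize_score(0.6, conf))
```

Leaving mid-confidence cases unchanged is a deliberate choice in this sketch: the attribute classifiers only override the base model's score when they are confident, which limits the damage from age or gender misclassifications.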

Exploring Robust Face-Voice Matching in Multilingual Environments
