Abstract: This paper presents Team Xaiofei's innovative approach to exploring Face-Voice Association in Multilingual Environments (FAME) at ACM Multimedia 2024. We focus on the impact of different languages in face-voice matching by building upon Fusion and Orthogonal Projection (FOP), introducing four key components: a dual-branch structure, dynamic sample pair weighting, robust data augmentation, and a score polarization strategy. Our dual-branch structure serves as an auxiliary mechanism to better integrate and provide more comprehensive information. We also introduce a dynamic weighting mechanism for various sample pairs to optimize learning. Data augmentation techniques are employed to enhance the model's generalization across diverse conditions. Additionally, a score polarization strategy based on age and gender matching confidence clarifies and accentuates the final results. Our methods demonstrate significant effectiveness, achieving an equal error rate (EER) of 20.07 on the V2-EH dataset and 21.76 on the V1-EU dataset.
What problem does this paper attempt to address?
This paper addresses robust face-voice matching in a multilingual environment. Specifically, the researchers study how the spoken language influences the association between facial and voice features, and they propose methods to improve the accuracy of cross-modal verification. The specific problems and the solutions proposed in the paper are as follows:
### 1. Research Background and Problems
In pattern recognition, the face and the voice are important biometric features that carry rich identity information. In a multilingual environment, however, face-voice matching faces challenges such as language differences and environmental noise. The researchers therefore aim to improve the robustness and accuracy of face-voice matching by improving existing methods.
### 2. Main Problems
- **Multilingual Influence**: The voice features of different languages may affect the performance of face-voice matching.
- **Cross-modal Verification**: It is necessary to determine whether the face and voice in a given sample belong to the same person.
- **Generalization Ability in Complex Scenarios**: The model needs to maintain good performance in different languages and environments.
### 3. Solutions
To solve the above problems, the paper proposes the following four key technical components:
1. **Dual-branch Structure**:
- By introducing a dual-branch structure, the model can better integrate and provide more comprehensive information. One branch is based on the pre-trained FOP (Fusion and Orthogonal Projection), and the other branch is an updated FOP.
- The dual-branch structure fuses the output results through a learnable attention layer, enhancing the model's ability to process multi-modal data.
2. **Dynamic Sample Pair Weighting**:
- A dynamic weighting mechanism adjusts the weight of each sample pair according to its similarity, so that the model pays more attention to challenging sample pairs.
- By adjusting the weights of positive and negative sample pairs, the loss function is optimized to improve the accuracy of classification tasks.
3. **Robust Data Augmentation**:
- Data augmentation techniques generate more diverse training samples by breaking the original pairing relationships, simulating more realistic scenarios.
- By randomly generating additional training sample pairs, the generalization ability of the model is enhanced.
4. **Score Polarization Strategy**:
- Adjust the final score based on the matching confidence of age and gender to clarify and highlight the final result.
- By setting thresholds and polarization factors, the prediction performance of the model is optimized.
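
Three of the components above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the sigmoid form of the pair weight, the `margin`, `sharpness`, `threshold`, and `factor` defaults, and the convention that scores lie in [0, 1] with higher values meaning "same person" are all assumptions made for illustration. The attention logits stand in for parameters that would be learned during training.

```python
import math


def fuse_branches(frozen_emb, updated_emb, attn_logits):
    """Blend two branch embeddings using softmax attention weights.

    attn_logits is a pair of numbers standing in for learned parameters
    (hypothetical; the summary only says the layer is learnable).
    """
    peak = max(attn_logits)
    exps = [math.exp(logit - peak) for logit in attn_logits]
    total = sum(exps)
    w_frozen, w_updated = (e / total for e in exps)
    return [w_frozen * a + w_updated * b for a, b in zip(frozen_emb, updated_emb)]


def pair_weight(similarity, is_positive, sharpness=5.0, margin=0.5):
    """Give hard pairs a larger loss weight (assumed sigmoid form).

    Hard positives are same-person pairs with low similarity;
    hard negatives are different-person pairs with high similarity.
    """
    signed = (margin - similarity) if is_positive else (similarity - margin)
    return 1.0 / (1.0 + math.exp(-sharpness * signed))


def polarize_score(score, attr_match_conf, threshold=0.8, factor=2.0):
    """Push a match score toward 0 or 1 when attribute evidence is confident.

    attr_match_conf is an assumed confidence in [0, 1] that the age and
    gender predicted from the face and from the voice agree.
    """
    if attr_match_conf >= threshold:        # confident agreement: move toward 1
        return 1.0 - (1.0 - score) / factor
    if attr_match_conf <= 1.0 - threshold:  # confident disagreement: move toward 0
        return score / factor
    return score                            # ambiguous evidence: leave unchanged


if __name__ == "__main__":
    print(fuse_branches([1.0, 0.0], [0.0, 1.0], [0.0, 0.0]))
    print(pair_weight(0.1, is_positive=True), pair_weight(0.9, is_positive=True))
    print(polarize_score(0.6, attr_match_conf=0.9))
```

In this sketch a larger `factor` polarizes more aggressively, and a `threshold` closer to 1 restricts the adjustment to pairs with very confident attribute evidence.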
### 4. Experimental Results
Through experiments on the MAV-Celeb dataset, the paper demonstrates the effectiveness of the proposed method. In particular, on the V2-EH and V1-EU datasets, the model achieved equal error rates (EER) of 20.07% and 21.76% respectively, significantly outperforming the baseline model.
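
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it can be estimated from verification scores, assuming that a higher score means "same person" and that label 1 marks a genuine face-voice pair. The function name and the simple threshold sweep are illustrative choices, not the challenge's official scoring script.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER by sweeping a decision threshold over the scores.

    A pair is accepted when its score is >= the threshold.
    labels: 1 for same-identity pairs, 0 for different-identity pairs.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative pair")

    best_gap, eer = float("inf"), None
    for threshold in sorted(set(scores)):
        # False acceptance: a different-identity pair scored at or above the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False rejection: a same-identity pair scored below the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]
    labels = [1, 1, 1, 0, 0, 0]
    print(f"EER = {equal_error_rate(scores, labels):.4f}")
```

If the system outputs distances rather than similarities, so that lower means "same person", the scores can be negated before calling the function.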
### Summary
By introducing the dual-branch structure, dynamic sample pair weighting, robust data augmentation, and a score polarization strategy, the paper improves the robustness and accuracy of face-voice matching in a multilingual environment. These methods improve the model's performance and also enhance its generalization ability in complex environments.