Abstract: Learning the association between voices and faces has seen many advances. However, most previous models rely on cosine similarity or L2 distance to measure how alike a voice and a face are after contrastive learning, and then apply that score to retrieval and matching tasks. This approach treats the embeddings only as high-dimensional vectors and uses a small part of the information they carry. This paper introduces a novel framework for learning voice-face associations in an unsupervised setting. By applying a multimodal encoder after contrastive learning and framing the problem as binary classification, the framework learns the implicit information within the embeddings in a more effective and varied manner. In addition, an effective pair selection method improves the learning outcomes of both contrastive learning and the matching task. Empirical evidence shows that the framework achieves state-of-the-art results in voice-face matching, verification, and retrieval, improving verification by approximately 3%, matching by about 2.5%, and retrieval by around 1.3%.
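
For reference, the similarity-based scoring that the abstract contrasts with can be sketched as follows. This is a minimal illustration, not code from the paper: the function names, the toy three-dimensional embeddings, and the two-candidate matching setup are all assumptions made for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def l2_distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def match_voice_to_face(voice, faces):
    """Return the index of the candidate face most similar to the voice."""
    scores = [cosine_similarity(voice, f) for f in faces]
    return max(range(len(faces)), key=scores.__getitem__)

# Toy embeddings (hypothetical values, for illustration only).
voice = [1.0, 0.0, 1.0]
faces = [[0.9, 0.1, 1.1], [-1.0, 0.5, 0.0]]
best = match_voice_to_face(voice, faces)
```

A score of this kind depends only on the geometry of the two vectors, which is the limitation the abstract points to when it proposes a learned binary classifier over fused embeddings instead.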

What problem does this paper attempt to address?

The paper addresses several key issues in learning associations between speech and faces, and proposes a new framework, Fuse after Align (FAA), to improve this associative learning. Specifically, it targets three main problems:

1. **Existing methods rely solely on cosine similarity or L2 distance for matching**: These methods do not exploit all the correlations present in the embeddings, which limits model performance.
2. **Single training objective**: Current methods rely mainly on contrastive learning as their only objective, which may leave the model undertrained and limit its ability to learn more diverse relationships from the samples.
3. **Insufficient diversity in training sample selection**: The training samples lack diversity and are too easy to learn from, resulting in poor generalization and robustness.

To address these issues, the paper makes the following core contributions:

- **Multimodal encoder**: A multimodal encoder performs cross-modal learning by fusing facial and speech features, and a self-attention mechanism strengthens its cross-modal learning capability. This lets the model learn the relationship between the two modalities more effectively.
- **Hybrid training objectives**: The framework combines modality alignment with task learning. It uses multiple similarity loss functions for contrastive learning and introduces a face-speech matching task to train the multimodal encoder. This helps the model capture the correspondence between modalities and apply directly to downstream tasks such as verification.
- **Effective sample pairing selection method**: The paper proposes a method for selecting training sample pairs that includes diversified positive sample selection and hard negative sample mining. This makes both contrastive learning and the face-speech matching task more challenging, improving the model's handling of difficult cases and its generalization.

Experimental results show that the proposed FAA framework achieves significant performance improvements in face-speech matching, verification, and retrieval, with clear advantages over existing methods.
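
The contrastive objective and hard negative mining mentioned above can be illustrated with a small sketch. The summary does not specify the paper's loss functions or mining rule, so everything here is an assumption: the example uses a standard InfoNCE-style loss and picks, for each voice, the most similar non-matching face in the batch. The helper names, the temperature value, and the toy embeddings are hypothetical.

```python
import math

def dot(u, v):
    """Dot product, used here as the similarity score."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(sim_row, pos_idx, temperature=0.1):
    """InfoNCE-style loss for one anchor: negative log-softmax of the positive."""
    logits = [s / temperature for s in sim_row]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[pos_idx]

def hardest_negative(sim_row, pos_idx):
    """Index of the non-matching candidate with the highest similarity."""
    candidates = [j for j in range(len(sim_row)) if j != pos_idx]
    return max(candidates, key=lambda j: sim_row[j])

# Toy batch: voices[i] and faces[i] belong to the same (hypothetical) identity.
voices = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
faces = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]

# Voice-to-face similarity matrix: sim[i][j] compares voice i with face j.
sim = [[dot(v, f) for f in faces] for v in voices]

# Hard negatives that could be fed to a matching (binary classification) task.
hard = [hardest_negative(sim[i], i) for i in range(len(voices))]

# Mean contrastive loss over the batch.
loss = sum(info_nce(sim[i], i) for i in range(len(voices))) / len(voices)
```

In a setup like this, the pairs `(voices[i], faces[hard[i]])` would serve as difficult negative examples for a matching classifier, while the matched pairs `(voices[i], faces[i])` serve as positives.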

Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder

Integration of multi-feature fusion and dictionary learning for face recognition

Learning Individual-Specific Dictionaries With Fused Multiple Features For Face Recognition

Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association

Exploring Robust Face-Voice Matching in Multilingual Environments

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

Voice-Face Cross-modal Matching and Retrieval: A Benchmark

Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

Audio-Visual Fusion Based on Interactive Attention for Person Verification

Coarse-to-fine Alignment Makes Better Speech-image Retrieval

Fine Alignment, Flexible Fusion: A Novel Framework of Multi-Model Score Fusion in Face Identification

Speech Fusion to Face: Bridging the Gap Between Human's Vocal Characteristics and Facial Imaging

Multimodal Fusion for Talking Face Generation Utilizing Speech-related Facial Action Units

Multimodal Channel-Mixing: Channel and Spatial Masked AutoEncoder on Facial Action Unit Detection

Cross-Modal Semantic Alignment before Fusion for Two-Pass End-to-End Spoken Language Understanding

Transformer Based Multi-model Fusion for 3D Facial Animation

Learning and Fusing Multimodal Features from and for Multi-task Facial Computing

Enhancing Modal Fusion by Alignment and Label Matching for Multimodal Emotion Recognition

Enhancing Separate Encoding with Multi-layer Feature Alignment for Image-Text Matching

Shape-Former: Bridging CNN and Transformer via ShapeConv for multimodal image matching