Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder

Chong Peng, Liqiang He, Dan Su
2024-04-15
Abstract: There have been many achievements in learning the association between voice and face. However, most previous models rely on cosine similarity or L2 distance to evaluate the likeness of voices and faces following contrastive learning, and then apply that score to retrieval and matching tasks. This approach treats the embeddings only as high-dimensional vectors and uses a minimal scope of the available information. This paper introduces a novel framework within an unsupervised setting for learning voice-face associations. By employing a multimodal encoder after contrastive learning and addressing the problem through binary classification, we can learn the implicit information within the embeddings in a more effective and varied manner. Furthermore, by introducing an effective pair selection method, we enhance the learning outcomes of both contrastive learning and the matching task. Empirical evidence demonstrates that our framework achieves state-of-the-art results in voice-face matching, verification, and retrieval tasks, improving verification by approximately 3%, matching by about 2.5%, and retrieval by around 1.3%.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses several key issues in learning associations between voices and faces, and proposes a new framework, Fuse after Align (FAA), to improve that learning. Specifically, it targets three main problems:

1. **Existing methods rely solely on cosine similarity or L2 distance for matching**: These methods fail to exploit all the correlations contained in the embeddings, which limits model performance.
2. **Single training objective**: Current methods rely mainly on contrastive learning as their only objective, which may leave the model under-trained and limit its ability to learn more diverse relationships from the samples.
3. **Insufficient diversity in training sample selection**: The training samples are not diverse enough and are too easy to learn from, resulting in poor generalization and robustness.

To address these issues, the paper makes the following core contributions:

- **Multimodal encoder**: A multimodal encoder performs cross-modal learning by fusing face and voice features and strengthening cross-modal learning through a self-attention mechanism, so that the relationship between the two modalities is learned more effectively.
- **Hybrid training objectives**: Modality alignment is combined with task learning. Multiple similarity loss functions implement the contrastive learning, and a face-voice matching task trains the multimodal encoder. This lets the model better capture the correspondence between modalities and apply directly to downstream tasks such as verification.
- **Effective sample pair selection method**: Training pairs are chosen through diversified positive sample selection and hard negative sample mining. This makes both contrastive learning and the face-voice matching task more challenging, which improves the model's handling of difficult cases and its generalization.

Experimental results show that the proposed FAA framework outperforms existing methods on face-voice matching, verification, and retrieval; the abstract reports improvements of approximately 3% on verification, about 2.5% on matching, and around 1.3% on retrieval.
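The hard negative sample mining mentioned above can be illustrated with a minimal sketch. The summary does not say how FAA selects its negatives, so this example assumes one common choice: for each voice, pick the face of a different identity whose aligned embedding is most similar by cosine similarity. The function names (`cosine`, `mine_hard_negatives`, `build_matching_pairs`) and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negatives(voice_embs, face_embs, identities):
    """For each voice i, return the index of the most similar face that
    belongs to a different identity (None if no such face exists).

    Assumption: voice_embs[i] and face_embs[i] both belong to identities[i].
    """
    hard = []
    for i, voice in enumerate(voice_embs):
        best_j, best_sim = None, -math.inf
        for j, face in enumerate(face_embs):
            if identities[j] == identities[i]:
                continue  # same identity: this is a positive, not a negative
            sim = cosine(voice, face)
            if sim > best_sim:
                best_j, best_sim = j, sim
        hard.append(best_j)
    return hard


def build_matching_pairs(voice_embs, face_embs, identities):
    """Build (voice_idx, face_idx, label) triples for a binary matching task:
    label 1 for the true pair, label 0 for the mined hard negative."""
    pairs = []
    negatives = mine_hard_negatives(voice_embs, face_embs, identities)
    for i, j in enumerate(negatives):
        pairs.append((i, i, 1))
        if j is not None:
            pairs.append((i, j, 0))
    return pairs


# Toy 2-D embeddings standing in for aligned encoder outputs (illustrative only).
voices = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
faces = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]
ids = ["a", "b", "c"]

for triple in build_matching_pairs(voices, faces, ids):
    print(triple)
```

In this toy data, the hard negative for voice 0 is face 2 rather than face 1, because face 2 lies closer to it in the embedding space. In a framework like the one described, such triples would be passed to the multimodal encoder and a binary classification head; that part is omitted here because the summary gives no architectural details.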