Abstract: Voice-face association is generally formulated as a cross-modal cognitive matching problem, and recent attention has been paid to the feasibility of devising computational mechanisms for recognizing such associations. Existing works commonly resort to a combination of contrastive learning and classification-based losses to correlate the heterogeneous data. Nevertheless, this combination relies on the typical features of each category, known as archetypes, and therefore suffers from the weak invariance of modality-specific features within the same identity, which may induce a cross-modal joint feature space with calibration deviations. To tackle these problems, this paper presents an efficient archetype-agnostic framework for reliable voice-face association. First, an Archetype-agnostic Subspace Merging (AaSM) method is designed to perform feature calibration, removing the archetype dependence and facilitating the mutual perception of the data. Further, an efficient Bilateral Connection Re-gauging scheme is proposed to quantitatively screen and calibrate the biased data, namely loose pairs that deviate from the joint feature space. In addition, an Instance Equilibrium strategy is dynamically derived to optimize the training process on loose data pairs and significantly improve data utilization. Through the joint exploitation of these components, the proposed framework can effectively associate voice-face data to benefit various cross-modal cognitive tasks. Extensive experiments verify the superiority of the proposed voice-face association framework and show its competitive performance against state-of-the-art methods.

Learning Discriminative Joint Embeddings for Efficient Face and Voice Association.

Hearing like Seeing: Improving Voice-Face Interactions and Associations via Adversarial Deep Semantic Matching Network

Detach and Enhance: Learning Disentangled Cross-modal Latent Representation for Efficient Face-Voice Association and Matching

Hearing Like Seeing

An Efficient Momentum Framework for Face-Voice Association Learning.

Taking a Part for the Whole: An Archetype-agnostic Framework for Voice-Face Association

Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder

Voice-Face Cross-modal Matching and Retrieval: A Benchmark

Contrastive Learning-based Chaining-Cluster for Multilingual Voice-Face Association

EFT: Expert Fusion Transformer for Voice-Face Association Learning.

Joint Structured Sparsity Regularized Multiview Dimension Reduction for Video-Based Facial Expression Recognition.

Learnable PINs: Cross-Modal Embeddings for Person Identity

Exploring Robust Face-Voice Matching in Multilingual Environments

Facial Representation Extraction by Mutual Information Maximization and Correlation Minimization.

Multi-Task Joint Learning for Embedding Aware Audio-Visual Speech Enhancement

Joint Learning for Face Alignment and Face Transfer with Depth Image

Self-Lifting: A Novel Framework for Unsupervised Voice-Face Association Learning

Enhancing 3d face recognition by combination of voiceprint

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

Learning Semantic Representations via Joint 3D Face Reconstruction and Facial Attribute Estimation

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition