Hearing like Seeing: Improving Voice-Face Interactions and Associations via Adversarial Deep Semantic Matching Network

Kai Cheng, Xin Liu, Yiu-ming Cheung, Rui Wang, Xing Xu, Bineng Zhong
DOI: https://doi.org/10.1145/3394171.3413710
2020-01-01
Abstract: Many cognitive studies have shown that humans may 'see voices' or 'hear faces', and that such an ability can potentially be modeled by machine vision and intelligence. However, this research is still at an early stage. In this paper, we present a novel adversarial deep semantic matching network for efficient voice-face interactions and associations, which learns the correspondence between voices and faces for various cross-modal matching and retrieval tasks. Within the proposed framework, we exploit a simple and efficient adversarial learning architecture to learn cross-modal embeddings between faces and voices; it consists of two subnetworks, a generator and a discriminator. The former is designed to adaptively discriminate the high-level semantic features of voices and faces, with a triplet loss and a multi-modal center loss used in tandem to explicitly regularize the correspondences among them. The latter is further leveraged to bridge the semantic gap between the representations of voice and face data as far as possible, with a focus on maintaining semantic consistency. Through the joint exploitation of the above, the proposed framework pulls representations of voice-face data from the same person closer together while pushing representations of different persons apart. Extensive experiments show empirically that the proposed approach involves fewer parameters and computations, adapts to various cross-modal matching tasks for voice-face data, and brings substantial improvements over state-of-the-art methods.
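The two regularizers named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulations, so the code below uses the standard hinge-style triplet loss on squared Euclidean distances and a simple center loss with one center per identity shared by both modalities. The function names, the margin value, and the shared-center design are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchors: Sequence[Vector],
                 positives: Sequence[Vector],
                 negatives: Sequence[Vector],
                 margin: float = 1.0) -> float:
    """Standard triplet loss, averaged over the batch.

    Each anchor (e.g. a voice embedding) should be closer to its positive
    (a face of the same person) than to its negative (a face of another
    person) by at least `margin`. The margin value is an assumption.
    """
    losses = [
        max(0.0, sq_dist(a, p) - sq_dist(a, n) + margin)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(losses) / len(losses)


def shared_center_loss(voice_embs: Sequence[Vector],
                       face_embs: Sequence[Vector],
                       labels: Sequence[int],
                       centers: Dict[int, Vector]) -> float:
    """Hypothetical multi-modal center loss.

    Both the voice and the face embedding of a sample are drawn toward a
    single center for that sample's identity, so the two modalities share
    one cluster per person.
    """
    total = 0.0
    for v, f, y in zip(voice_embs, face_embs, labels):
        c = centers[y]
        total += sq_dist(v, c) + sq_dist(f, c)
    return total / (2 * len(labels))
```

In this sketch the triplet term separates identities relative to one another, while the center term tightens each identity's cluster across both modalities, which is one plausible reading of using the two losses "in tandem".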