Adversarial-Metric Learning for Audio-Visual Cross-Modal Matching

Aihua Zheng, Menglan Hu, Bo Jiang, Yan Huang, Yan Yan, Bin Luo
DOI: https://doi.org/10.1109/tmm.2021.3050089
IF: 7.3
2022-01-01
IEEE Transactions on Multimedia
Abstract: Audio-visual matching aims to learn the intrinsic correspondence between images and audio clips. Existing works mainly concentrate on learning discriminative features while ignoring the cross-modal heterogeneity between the audio and visual modalities. To deal with this issue, we propose a novel Adversarial-Metric Learning (AML) model for audio-visual matching. AML aims to generate a modality-independent representation for each person in each modality via adversarial learning, while simultaneously learning a robust similarity measure for cross-modality matching via metric learning. By integrating the discriminative modality-independent representation and robust cross-modality metric learning into an end-to-end trainable deep network, AML can overcome the heterogeneity issue and achieve promising performance for audio-visual matching. Experiments on various audio-visual learning tasks, including audio-visual matching, audio-visual verification and audio-visual retrieval on benchmark data, demonstrate the effectiveness of the proposed AML model. The implementation code is available at https://github.com/MLanHu/AML.
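The abstract pairs two objectives: an adversarial term that makes embeddings modality-independent, and a metric term that ranks matching face/voice pairs above mismatched ones. The sketch below is only an illustration of that general idea, not the authors' implementation (see the linked repository for that). It assumes a triplet-style hinge loss for the metric part and a binary modality discriminator for the adversarial part; the function names, the margin and the weight are all hypothetical.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Assumed metric term: the matching pair must be closer than the
    mismatched pair by at least `margin` (squared Euclidean distance)."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)

def modality_bce(prob_audio, is_audio):
    """Assumed adversarial term: binary cross-entropy of a discriminator
    that predicts whether an embedding came from the audio modality."""
    eps = 1e-12
    p = min(max(prob_audio, eps), 1.0 - eps)  # clamp to keep log finite
    return -math.log(p) if is_audio else -math.log(1.0 - p)

def encoder_objective(metric_loss, discriminator_loss, weight=0.1):
    """Encoders minimise the metric loss while trying to increase the
    discriminator's loss (hypothetical trade-off `weight`)."""
    return metric_loss - weight * discriminator_loss

# Toy 2-D embeddings (made-up values, for illustration only).
voice = [1.0, 0.0]
same_person_face = [0.9, 0.1]
other_person_face = [0.0, 1.0]

well_ranked = triplet_loss(voice, same_person_face, other_person_face)
badly_ranked = triplet_loss(voice, other_person_face, same_person_face)
confused_discriminator = modality_bce(0.5, is_audio=True)
```

In this toy case the correctly ranked triplet incurs no metric loss, while the swapped triplet is penalised. A discriminator that outputs 0.5 cannot tell the modalities apart, which is the state the adversarial term pushes the encoders toward.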
computer science, information systems, telecommunications, software engineering