Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification

Meng Liu, Kong Aik Lee, Longbiao Wang, Hanyi Zhang, Chang Zeng, Jianwu Dang
DOI: https://doi.org/10.1109/icassp49357.2023.10095883
2023-01-01
Abstract: Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the test scenarios demonstrate that our proposed method achieves around 60% and 20% average relative performance improvement over baseline unimodal and fusion systems, respectively.
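The abstract names a max-feature-map (MFM) operation but does not define it. As a rough illustration only, the sketch below assumes the commonly used form of MFM, in which a feature vector is split into two halves and the element-wise maximum is kept. How the paper embeds this inside its Transformer variant is not stated in the abstract, so the function name `max_feature_map` and the example values are hypothetical.

```python
from typing import List


def max_feature_map(features: List[float]) -> List[float]:
    """Split a feature vector into two halves and keep the element-wise maximum.

    Assumption: this is the generic MFM formulation, not necessarily the
    exact variant used inside the paper's cross-modal boosters.
    """
    if len(features) % 2 != 0:
        raise ValueError("feature dimension must be even")
    half = len(features) // 2
    first, second = features[:half], features[half:]
    # Each output element is the larger of the two competing activations.
    return [max(a, b) for a, b in zip(first, second)]


# Hypothetical 6-dimensional activation; the output has half the dimension.
example = [0.2, -1.0, 3.5, 0.7, 0.4, -2.0]
reduced = max_feature_map(example)
print(reduced)  # [0.7, 0.4, 3.5]
```

Under this assumption, MFM acts as a competitive feature selector: it halves the feature dimension and keeps only the stronger response from each pair.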