VatLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
Qiushi Zhu,Long Zhou,Ziqiang Zhang,Shujie Liu,Binxing Jiao,Jie Zhang,Lirong Dai,Daxin Jiang,Jinyu Li,Furu Wei
DOI: https://doi.org/10.1109/tmm.2023.3275873
IF: 7.3
2023-01-01
IEEE Transactions on Multimedia
Abstract:Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision, text. How to design a unified framework to integrate different modal information and leverage different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning was not well explored. In this paper, we propose a unified cross-modal representation learning framework VatLM (Visual-Audio-Text Language Model). The proposed VatLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. In order to integrate these three modalities into one shared semantic space, VatLM is optimized with a masked prediction task of unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VatLM on audio-visual related downstream tasks, including audio-visual speech recognition (AVSR), and visual speech recognition (VSR) tasks. Results show that the proposed VatLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VatLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.
computer science, information systems,telecommunications, software engineering