CATNet: Cross-modal fusion for audio-visual speech recognition
Xingmei Wang, Jiachen Mi, Boquan Li, Yixu Zhao, Jiaxiang Meng
DOI: https://doi.org/10.1016/j.patrec.2024.01.002
IF: 4.757
2024-01-10
Pattern Recognition Letters
Abstract: Automatic speech recognition (ASR) is a typical pattern recognition technology that converts human speech into text. With the aid of advanced deep learning models, speech recognition performance has improved significantly. In particular, emerging Audio-Visual Speech Recognition (AVSR) methods achieve satisfactory performance by combining audio and visual information. However, complex environments, especially noisy ones, limit the effectiveness of existing methods. To address the noise problem, we propose a novel cross-modal audio-visual speech recognition model named CATNet. First, we devise a cross-modal bidirectional fusion model to analyze the close relationship between the audio and visual modalities. Second, we propose an audio-visual dual-modal network to preprocess audio and visual information, extract significant features, and filter redundant noise. The experimental results demonstrate the effectiveness of CATNet, which achieves excellent word error rate (WER), character error rate (CER), and convergence speed, outperforms other benchmark models, and overcomes the challenge posed by noisy environments.
computer science, artificial intelligence
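
The abstract reports results in terms of WER and CER. As a minimal sketch of how these two metrics are commonly computed (not the authors' evaluation code), the example below uses the standard edit-distance definition: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. The function names `edit_distance`, `wer`, and `cer` are hypothetical, and the choice to count spaces as characters in CER is an assumption; evaluation toolkits differ on text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to transform `hyp` into `ref`.
    """
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (item missing from hyp)
                curr[j - 1] + 1,     # insertion (extra item in hyp)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.

    Assumption: spaces are counted as characters.
    """
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER: {wer(ref, hyp):.3f}")
    print(f"CER: {cer(ref, hyp):.3f}")
```

Lower values are better for both metrics, and either can exceed 1.0 when the hypothesis contains many insertions.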