Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Bin Li, Zhongwei Wu, Yuehai Wang
DOI: https://doi.org/10.1109/icpics55264.2022.9873574
2022-01-01
Abstract: Audio-visual speech recognition (AVSR), which combines acoustic and visual features, has drawn wide attention with the growth of online conferencing. Unfortunately, because audio and visual information often interfere with each other, simply combining them does not reliably improve the recognition rate. In this paper, we describe an improved audio-visual speech recognition network, AVSR-SA-MASK, which fuses the two modalities more effectively and is therefore robust to low-SNR scenes and face-occlusion scenes. The mask fusion module of AVSR-SA-MASK captures correlations along both the primitive and the extracted dimensions of the audio and visual streams. Instead of using a single loss at the training stage, we also introduce a balanced loss to improve modal identification abilities. Experimental results on TCD-TIMIT and LRS2 show a relative improvement in character error rate (CER) of 15% to 31%.
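To make the reported metric concrete, the following is a minimal sketch of how character error rate and a relative improvement are conventionally computed. It assumes the standard definition of CER (character-level edit distance divided by reference length); the function names and the numbers in the usage lines are illustrative assumptions, not values or code from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


# Illustrative values only (hypothetical, not taken from the paper).
example_cer = cer("speech recognition", "speach recognitoin")
example_gain = relative_improvement(baseline=0.20, improved=0.15)
```

Under this definition, a relative improvement of 15% to 31% means the error rate fell by that fraction of the baseline error rate, not by that many absolute percentage points.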