The XMU System for Audio-Visual Diarization and Recognition in MISP Challenge 2022.

Gaopeng Xu, Xianliang Wang, Sang Wang, Junfeng Yuan, Wei Guo, Wei Li, Jie Gao
DOI: https://doi.org/10.1109/icassp49357.2023.10095693
2023-01-01
Abstract: In this paper, we present our work in track 2 of the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. We built a cascaded system and explored different acoustic front-ends and multimodal end-to-end speech recognition back-ends. To promote effective fusion between the different modalities, we introduced a multi-level feature fusion network. With several additional strategies, our system achieved a concatenated minimum-permutation character error rate (cpCER) of 31.88% on the evaluation set, ranking 3rd in the competition.
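The cpCER metric named in the abstract can be illustrated with a small sketch. It assumes the commonly used definition: each speaker's utterances are concatenated, the character-level edit distance is summed over speakers for every reference-to-hypothesis speaker assignment, and the smallest total is divided by the total number of reference characters. The function names `edit_distance` and `cp_cer` are hypothetical, and this is not the challenge's official scoring tool.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cp_cer(ref_by_speaker, hyp_by_speaker):
    """Brute-force cpCER for one session (assumed definition, see lead-in).

    Each argument is a list with one entry per speaker; each entry is that
    speaker's list of utterance strings in time order.
    """
    refs = ["".join(utts) for utts in ref_by_speaker]
    hyps = ["".join(utts) for utts in hyp_by_speaker]

    # Pad with empty speakers so both sides have the same speaker count.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))

    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference transcript is empty")

    # Try every speaker assignment and keep the one with the fewest errors.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars


if __name__ == "__main__":
    reference = [["你好", "世界"], ["早上好"]]
    hypothesis = [["早上号"], ["你好世界"]]  # speaker labels swapped, one wrong character
    print(cp_cer(reference, hypothesis))
```

Trying all permutations costs factorial time in the number of speakers, which is acceptable for the handful of speakers in a typical session; an assignment solver would be the usual replacement for larger speaker counts.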