Multi-Task Joint Learning for Embedding Aware Audio-Visual Speech Enhancement

Chenxi Wang, Hang Chen, Jun Du, Baocai Yin, Jia Pan
DOI: https://doi.org/10.1109/ISCSLP57327.2022.10038268
2022-01-01
Abstract: We propose a multi-task joint learning scheme that improves embedding-aware audio-visual speech enhancement by using the phone and the place of articulation together as classification targets when training the embedding extractor and the enhancement network. First, a multimodal embedding is extracted from noisy speech and lip frames under joint supervision from place-of-articulation and phone labels. Next, the embedding extractor and the enhancement network are trained jointly, with the ideal ratio mask, the phone posterior, and the place posterior as learning targets. Experiments on the TCD-TIMIT corpus corrupted by simulated additive noise show that the proposed multimodal embedding at the multi-scale class level is more effective than previous embeddings at the place or phone level, and that the multi-task joint learning framework further improves speech quality and intelligibility.
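The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract names three learning targets (ideal ratio mask, phone posterior, place posterior) but does not state how they are combined, so everything below is an assumption: the mask term uses mean squared error, the two classification terms use cross-entropy, the terms are summed with hypothetical weights `w_phone` and `w_place`, and the mask follows one common definition, sqrt(S / (S + N)) per time-frequency bin. All function names are illustrative and do not come from the paper.

```python
import math

def ideal_ratio_mask(clean_power, noise_power, eps=1e-8):
    """One common IRM definition per time-frequency bin: sqrt(S / (S + N)).
    Assumption: the paper may use a different variant."""
    return [math.sqrt(s / (s + n + eps)) for s, n in zip(clean_power, noise_power)]

def mse(pred, target):
    """Mean squared error between predicted and target mask values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def cross_entropy(posterior, label, eps=1e-8):
    """Negative log-probability assigned to the true class index."""
    return -math.log(posterior[label] + eps)

def multitask_loss(pred_mask, target_mask,
                   phone_post, phone_label,
                   place_post, place_label,
                   w_phone=0.1, w_place=0.1):
    """Weighted sum of the mask loss and the two classification losses.
    w_phone and w_place are hypothetical weights, not values from the paper."""
    return (mse(pred_mask, target_mask)
            + w_phone * cross_entropy(phone_post, phone_label)
            + w_place * cross_entropy(place_post, place_label))

# Toy frame with three frequency bins, three phone classes, two place classes.
target = ideal_ratio_mask([4.0, 1.0, 0.0], [0.0, 1.0, 2.0])
loss = multitask_loss([0.9, 0.7, 0.1], target,
                      [0.7, 0.2, 0.1], 0,
                      [0.6, 0.4], 0)
```

In a real system the mask and posteriors would be network outputs computed per frame, and the loss would be averaged over a batch; the pure-Python lists here only make the arithmetic explicit.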