Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Bin Yu, Zhan Zhang, Ding Zhao, Yuehai Wang
DOI: https://doi.org/10.1109/icicsp55539.2022.10050611
2022-01-01
Abstract: In daily interactions, human speech perception is inherently a multi-modality process. Audio-visual speech enhancement (AV-SE) aims to aid speech enhancement with the help of visual information. However, the fusion strategy of most AV-SE approaches is too simple, resulting in the dominance of the audio modality. The visual modality is usually ignored, especially when the signal-to-noise ratio (SNR) is medium or high. This paper proposes an encoder-decoder-based convolutional neural network for AV-SE with deep multi-modality fusion. The deep multi-modality fusion uses temporal attention to align multi-modality features selectively and preserves the temporal correlation by linear interpolation. The novel fusion strategy can take full advantage of video features, leading to a balanced multi-modality representation. To further improve the performance of AV-SE, a mixed deep feature loss is introduced. Two neural networks are applied to model the characteristics of speech and noise signals, respectively. The experiment conducted on NTCD-TIMIT demonstrates the effectiveness of our proposed model. Compared to an audio-only baseline and simple fusion approaches, our model achieves better performance in objective metrics under all SNR conditions.
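The abstract names two ingredients of the fusion step: temporal attention across modalities and linear interpolation along time. A minimal NumPy sketch of how such a step could look is given below. It is an illustration under stated assumptions, not the authors' architecture: the scaled dot-product attention, the summation of the attended and interpolated video features, the final concatenation, and the function names `interpolate_video` and `temporal_attention_fuse` are all hypothetical choices made here, since the abstract does not specify these details.

```python
import numpy as np


def interpolate_video(video, n_audio):
    """Linearly interpolate video features of shape (T_v, D) onto n_audio time steps."""
    t_v, dim = video.shape
    src = np.linspace(0.0, 1.0, t_v)      # normalized video frame times
    dst = np.linspace(0.0, 1.0, n_audio)  # normalized audio frame times
    # np.interp works on 1-D data, so interpolate each feature channel separately.
    return np.stack([np.interp(dst, src, video[:, d]) for d in range(dim)], axis=1)


def temporal_attention_fuse(audio, video):
    """Fuse audio (T_a, D) and video (T_v, D) features into a (T_a, 2D) array.

    Assumption: scaled dot-product attention with audio frames as queries
    and video frames as keys and values.
    """
    t_a, dim = audio.shape
    scores = audio @ video.T / np.sqrt(dim)       # (T_a, T_v) similarity scores
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    attended = weights @ video                    # (T_a, D) attention-aligned video
    upsampled = interpolate_video(video, t_a)     # (T_a, D) order-preserving upsampling
    visual = attended + upsampled                 # assumption: combine by summation
    return np.concatenate([audio, visual], axis=1), weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio_feats = rng.standard_normal((100, 8))  # e.g. 100 audio frames
    video_feats = rng.standard_normal((25, 8))   # e.g. 25 video frames
    fused, attn = temporal_attention_fuse(audio_feats, video_feats)
    print(fused.shape, attn.shape)
```

In this sketch the attention branch lets each audio frame select the video frames it finds most relevant, while the interpolated branch keeps the video sequence in its original temporal order at the audio frame rate, which is one plausible reading of "preserves the temporal correlation".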