An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen
DOI: https://doi.org/10.48550/arXiv.2008.09586
2021-03-13
Abstract: Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
Audio and Speech Processing, Machine Learning, Image and Video Processing
What problem does this paper attempt to address?
This paper addresses how deep learning can combine audio and visual information to enhance or separate target speech signals in noisy or multi-source environments. It focuses on two related tasks: audio-visual speech enhancement (AV-SE), which extracts a single target speech signal from a mixture of sounds generated by several sources, and audio-visual speech separation (AV-SS), which extracts more than one. Traditional solutions rely on signal processing and machine learning techniques applied to the acoustic signals alone. Because visual information, such as a speaker's lip movements and facial expressions, is essentially unaffected by the acoustic environment, it can serve as auxiliary information for speech enhancement and separation. The paper notes that data-driven approaches, specifically deep learning, have been used to fuse acoustic and visual information efficiently, achieving strong performance. It also observes that the large number of feature extraction and multimodal fusion techniques proposed so far calls for a comprehensive overview of deep-learning-based audio-visual speech enhancement and separation. In addition, the paper reviews deep-learning-based methods for speech reconstruction from silent videos and for audio-visual sound source separation of non-speech signals, because these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, it surveys commonly used audio-visual speech datasets, which are central to the development of data-driven approaches, and evaluation methods, which are used to compare systems and determine their performance.