An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

DOI: https://doi.org/10.48550/arXiv.2008.09586

2021-03-13

Abstract: Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.

Audio and Speech Processing, Machine Learning, Image and Video Processing

What problem does this paper attempt to address?

This paper addresses how deep learning can combine audio and visual information to enhance or separate target speech signals in noisy or multi-source environments. It focuses on two related tasks: audio-visual speech enhancement (AV-SE) and audio-visual speech separation (AV-SS), whose goals are to extract one target speech signal or several, respectively, from a mixture of sounds produced by multiple sources. Traditional solutions rely on signal-processing and machine-learning techniques applied only to the acoustic signals, leaving visual information unused. Because visual information, such as the lip movements and facial expressions of the target speakers, is essentially unaffected by the acoustic environment, it can serve as auxiliary input that improves speech enhancement and separation. The paper points out that data-driven methods, especially deep learning, make it possible to fuse audio and visual information efficiently and achieve strong performance. It also notes that the large number of feature-extraction and multimodal fusion techniques proposed in recent years creates the need for a comprehensive overview of deep-learning-based audio-visual speech enhancement and separation. In addition, the paper reviews deep-learning-based methods for speech reconstruction from silent videos and for audio-visual sound source separation of non-speech signals, because these methods can be applied more or less directly to audio-visual speech enhancement and separation. Finally, the paper surveys commonly used audio-visual speech datasets, which are central to the development of data-driven methods, and evaluation methods, which are used to compare systems and determine their performance.

Audio-Visual Speech Enhancement with Deep Multi-modality Fusion

Deep Learning for Visual Speech Analysis: A Survey

Supervised Speech Separation Based on Deep Learning: An Overview

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Audiovisual Singing Voice Separation

An Attention Based Speaker-Independent Audio-Visual Deep Learning Model for Speech Enhancement

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Audio-visual multi-channel speech separation, dereverberation and recognition

End-to-End Audiovisual Fusion with LSTMs

Audio–Visual Deep Clustering for Speech Separation

Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions

An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

Speaker recognition based on deep learning: An overview

Joint Training or Not: An Exploration of Pre-trained Speech Models in Audio-Visual Speaker Diarization

Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect

Look&listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

An Overview of Visual Sound Synthesis Generation Tasks Based on Deep Learning Networks

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention