Abstract: Deep learning has transformed the field of speech processing. Models built from multiple processing layers can extract intricate features from speech data, which has driven substantial advances in automatic speech recognition, text-to-speech synthesis, and emotion recognition. These techniques have also opened new avenues for research and innovation, with implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC features and HMMs, to more recent deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. We also cover the speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep learning networks have been applied to these tasks. Finally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient and interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing different approaches, and highlighting future directions and challenges, we hope to inspire further research in this rapidly advancing field.

Deep Learning for Acoustic Modeling in Parametric Speech Generation: A systematic review of existing techniques and future trends

Recent Progresses in Deep Learning based Acoustic Models (Updated)

A Review of Deep Learning Based Speech Synthesis

Deep generative models for musical audio synthesis

A Deep Generative Architecture for Postfiltering in Statistical Parametric Speech Synthesis

A Survey of Deep Learning Audio Generation Methods

A Review of Deep Learning Techniques for Speech Processing

An Acoustic Model for English Speech Recognition Based on Deep Learning

Toward a Better Understanding of Deep Neural Network Based Acoustic Modelling: An Empirical Investigation

A Survey of Deep Learning Techniques in Speech Recognition

Transfer Learning Based Progressive Neural Networks for Acoustic Modeling in Statistical Parametric Speech Synthesis

Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

Distributed Training of Deep Neural Network Acoustic Models for Automatic Speech Recognition: A comparison of current training strategies

Deep Learning for Speech Recognition: Review of State-of-the-Arts Technologies and Prospects

Employing Deep Learning Model to Evaluate Speech Information in Acoustic Simulations of Auditory Implants

Audio representations for deep learning in sound synthesis: A review

Acoustic statistical modeling based new generation speech synthesis technology

An Experimental Study on Speech Enhancement Based on Deep Neural Networks

Acoustic Modeling Based on Deep Learning for Low-Resource Speech Recognition: An Overview