Abstract: Conventionally, the extraction of hand-crafted acoustic features has been separated from the task of building robust machine-learning models in speech processing. Manual feature engineering is time-consuming and requires specialist knowledge, and the resulting features may not be optimal for the target application. To improve performance, the speech community has adopted raw waveform modeling, in which the model automatically learns an optimized representation of the input. With advances in deep learning (DL), raw waveform modeling has become valuable for tasks such as classification and prediction. This survey aims to fill a gap in the existing literature by providing a comprehensive review of the state of the art in speech and speaker recognition using raw waveform modeling for both adult and children's speech. The article covers papers from 2013–2023 and is the first to review both adult and children's databases. It focuses on the advantages of raw waveform models, presents essential concepts and techniques, and discusses the challenges and limitations of using raw waveforms for speech and speaker recognition in both adult and children's speech. The article also evaluates recent progress in DL architectures such as SincNet, ResNet, and RawNet, and outlines future research directions in the field.

Speech and speaker recognition using raw waveform modeling for adult and children's speech: A comprehensive review