Voice Signal Processing for Machine Learning. The Case of Speaker Isolation

Radan Ganchev
2024-03-29
Abstract: The widespread use of automated voice assistants, along with other recent technological developments, has increased the demand for applications that process audio signals, and the human voice in particular. Voice recognition tasks are typically performed using artificial intelligence and machine learning models. Even though end-to-end models exist, properly pre-processing the signal can greatly reduce the complexity of the task and allow it to be solved with a simpler ML model and fewer computational resources. However, ML engineers who work on such tasks might not have a background in signal processing, which is an entirely different area of expertise.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper explores how speech signal processing is applied in machine learning (ML), with a particular focus on the speaker separation problem. Its core objective is to compare two common signal decomposition methods, the Fourier Transform (FT) and the Wavelet Transform (WT), so that ML engineers can select, fine-tune, and evaluate the most suitable decomposition configuration for a specific ML model.

The paper points out that the standard Fourier Transform may not be effective enough for non-periodic speech signals, hence the need for the Short-Time Fourier Transform (STFT) or the Wavelet Transform, both of which provide information localized in time as well as frequency. It also emphasizes that understanding the frequency range of human speech is important when selecting or fine-tuning a frequency decomposition method for a given speech processing task. In addition, it discusses the metrics used to measure the performance of speech processing tasks: the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI).

The experimental design section addresses the speaker separation problem and covers the input data, experimental procedure, decomposition parameters, and evaluation methods for the different decomposition configurations. The results and analysis section examines the effects of the different settings in order to find the best configuration for speaker separation. In summary, through theoretical analysis and experimental validation, the paper aims to give ML engineers a practical guide for making informed decisions when processing speech signals, especially in complex tasks such as speaker separation.
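Of the metrics named above, SI-SDR is simple enough to illustrate in a few lines. The following is a minimal pure-Python sketch of the commonly used definition, not the paper's own implementation: the function name `si_sdr`, the `eps` guard, and the mean-removal step are assumptions made for this example. The estimate is projected onto the reference to obtain the scaled target, and the remainder is treated as distortion.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB.

    Hypothetical helper for illustration only; `estimate` and `reference`
    are equal-length sequences of float samples.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Assumption: remove the mean of each signal so a DC offset is ignored.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to find the optimal scaling.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)

    # Scaled target component and the residual distortion.
    target = [alpha * r for r in ref]
    distortion = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    return 10.0 * math.log10((target_energy + eps) / (distortion_energy + eps))


# Toy usage: a sine "speaker" plus an orthogonal cosine interferer
# at one tenth of the amplitude.
reference = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
interferer = [math.cos(2 * math.pi * 5 * t / 100) for t in range(100)]
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]
print(f"SI-SDR: {si_sdr(estimate, reference):.2f} dB")
```

In this toy case the distortion carries one hundredth of the target energy, so the result is about 20 dB, and multiplying the estimate by any nonzero constant leaves the score unchanged, which is the sense in which the metric is scale-invariant. PESQ and STOI are considerably more involved and are normally computed with dedicated libraries rather than by hand.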