Speech Bandwidth Expansion Via High Fidelity Generative Adversarial Networks

Mahmoud Salhab, Haidar Harmanani
2024-07-29
Abstract: Speech bandwidth expansion is crucial for expanding the frequency range of low-bandwidth speech signals, thereby improving audio quality, clarity and perceptibility in digital applications. Its applications span telephony, compression, text-to-speech synthesis, and speech recognition. This paper presents a novel approach using a high-fidelity generative adversarial network. Unlike cascaded systems, our system is trained end-to-end on paired narrowband and wideband speech signals. Our method integrates various bandwidth upsampling ratios into a single unified model specifically designed for speech bandwidth expansion applications. Our approach exhibits robust performance across various bandwidth expansion factors, including those not encountered during training, demonstrating zero-shot capability. To the best of our knowledge, this is the first work to showcase this capability. The experimental results demonstrate that our method outperforms previous end-to-end approaches, as well as interpolation and traditional techniques, showcasing its effectiveness in practical speech enhancement applications.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses **speech bandwidth expansion (BWE)**: converting low-bandwidth speech signals into high-bandwidth speech signals in order to improve audio quality, clarity and perceptibility. The task is also known as **audio super-resolution**, that is, generating high-resolution speech signals from low-resolution inputs.

#### Background and Significance

1. **Wide-ranging applications**: Speech bandwidth expansion is used in several fields, including:
   - **Telephone communication**: Improving voice quality in the Public Switched Telephone Network (PSTN).
   - **Compression technology**: Maintaining high-quality voice through the compression process.
   - **Text-to-speech synthesis**: Generating more natural and clearer voices.
   - **Speech recognition**: Improving the performance of automatic speech recognition systems.
2. **Existing challenges**:
   - Many devices, such as Bluetooth headsets, still use narrowband speech signals.
   - Narrowband speech signals can reduce the performance of automatic speech recognition systems.
   - Traditional methods, such as interpolation, have limited effectiveness across different bandwidth expansion ratios.

#### Main contributions of the paper

1. **A new method**: The paper proposes a method for speech bandwidth expansion based on a high-fidelity generative adversarial network (GAN). Unlike cascaded systems, it is trained end-to-end on paired narrowband and wideband speech, so a single model is trained and used for inference.
2. **Unified model**: A single unified model handles different bandwidth upsampling ratios, without training a separate model for each ratio.
3. **Zero-shot capability**: The model generalizes to bandwidth expansion ratios not seen during training. To the best of the authors' knowledge, this is the first work to demonstrate this capability.
4. **Experimental verification**: In the reported experiments, the method outperforms previous end-to-end approaches, as well as interpolation and traditional techniques, across multiple bandwidth expansion ratios.

#### Mathematical representation

According to the Nyquist-Shannon sampling theorem, when a signal is sampled at a rate of \( F_s \) Hz, the maximum bandwidth \( B \) that can be represented without aliasing is:

\[ B = \frac{F_s}{2} \]

To expand the bandwidth \( B \) by a factor of \( s \), the sampling rate must also increase by a factor of \( s \). For a narrowband speech signal \( x_{\text{low}} \) with sampling rate \( F_{\text{low}} \) and bandwidth \( B_{\text{low}} \), the wideband speech signal \( x_{\text{high}} \) obtained after bandwidth expansion satisfies:

\[ F_{\text{high}} = s \times F_{\text{low}} \]

\[ B_{\text{high}} = s \times B_{\text{low}} \]

\[ |x_{\text{high}}| = s \times |x_{\text{low}}| \]

where \( |x| \) denotes the length of the signal in samples and \( s \) is the upsampling ratio.

In summary, the paper addresses several challenges in speech bandwidth expansion by introducing a high-fidelity GAN-based model, and it reports superior performance over the compared baselines.
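
The sampling relations in the mathematical representation can be checked numerically. The following is a minimal sketch, not the paper's method: it computes \( F_{\text{high}} \) and \( B_{\text{high}} \) for a given ratio \( s \), and uses plain linear interpolation (one of the traditional baselines the summary mentions) only to illustrate that the output has \( s \) times as many samples. The function names, the 8 kHz input rate, and the 440 Hz test tone are assumptions chosen for illustration.

```python
from typing import Tuple

import numpy as np


def nyquist_bandwidth(sample_rate_hz: float) -> float:
    """Maximum alias-free bandwidth, B = F_s / 2."""
    return sample_rate_hz / 2.0


def expansion_targets(f_low_hz: float, s: int) -> Tuple[float, float]:
    """Return (F_high, B_high) for an upsampling ratio s."""
    f_high_hz = s * f_low_hz
    return f_high_hz, nyquist_bandwidth(f_high_hz)


def linear_interp_upsample(x_low: np.ndarray, s: int) -> np.ndarray:
    """Naive interpolation baseline (hypothetical helper, not the GAN).

    The output has s * len(x_low) samples, but interpolation only
    resamples the waveform; it does not synthesize the missing
    high-frequency content that a learned model is meant to restore.
    """
    n_low = len(x_low)
    # Positions of the output samples expressed on the input sample grid.
    t_high = np.arange(s * n_low) / s
    return np.interp(t_high, np.arange(n_low), x_low)


# Illustrative values (assumptions): 8 kHz narrowband input, ratio s = 2.
f_low = 8_000.0
s = 2
f_high, b_high = expansion_targets(f_low, s)

# A 440 Hz test tone, 800 samples long at the narrowband rate.
x_low = np.sin(2 * np.pi * 440.0 * np.arange(800) / f_low)
x_high = linear_interp_upsample(x_low, s)

print(f_high, b_high, len(x_low), len(x_high))
```

Because the interpolated signal carries no new information above \( B_{\text{low}} \), it serves only as a point of comparison for a model that has to generate the upper band.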