Abstract: Whispered speech is not adequately handled by mainstream speech applications, because systems built for normal speech do not work as expected on whispers. A first step toward a speech application that includes whispered speech is reliable classification of whispered versus normal speech. Such a front-end classifier should have high accuracy and low computational overhead, and building one is the scope of this paper. A defining characteristic of whispered speech is the absence of the fundamental frequency (pitch), and hence of the pitch harmonics. The presence of pitch and its harmonics in normal speech, and their absence in whispered speech, is evident in the spectral envelope of the Fourier transform. We observe that this difference is most pronounced in the first quarter of the spectrum and exploit it as a feature. We propose one-dimensional convolutional neural networks (1D-CNN) to capture these features from the quartered spectral envelope (QSE). The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset. The proposed feature is compared with Mel frequency cepstral coefficients (MFCC), a staple in the speech domain. The proposed classification system is also compared with the state-of-the-art system, which uses log-filterbank energy (LFBE) features with a long short-term memory (LSTM) network. The proposed 1D-CNN system performs as well as or better than the state of the art across multiple experiments. It also converges sooner, with lower computational overhead. Finally, the proposed system is evaluated in the presence of white noise at various signal-to-noise ratios and found to be robust.
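The feature idea and the noise evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the spectral envelope is computed, so the code below simply assumes a Hann-windowed log-magnitude spectrum and takes "first quarter" to mean the lowest quarter of the bins between 0 Hz and the Nyquist frequency. The function names `quartered_spectrum` and `add_white_noise`, the FFT size, and the 16 kHz sampling rate are all hypothetical choices made for illustration.

```python
import numpy as np


def quartered_spectrum(frame, n_fft=512):
    """Log-magnitude of the lowest quarter of the FFT bins of one frame.

    Assumption: a plain Hann-windowed magnitude spectrum stands in for
    the paper's spectral envelope, whose exact computation is not given.
    """
    windowed = frame * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed, n=n_fft))  # bins from 0 Hz to Nyquist
    quarter = magnitude[: len(magnitude) // 4]          # keep the lowest quarter
    return np.log(quarter + 1e-10)                      # small offset avoids log(0)


def add_white_noise(signal, snr_db, rng):
    """Add Gaussian white noise so the result has roughly the requested SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000                                  # assumed sampling rate
    t = np.arange(400) / fs                     # one 25 ms frame
    # Crude stand-in for voiced speech: a 150 Hz tone plus two harmonics.
    voiced = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in (1, 2, 3))
    noisy = add_white_noise(voiced, snr_db=10.0, rng=rng)
    print(quartered_spectrum(voiced).shape, quartered_spectrum(noisy).shape)
```

With a 512-point FFT at 16 kHz, the retained bins cover roughly 0 to 2 kHz, the region where pitch harmonics of voiced speech would be expected to appear. A vector like this, computed per frame, is the kind of input a 1D-CNN could consume, though the network architecture itself is not described in the abstract and is not sketched here.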

Whisper to Normal Speech Based on Deep Neural Networks with MCC and F0 Features

Attention-Guided Generative Adversarial Network for Whisper to Normal Speech Conversion

Quartered Spectral Envelope and 1D-CNN-based Classification of Normally Phonated and Whispered Speech

End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network

Parameterization of Dominant Spectral Peak Trajectory for Whisper Speech Recognition

Deep Neural Network Based Noised Asian Speech Enhancement and Its Implementation on a Hearing Aid App

Deep Neural Network Based Voice Conversion with A Large Synthesized Parallel Corpus

MaskCycleGAN-based Whisper to Normal Speech Conversion

WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions

Spectral Conversion Using Deep Neural Networks Trained with Multi-Source Speakers

Whisper-to-speech Conversion Using Restricted Boltzmann Machine Arrays

Convolutional Maxout Neural Networks for Speech Separation

Modeling F0 Trajectories in Hierarchically Structured Deep Neural Networks

Noise-robust voice conversion using adversarial training with multi-feature decoupling

A noise-robust voice conversion method with controllable background sounds

Robust F0 Modeling for Mandarin Speech Recognition in Noise

DMF-Net: A decoupling-style multi-band fusion model for full-band speech enhancement

Voice Conversion Using Deep Neural Networks with Layer-Wise Generative Training

End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training

A regression approach to speech enhancement based on deep neural networks

Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models