Parameterization of Dominant Spectral Peak Trajectory for Whisper Speech Recognition

Chang Feng,Xiaolong Wu,Mingxing Xu,Thomas Fang Zheng
DOI: https://doi.org/10.23919/apsipaasc55919.2022.9980259
2022-01-01
Abstract:Automatic speech recognition (ASR) systems trained on normal speech generally suffer from performance degradations for whisper speech. To solve this problem, this paper concentrates on utilizing similar factors between normal and whisper speech to construct a whisper speech recognizer with normal speech data. We propose to parameterize the dominant spectral peak trajectory (Ppeak) to capture the similarities and concatenate it to the traditional Mel-Frequency Cepstral Coefficients (MFCC) and Human Factor Cepstral Coefficients (HFCC), respectively, to form new features. The proposed features benefit to the accuracy of whisper speech recognition. Performance improvement can be further achieved when the similarity is enhanced by removing low frequency information. Experimental results show that the performance degradation between match and mismatch scenarios was reduced relatively by 90.31% in Word Error Rate (WER) for HFCC after similarity enhancement at a cut-off frequency of 500Hz. Furthermore, we ultimately achieved a relative reduction of 69.60% in WER in the mismatch scenario compared with conventional MFCC even without whisper speech data for training.
What problem does this paper attempt to address?