Vietnamese Speaker Verification With Mel-Scale Filter Bank Energies and Deep Learning
Thi-Thanh-Mai Nguyen, Duc-Dung Nguyen, Chi-Mai Luong
DOI: https://doi.org/10.1109/access.2024.3479092
IF: 3.9
2024-10-19
IEEE Access
Abstract: Mel-frequency cepstral coefficients (MFCCs) have been used extensively as input to both traditional and modern speech processing systems. Their strength lies in a compact representation of the speech signal that captures its essential phonetic content. However, most MFCC energy is concentrated in the low-order coefficients, and the flat distribution of the high-order coefficients makes convolutional operators less sensitive to their transient details, which may be important in speech processing tasks such as speaker recognition. In this paper, we analyze the differences between Mel-scale filter bank energies (MFBEs) and MFCCs and show that MFBEs are more effective inputs for deep learning-based Vietnamese speaker verification. MFBEs help deep learning models learn better speaker representations, with a more compact distribution of embedding vectors. Experiments on two Vietnamese speaker verification datasets show that MFBEs consistently outperform MFCCs as input to some state-of-the-art deep learning models. The equal error rate (EER) was reduced by 1.14% on the Vietnam-Celeb test dataset with the ResNetSE-34 model, and by 2.36% (a 51.6% relative improvement) on the VLSP2021 test dataset with the ECAPA-TDNN model and transfer learning.
computer science, information systems; telecommunications; engineering, electrical & electronic
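
The relationship between the two feature types compared in the abstract can be illustrated with a short sketch. In the usual textbook formulation, MFBEs are log energies at the outputs of triangular Mel-spaced filters, and MFCCs are a DCT-II of those log energies. The code below is a minimal standard-library illustration under assumed settings (16 kHz sample rate, 256-sample frame, 24 filters, a synthetic three-tone signal, the common 2595·log10(1 + f/700) Mel formula). None of these values, and none of the function names, come from the paper, and this is not the authors' feature pipeline.

```python
import math

def hz_to_mel(f_hz):
    # Common Mel-scale formula (an assumption; the paper's variant is not stated here).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the Mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for k in range(n_bins):
            f = k * sample_rate / n_fft
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def power_spectrum(frame):
    """Naive DFT power spectrum; adequate for a single short frame."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2.0 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(2.0 * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec

def log_mfbe(frame, bank, eps=1e-10):
    """Log Mel-scale filter bank energies for one frame."""
    spec = power_spectrum(frame)
    return [math.log(sum(w * p for w, p in zip(row, spec)) + eps) for row in bank]

def dct_ii_ortho(x):
    """Orthonormal DCT-II: maps log MFBEs to cepstral coefficients."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
        out.append(s * (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)))
    return out

# Hypothetical settings, chosen only for illustration.
SAMPLE_RATE, N_FFT, N_FILTERS = 16000, 256, 24

# Synthetic Hamming-windowed frame made of three tones.
frame = [
    (0.54 - 0.46 * math.cos(2.0 * math.pi * t / (N_FFT - 1)))
    * (math.sin(2.0 * math.pi * 220.0 * t / SAMPLE_RATE)
       + 0.5 * math.sin(2.0 * math.pi * 1760.0 * t / SAMPLE_RATE)
       + 0.25 * math.sin(2.0 * math.pi * 3500.0 * t / SAMPLE_RATE))
    for t in range(N_FFT)
]

bank = mel_filterbank(N_FILTERS, N_FFT, SAMPLE_RATE)
mfbe = log_mfbe(frame, bank)   # filter bank features
mfcc = dct_ii_ortho(mfbe)      # cepstral features derived from them

# Share of total squared magnitude carried by the first 13 cepstral coefficients.
total = sum(c * c for c in mfcc)
low_order_share = sum(c * c for c in mfcc[:13]) / total
```

Because the DCT used here is orthonormal, `mfcc` and `mfbe` carry the same total squared magnitude; the transform only redistributes it across coefficients. Inspecting `low_order_share` for real speech frames is one way to see how much of that magnitude ends up in the low-order coefficients, which is the property the abstract contrasts with the filter bank representation.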