Speaker identification and localization using shuffled MFCC features and deep learning

Mahdi Barhoush, Ahmed Hallawa, Anke Schmeink
DOI: https://doi.org/10.1007/s10772-023-10023-2
2023-01-30
International Journal of Speech Technology
Abstract: Machine learning has recently driven significant advances in automatic speaker identification and localization systems. However, this progress comes at the cost of complex models, heavy computation, and larger numbers of microphone arrays and training data. Therefore, in this work, we propose a new end-to-end identification and localization model based on a simple fully connected deep neural network (FC-DNN) and just two input microphones. By exploiting a new data augmentation approach, this model can jointly or separately localize and identify an active speaker with high accuracy in single- and multi-speaker scenarios. To this end, we propose a novel Mel Frequency Cepstral Coefficients (MFCC) based feature called Shuffled MFCC (SHMFCC) and its variant, Difference Shuffled MFCC (DSHMFCC). To test our approach, we analyzed the performance of the proposed identification and localization model on the new features under different noise and reverberation conditions for single- and multi-speaker scenarios. The results show that our approach achieves high accuracy in these scenarios, outperforms the baseline and conventional methods, and remains robust even with small training sets.