Affect-insensitive Speaker Recognition Systems Via Emotional Speech Clustering Using Prosodic Features

Dongdong Li,Yubo Yuan,Zhaohui Wu,Yingchun Yang
DOI: https://doi.org/10.1007/s00521-014-1708-8
2014-01-01
Abstract:Voice-based biometric security systems involving only neutral speech have achieved promising performance. However, the speakers are very likely to fail the recognition when the test data exhibit multiple emotions. This paper aimed to address the mismatch of the emotional states between training and testing speech. We discuss different modeling strategies that incorporate the emotions (affects) of speakers into the training stage of a Mandarin-based speaker recognition system and propose an alternative approach, which could optimize the utilization of the limited affective speech. The training speeches are partitioned and clustered by the trends of the prosodic variations. Multiple models are built based on the clustered speech for a given speaker. The prosodic differences are characterized by a combination of features that describe the changes of the fundamental frequencies and energy contours. The experiments were carried out based on the Mandarin Affective Speech Corpus. The result shows 73.37 % improvement in recognition rate over that of the traditional speaker verification tasks relatively and also achieves 63.53 % higher in performance over the structural training-based systems relatively.
What problem does this paper attempt to address?