Speaking Rate Normalization With Lattice-Based Context-Dependent Phoneme Duration Modeling For Personalized Speech Recognizers On Mobile Devices

Ching-Feng Yeh, Hung-Yi Lee, Lin-Shan Lee
DOI: https://doi.org/10.21437/interspeech.2013-433
2013-01-01
Abstract: Voice access to cloud applications, including social networks, from mobile devices has become attractive, and personalized speech recognizers on mobile devices have become feasible because most mobile devices have only a single user. Speaking rate variation is known to be an important source of performance degradation in spontaneous speech recognition. Speaking rate is speaker dependent and changes from time to time for every speaker; furthermore, the pattern of speaking rate variation is unique to each speaker. An approach called continuous frame rate normalization (CFRN) [1] was recently proposed to address the speaking rate variation problem. In this paper, we propose an extended version of CFRN for personalized speech recognizers on mobile platforms. In this approach, context-dependent phoneme duration models adapted to each speaker are used to estimate the speaking rate utterance by utterance, based on lattices obtained with a first-pass recognizer. The proposed approach was evaluated on both read speech and spontaneous recordings from mobile platforms, and significant improvements were observed in the experimental results.
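The abstract does not give the estimator itself, so the following is only a minimal sketch of how a per-utterance speaking rate could be derived from lattice arcs and a duration model. It assumes each lattice arc carries a context-dependent phoneme label, an observed duration in frames, and an arc posterior, and that the speaker-adapted duration model is reduced to a mean duration per phoneme. The names `Arc`, `estimate_rate`, and `normalized_frame_shift`, and the posterior-weighted ratio itself, are illustrative assumptions, not the authors' formulation.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Arc:
    """One phoneme arc in a first-pass lattice (hypothetical structure)."""
    phone: str        # context-dependent phoneme label, e.g. "a-b+c"
    duration: float   # observed duration in frames
    posterior: float  # arc posterior probability in the lattice


def estimate_rate(arcs: Iterable[Arc], mean_duration: Dict[str, float]) -> float:
    """Relative speaking rate of one utterance (assumed estimator).

    Returns the posterior-weighted expected duration divided by the
    posterior-weighted observed duration. A value above 1.0 means the
    speaker is faster than the duration model expects.
    """
    expected = 0.0
    observed = 0.0
    for arc in arcs:
        expected += arc.posterior * mean_duration[arc.phone]
        observed += arc.posterior * arc.duration
    if observed <= 0.0:
        raise ValueError("lattice carries no usable duration mass")
    return expected / observed


def normalized_frame_shift(base_shift_ms: float, rate: float) -> float:
    """Scale the frame shift so fast speech is sampled more densely."""
    if rate <= 0.0:
        raise ValueError("rate must be positive")
    return base_shift_ms / rate
```

Under these assumptions, an utterance whose phonemes last half as long as the model expects yields a rate of 2.0, and a 10 ms base frame shift would be halved for the second recognition pass.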