Emotion-detecting Based Model Selection for Emotional Speech Recognition
Yicheng Pan, Mingxing Xu, L. Q. Liu, Peng Jia
2006-01-01
Abstract: It is well known that the performance of speech recognition degrades dramatically in the presence of emotion, so dealing with emotion properly is crucial. The most widely used approaches include robust feature extraction, speaker normalization, and model tuning/retraining. In this study, a novel method is proposed: an adaptation technique is adopted to transform a general model into an emotion-specific one using a small amount of emotional speech. Moreover, a model-selection strategy based on emotion detection is proposed and proven to be effective, and the overall mean recognition rate increased to 80.79% with an Error Rate Reduction (ERR) of 166.55% compared to the neutral speech Acoustic Model (AM).

Keywords: speech recognition, emotional speech, adaptation, emotion-detection, model-selection

I. INTRODUCTION

Speech recognition got its first jump-start in AT&T's Bell Labs in 1936, when researchers developed the first electronic speech synthesizer (HUA01). Since then, great breakthroughs and developments have been achieved in many research fields of speech recognition, such as the evolution from the first isolated-word, speaker-dependent, small-vocabulary systems to today's continuous, speaker-independent, infinite-vocabulary systems. In practical applications, however, the performance of a speech recognition system degrades dramatically under the influence of factors such as channel, environment noise, pronunciation, and emotional state. How to reduce these influences has become a hot research topic. Some work has been done on channel, noise, pronunciation, etc., but comparatively less on emotion.

With the rapid development of human-computer interaction, research on affect and affective computing is receiving much more attention (PIC98). Some achievements have been made in the fields of facial emotion, body gesture, etc. As one of the most important means by which people communicate, speech conveys not only verbal but also emotional information. So the ultimate goal of speech recognition should be to identify both verbal content and emotional hints.
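The model-selection idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the emotion labels, the model names, the `select_model` function, and the `min_score` fallback threshold are all hypothetical, chosen only to show how an emotion detector's output might route an utterance to an emotion-specific acoustic model, with the neutral model as the fallback.

```python
from typing import Dict


def select_model(emotion_scores: Dict[str, float],
                 models: Dict[str, str],
                 fallback: str = "neutral",
                 min_score: float = 0.5) -> str:
    """Pick an acoustic model name from emotion-detector scores.

    emotion_scores: hypothetical detector output, emotion label -> score.
    models: emotion label -> name of the adapted acoustic model.
    The fallback (neutral) model is used when the detector is unsure
    or when no adapted model exists for the detected emotion.
    """
    if not emotion_scores:
        return models[fallback]
    # Take the emotion with the highest detector score.
    emotion, score = max(emotion_scores.items(), key=lambda kv: kv[1])
    if score < min_score or emotion not in models:
        return models[fallback]
    return models[emotion]


# Hypothetical model inventory: one neutral model plus adapted ones.
MODELS = {"neutral": "am_neutral", "angry": "am_angry", "happy": "am_happy"}

confident = select_model({"angry": 0.8, "happy": 0.1, "neutral": 0.1}, MODELS)
uncertain = select_model({"angry": 0.4, "happy": 0.35, "neutral": 0.25}, MODELS)
```

In this sketch a confident detection selects the matching adapted model, while a low-confidence detection falls back to the neutral model. The threshold is an assumption added for illustration and is not taken from the paper.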
However, most research on speech and emotion has focused on recognizing the emotion itself. It is reported that recognition of read speech, and of non-read speech produced in a careful style, can achieve an accuracy of about 95%, but this is still far from the ultimate goal of recovering free conversational speech uttered by any speaker in any environment (ATH05). Experiments show that it is particularly difficult to recover verbal content when the speaker speaks in an emotional way. This problem is the main focus of this paper.

Solutions to the distorted speech problem can usually be categorized into three classes, from the bottom level to the top level.

(1) Feature level: This approach aims to find more robust acoustic features or to compensate for the effect of distortion during the recognition testing phase (e.g., formant location and bandwidth stress equalization) (BOU00), (HAN96).

(2) AM level: tuning the AM training method so that the AM better matches distorted speech. Some speech recognition studies have adapted the recognizer to the distorted input speech during training, and others have considered alternative training methods such as multi-style training (LIP87).

(3) Language Model (LM) level: adding high-level knowledge clues to the LM. T. Athanaselis improved the recognition rate for spontaneous emotionally colored speech by using a language model based on an increased representation of emotional utterances (ATH05).

Adaptation techniques can be used to modify system parameters to better match variations in microphone, transmission channel, environment noise, speaker, style, and application contexts (LI05). In this paper, this technique was used to build several emotion-dependent