Abstract: It is well known that the performance of speech recognition degrades dramatically in the presence of emotion, so handling emotion properly is crucial. The most widely used approaches include robust feature extraction, speaker normalization, and model tuning/retraining. In this study, a novel method is proposed: an adaptation technique is adopted to transform a general model into an emotion-specific one using a small amount of emotional speech. Moreover, a model-selection strategy based on emotion detection is proposed and proven to be effective, and the overall mean recognition rate increased to 80.79% with an Error Rate Reduction (ERR) of 166.55% compared to the neutral speech Acoustic Model (AM).

Keywords: speech recognition, emotional speech, adaptation, emotion-detection, model-selection

I. INTRODUCTION

Speech recognition got its first jump-start at AT&T's Bell Labs in 1936, when researchers developed the first electronic speech synthesizer (HUA01). Since then, great breakthroughs and developments have been achieved in many research fields of speech recognition, such as the evolution from the first isolated-word, speaker-dependent, small-vocabulary systems to today's continuous, speaker-independent, unlimited-vocabulary systems. In practical applications, however, the performance of a speech recognition system degrades dramatically under the influence of factors such as channel, environmental noise, pronunciation, and emotional state. How to reduce these influences has become an active research topic. Some work has been done on channel, noise, pronunciation, etc., but comparatively less on emotion.

With the rapid development of human-computer interaction, research on affect and affective computing is receiving much more attention (PIC 98). Some achievements have been made in the fields of facial emotion, body gesture, etc. As one of the most important means by which people communicate, speech conveys not only verbal but also emotional information. The ultimate goal of speech recognition should therefore be to identify both the verbal content and the emotional cues.
However, most research on speech and emotion has focused on recognizing the emotion. It is reported that read speech and non-read speech produced in a careful style can be recognized with an accuracy of about 95%, but this is still far from the ultimate goal of recovering free conversational speech uttered by any speaker in any environment (ATH05). Experiments show that it is particularly difficult to recover the verbal content when the speaker speaks in an emotional way. This problem is the main focus of this paper.

Solutions to the distorted speech problem can usually be categorized into three classes, from the bottom level to the top level.

(1) Feature level: This approach aims to find more robust acoustic features or to compensate for the effect of distortion during the recognition testing phase (e.g., formant location and bandwidth stress equalization) (BOU00), (HAN96).

(2) AM level: tuning the AM training method to make the AM better match the distorted speech. Some speech recognition studies have adapted the recognizer to the distorted input speech during training, and others have considered alternative training methods such as multi-style training (LIP 87).

(3) Language Model (LM) level: adding high-level knowledge cues to the LM. T. Athanaselis improved the recognition rate for spontaneous, emotionally colored speech by using a language model based on an increased representation of emotional utterances (ATH05).

Adaptation techniques can be used to modify system parameters to better match variations in microphone, transmission channel, environmental noise, speaker, style, and application context (LI05). In this paper this technique was used to build several emotion-dependent


Emotion-detecting Based Model Selection for Emotional Speech Recognition