Abstract: Speaker recognition (SRE), also called voiceprint recognition, is the problem of determining the identity of a speaker from a sample of the speech signal. It is an important branch of speech signal processing and has many potential applications, such as telephone banking, access control, information security, law enforcement, and other forensic applications (Bimbot et al., 2004; Campbell Jr., 1997; Cole et al., 1997; Kinnunen & Li, 2010; Reynolds, 2002). Compared with other biometric techniques, speaker recognition has several advantages: (1) Speech samples are convenient, natural, and inexpensive to acquire: no special device is needed, and a telephone, mobile phone, or ordinary microphone is adequate. (2) It can be used remotely: with ubiquitous telecommunication networks and the Internet, speech samples can easily be transferred by telephone or VoIP, which makes remote recognition possible. (3) Speech carries many innate characteristics: from it we can extract information about the vocal tract, mouth, tongue, soft palate, nasal cavity, etc. (4) Speech also carries acquired characteristics, such as tone, volume, pace, rhythm, and rhetoric, which reflect the speaker's place of residence, education level, and personal habits. In speaker recognition, the Gaussian mixture model-universal background model (GMM-UBM) is a classical yet widely used method for text-independent speaker verification (Reynolds et al., 2000). In this method, the target speaker is modeled by a GMM and the impostors are modeled by a UBM. At test time, the likelihood of the speech sample is computed under both the GMM and the UBM, and a likelihood ratio hypothesis test is then used for speaker verification. Besides GMM-UBM, several other methods have been developed recently. 
The most successful ones include the support vector machine using GMM supervectors (GSV-SVM) (Campbell et al., 2006), which concatenates the GMM mean vectors as the input for SVM training and testing, and joint factor analysis (JFA) (Kenny et al., 2007), which jointly models the channel subspace and the speaker subspace. Although these methods have progressed rapidly, GMM-UBM remains the basis for their development. Meanwhile, discriminative techniques such as minimum classification error (MCE), maximum mutual information (MMI), minimum phone error (MPE), and feature-domain MPE (fMPE) have achieved great success in speech recognition and language recognition (Burget et al., 2006; Juang & Katagiri, 1992; Povey & Kingsbury, 2007; Woodland & Povey, 2002).
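
The GMM-UBM likelihood ratio test and the GMM supervector described above can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the implementation from the cited works: it assumes diagonal-covariance GMMs stored as `(weights, means, variances)` tuples, averages the log-likelihood over frames, and compares the score against a hypothetical decision threshold. The function names (`gmm_log_likelihood`, `llr_score`, `gmm_supervector`) and the toy parameters are invented for this example.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D) feature vectors; weights: (M,);
    means and variances: (M, D). The diagonal covariance is an assumption.
    """
    diff = frames[:, None, :] - means[None, :, :]                       # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)   # (T, M)
    log_weighted = np.log(weights) + log_comp
    # Numerically stable log-sum-exp over the mixture components.
    peak = log_weighted.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return frame_ll.mean()

def llr_score(frames, speaker_gmm, ubm):
    """Log-likelihood ratio: target-speaker GMM versus the UBM."""
    return gmm_log_likelihood(frames, *speaker_gmm) - gmm_log_likelihood(frames, *ubm)

def gmm_supervector(means):
    """Concatenate the GMM mean vectors into one supervector (as in GSV-SVM)."""
    return means.reshape(-1)

# Toy two-component models in two dimensions (hypothetical values).
ubm = (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2)))
speaker = (np.array([0.5, 0.5]), np.array([[3.0, 3.0], [4.0, 4.0]]), np.ones((2, 2)))

test_frames = np.full((5, 2), 3.5)   # frames close to the speaker model
threshold = 0.0                      # hypothetical decision threshold
accept = llr_score(test_frames, speaker, ubm) > threshold
```

In practice the speaker GMM is usually obtained by adapting the UBM to the speaker's enrollment data, and the threshold is tuned on held-out trials; both steps are outside the scope of this sketch.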

Discriminative Universal Background Model Training for Speaker Recognition
