Abstract: One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, several phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained on a known language, on the assumption that recognizers for different languages extract complementary information for effective fusion. However, this method is limited by the large amount of training data it requires with word- or phone-level transcriptions. Moreover, score fusion is not optimal, because fusion at the feature or model level retains more information than fusion at the score level. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR): training multiple acoustic diversified phone recognizers and fusing them at the feature level. The phone recognizers are trained on the same speech data but with different acoustic features and model training techniques. For the acoustic features, both Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For the model training, we examine the maximum likelihood and feature minimum phone error methods as ways to train complementary acoustic models. We fuse the phonotactic features of the acoustic diversified phone recognizers with a simple linear fusion method to build the PPRVSM system, and introduce a novel logistic regression optimized weighting (LROW) approach to optimize the fusion factors. The experimental results show that fusion at the feature level is more effective than fusion at the score level, and that the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. The best performing system reported in this paper achieves equal error rates (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.
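The feature-level linear fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the following is an assumption: each recognizer is taken to produce a phonotactic feature vector of the same dimension (plausible if all recognizers share one phone inventory), and fusion is taken to be a weighted sum. The function name, the toy vectors, and the hand-picked weights are all hypothetical; in particular, the weights here are not produced by LROW.

```python
import numpy as np


def fuse_phonotactic_features(features, weights):
    """Weighted linear combination of per-recognizer feature vectors.

    features: array-like of shape (n_recognizers, dim), one phonotactic
              feature vector (e.g. n-gram statistics) per recognizer.
    weights:  array-like of shape (n_recognizers,), fusion factors.
              Assumed to be hand-set here; the paper optimizes them with LROW.
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if features.ndim != 2 or features.shape[0] != weights.shape[0]:
        raise ValueError("need one weight per recognizer feature vector")
    # Normalize so the fusion factors sum to one.
    weights = weights / weights.sum()
    # Weighted sum over the recognizer axis -> one fused vector of length dim.
    return weights @ features


# Hypothetical toy feature vectors from MFCC-, PLP- and TFC-based recognizers.
mfcc_vec = np.array([0.5, 0.3, 0.2])
plp_vec = np.array([0.4, 0.4, 0.2])
tfc_vec = np.array([0.6, 0.2, 0.2])

fused = fuse_phonotactic_features([mfcc_vec, plp_vec, tfc_vec], [0.5, 0.3, 0.2])
print(fused)
```

In a full system the fused vector, rather than each recognizer's separate score, would be passed to the vector space model classifier, which is what distinguishes feature-level from score-level fusion.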

Improved Speech Recognition Using Discriminative Integration of Multiple Local Classifiers in Lattice Rescoring

Improvement Comparison of Different Lattice-based Discriminative Training Methods in Chinese-monolingual and Chinese-English-bilingual Speech Recognition

Integrating Lattice-Free MMI into End-to-End Speech Recognition

Combining Hybrid DNN-HMM ASR Systems with Attention-Based Models Using Lattice Rescoring

Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring

Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models

Improved lattice-based spoken document retrieval by directly learning from the evaluation measures

Mandarin-English bilingual phone modeling and combining MPE based Discriminative training for cross-language speech recognition

Improving Accented Mandarin Speech Recognition by Using Recurrent Neural Network Based Language Model Adaptation

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition

Discriminative Score Fusion for Language Identification

An Improved Linear Discriminant Analysis for Mandarin Digit Speech Recognition

Improved Phonotactic Language Recognition Using Collaborated Language Model

Improved spoken term detection using support vector machines based on lattice context consistency

Enhancing CTC-based speech recognition with diverse modeling units

Language Recognition Based on Acoustic Diversified Phone Recognizers and Phonotactic Feature Fusion

High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model

Improved context-dependent acoustic modeling for continuous Chinese speech recognition

Improved BLSTM RNN Based Accent Speech Recognition Using Multi-task Learning and Accent Embeddings

Phone modeling and combining discriminative training for Mandarin-English bilingual speech recognition