Abstract: A typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, several phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language, on the assumption that recognizers for different languages extract complementary information for effective fusion. However, this approach is limited by the large amount of training data for which word-level or phone-level transcription is required. Moreover, score-level fusion is not optimal, because fusion at the feature or model level retains more information. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR): multiple acoustically diversified phone recognizers are trained and fused at the feature level. The phone recognizers are trained on the same speech data but use different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For model training, we examine the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. We fuse the phonotactic features of the acoustically diversified phone recognizers with a simple linear fusion method to build the PPRVSM system, and introduce a novel logistic regression optimized weighting (LROW) approach to optimize the fusion factors. Experimental results show that fusion at the feature level is more effective than at the score level, and that the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement.
The best performing system reported in this paper achieves an equal error rate (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.
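
As a rough illustration of linear fusion at the feature level, the sketch below combines the phonotactic feature vectors produced by several phone recognizers into one vector using a weighted sum. It is not the paper's implementation. It assumes that all recognizers share one phone set, so their n-gram feature vectors have the same dimension, and that the fusion weights are normalized to sum to one. The function name `fuse_features`, the variable names, and the numeric values are hypothetical, and the weights are fixed by hand here rather than optimized with LROW.

```python
from typing import List, Sequence


def fuse_features(features: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Weighted linear combination of equally sized feature vectors.

    `features` holds one phonotactic feature vector per phone recognizer;
    `weights` holds one non-negative fusion factor per recognizer.
    Assumption: the weights are rescaled to sum to one before fusion.
    """
    if not features or len(features) != len(weights):
        raise ValueError("need one weight per feature vector")
    dim = len(features[0])
    if any(len(vec) != dim for vec in features):
        raise ValueError("all feature vectors must have the same dimension")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    norm = [w / total for w in weights]
    # Each fused component is the weighted sum of that component
    # across all recognizers.
    return [sum(w * vec[i] for w, vec in zip(norm, features))
            for i in range(dim)]


# Toy n-gram probability vectors from three hypothetical recognizers
# (for example MFCC, PLP and TFC front ends); values are illustrative only.
mfcc_vec = [0.5, 0.25, 0.25]
plp_vec = [0.25, 0.5, 0.25]
tfc_vec = [0.25, 0.25, 0.5]

fused = fuse_features([mfcc_vec, plp_vec, tfc_vec], [2.0, 1.0, 1.0])
print(fused)
```

In a full system the fused vector would then be passed to the vector space modeling back end, and the weights would be tuned on development data instead of being set manually.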

Multi-feature Combination for Speaker Recognition

Multi-resolution Time Frequency Feature and Complementary Combination for Short Utterance Speaker Recognition

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Variant Time-Frequency Cepstral Features for Speaker Recognition

Combining Speech Enhancement and Discriminative Feature Extraction for Robust Speaker Recognition

Short Utterance Speaker Recognition Based on Speech High Frequency Information Compensation and Dynamic Feature Enhancement Methods

Methods of Combining Multiple Classifiers with Different Features and Their Applications to Text-Independent Speaker Identification

Speaker Recognition Using DMFCC over Telephone Channels

Combining Information from Multi-Stream Features Using Deep Neural Network in Speech Recognition

Time–Frequency Cepstral Features and Heteroscedastic Linear Discriminant Analysis for Language Recognition

A Fusion Approach to Spoken Language Identification Based on Combining Multiple Phone Recognizers and Speech Attribute Detectors

Time-Frequency Cepstral Features and Combining Discriminative Training for Phonotactic Language Recognition

Fusion of deep shallow features and models for speaker recognition

System Combination for Short Utterance Speaker Recognition

A Feature Integration Network for Multi-Channel Speech Enhancement

Multi-Channel Feature Adaptation for Robust Speech Recognition

Improving Performance of Speaker Identification System Using Complementary Information Fusion

Improving Short-Duration Speaker Recognition by Joint Bark-Wavelet Acoustic Feature Coupling and Triplet Dual-Attention Mechanism Network

Language Recognition Based on Acoustic Diversified Phone Recognizers and Phonotactic Feature Fusion

Orthogonal subspace combination based on the joint factor analysis for text-independent speaker recognition

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition