Abstract: One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, several phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained on a known language, and the recognizers are assumed to extract complementary information for effective fusion. However, this method requires a large amount of training data with word- or phone-level transcriptions. Moreover, score-level fusion is not optimal, because fusion at the feature or model level retains more information than fusion at the score level. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR): multiple acoustically diversified phone recognizers are trained and fused at the feature level. The phone recognizers are trained on the same speech data but with different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For model training, we examine the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. We fuse the phonotactic features of the acoustically diversified phone recognizers with a simple linear fusion method to build the PPRVSM system, and introduce a novel logistic regression optimized weighting (LROW) approach to optimize the fusion factors. The experimental results show that fusion at the feature level is more effective than fusion at the score level, and that the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. The best-performing system reported in this paper achieves equal error rates (EERs) of 1.24%, 4.98%, and 14.96% on the NIST 2007 LRE 30-second, 10-second, and 3-second evaluation databases, respectively, under the closed-set test condition.
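
The feature-level linear fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the recognizers share one phone inventory, so their phonotactic vectors have the same dimension; 1-best phone strings and bigram relative frequencies stand in for whatever phonotactic statistics the paper actually uses; and the weights are fixed placeholders rather than factors optimized by LROW. The names `ngram_features`, `fuse_features`, `inventory`, and `decodings` are hypothetical.

```python
from collections import Counter
from itertools import product


def ngram_features(phones, inventory, n=2):
    """Relative-frequency vector of phone n-grams over a fixed inventory."""
    counts = Counter(tuple(phones[i:i + n]) for i in range(len(phones) - n + 1))
    total = sum(counts.values())
    # Fixed ordering over all possible n-grams keeps vectors comparable.
    keys = list(product(inventory, repeat=n))
    return [counts[k] / total if total else 0.0 for k in keys]


def fuse_features(vectors, weights):
    """Weighted linear combination of equally sized feature vectors."""
    if len(vectors) != len(weights):
        raise ValueError("need exactly one weight per recognizer")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all feature vectors must share one dimension")
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


# Hypothetical decodings of one utterance by three acoustic front ends.
inventory = ["a", "b", "c"]
decodings = {
    "mfcc": ["a", "b", "a", "c", "a", "b"],
    "plp": ["a", "b", "b", "c", "a", "b"],
    "tfc": ["a", "c", "a", "c", "a", "b"],
}
weights = [0.5, 0.3, 0.2]  # placeholder fusion factors, not LROW output

features = [ngram_features(p, inventory) for p in decodings.values()]
fused = fuse_features(features, weights)  # one vector for the VSM back end
```

Under these assumptions, a single fused vector per utterance is passed to one vector space model, whereas score-level fusion would train a separate back end per recognizer and combine only their output scores.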

Language Recognition Based on Acoustic Diversified Phone Recognizers and Phonotactic Feature Fusion
