Abstract: One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, several phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language and is assumed to extract complementary information for effective fusion. However, this method is limited by its need for a large amount of training data with word- or phone-level transcriptions. Moreover, score-level fusion is not optimal, because fusion at the feature or model level retains more information. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR): multiple acoustic diversified phone recognizers are trained and then fused at the feature level. The phone recognizers are trained on the same speech data but use different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For the model training, we examine the use of the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. We fuse the phonotactic features of the acoustic diversified phone recognizers using a simple linear fusion method to build the PPRVSM system. A novel logistic regression optimized weighting (LROW) approach is introduced to optimize the fusion factors. The experimental results show that fusion at the feature level is more effective than fusion at the score level, and that the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. The best-performing system reported in this paper achieves equal error rates (EERs) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.
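
As an illustration of the simple linear fusion of phonotactic features described in the abstract, the following minimal sketch forms a weighted sum of per-recognizer feature vectors. It is not the authors' implementation: the function name `fuse_phonotactic_features`, the toy vectors and the fixed weights are all hypothetical, the vectors are assumed to share one phone n-gram indexing, and the LROW weight optimization is not reproduced because the abstract does not specify it.

```python
import numpy as np


def fuse_phonotactic_features(features, weights):
    """Weighted linear combination of per-recognizer feature vectors.

    features: one phonotactic vector per phone recognizer, all of the
              same dimension (assumed here, not stated in the abstract).
    weights:  one fusion factor per recognizer (hand-picked below; the
              paper optimizes them with LROW, which is not shown).
    """
    features = np.asarray(features, dtype=float)  # shape: (n_recognizers, dim)
    weights = np.asarray(weights, dtype=float)    # shape: (n_recognizers,)
    if features.ndim != 2 or weights.shape != (features.shape[0],):
        raise ValueError("expected one weight per recognizer")
    return weights @ features                     # shape: (dim,)


# Toy n-gram statistics for one utterance from three hypothetical
# recognizers that differ only in their acoustic front end.
mfcc = [0.50, 0.30, 0.20]
plp = [0.40, 0.40, 0.20]
tfc = [0.60, 0.20, 0.20]

fused = fuse_phonotactic_features([mfcc, plp, tfc], weights=[0.5, 0.25, 0.25])
```

In a system of this kind, the fused vector would be passed to a single vector space model classifier, whereas score-level fusion would train one classifier per recognizer and combine their output scores afterwards.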

Speaking Rate Normalization With Lattice-Based Context-Dependent Phoneme Duration Modeling For Personalized Speech Recognizers On Mobile Devices

Linguistic Feedback Supports Rapid Adaptation to Acoustically Degraded Speech

Personalized Speech Recognizer With Keyword-Based Personalized Lexicon And Language Model Using Word Vector Representations

Personalized language modeling by crowd sourcing with social network data for voice access of cloud applications

Towards High Performance LVCSR in Speech-to-Speech Translation System on Smart Phones

Language Recognition Based on Acoustic Diversified Phone Recognizers and Phonotactic Feature Fusion

A System for Mandarin Short Phrase Recognition on Portable Devices

On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech

Towards Robustness to Speech Rate in Mandarin All-Syllable Recognition

Phone Lattice Reconstruction for Embedded Language Recognition in LVCSR

Conformer-Based Speech Recognition On Extreme Edge-Computing Devices

Research on Score Domain Speaking Rate Normalization for Speaker Recognition

Refining phoneme segmentations using speaker-adaptive context dependent boundary models

Silenttalk: Lip Reading Through Ultrasonic Sensing on Mobile Phones

Reducing pronunciation lexicon confusion and using more data without phonetic transcription for pronunciation modeling

Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition

Recurrent Neural Network Based Language Model Personalization by Social Network Crowdsourcing

Research on Context-Dependent Acoustical Unit (Triphone) for Mandarin Continuous Speech Recognition

Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

The speaking rate adaptation algorithm in Putonghua continuous speech recognition