Abstract: Uyghur is a minority language, and resources for Uyghur Automatic Speech Recognition (ASR) research remain scarce. THUYG-20 is currently the only open-source Uyghur speech dataset. The state-of-the-art results on its clean, noise-free speech test task have not been updated since the first release, which reveals a large gap in ASR development between mainstream languages and Uyghur. In this paper, we try to bridge the gap by thoroughly optimizing the ASR systems and by developing a morpheme-based decoder that has long been missing: MLDG-Decoder (Morpheme Lattice Dynamically Generating Decoder for Uyghur DNN-HMM systems). We have open-sourced the decoder. The MLDG-Decoder employs an algorithm named "on-the-fly composition with FEBABOS", which allows the back-off states and transitions to act as relay stations during on-the-fly composition. When a 4-gram morpheme-based Language Model (LM) is used, the algorithm enables the dynamically generated graph to constrain the morpheme sequences in the lattices as effectively as the static, fully composed graph does. We trained deeper and wider neural network acoustic models and experimented with three decoding schemes. The experimental results show that decoding with the static, fully composed graph reduces the state-of-the-art Word Error Rate (WER) on the clean, noise-free speech test task in THUYG-20 to 14.24%. The MLDG-Decoder reduces the WER to 14.54% while keeping memory consumption reasonable. Using the open-sourced MLDG-Decoder, readers can easily reproduce the experimental results in this paper.
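
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Word Error Rate is conventionally computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the scoring procedure actually used in the paper is not described in the abstract and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: both transcripts are tokenized by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy English example; real evaluation would use THUYG-20 transcripts.
    print(word_error_rate("the cat sat down", "the dog sat"))
```

Under this definition, a WER of 14.24% corresponds to roughly 14 word-level errors per 100 reference words.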

Extending Multilingual ASR to New Languages Using Supplementary Encoder and Decoder Components

AudioVSR: Enhancing Video Speech Recognition with Audio Data

A Parameter-efficient Language Extension Framework for Multilingual ASR

Dual-Pipeline with Low-Rank Adaptation for New Language Integration in Multilingual ASR

Continual Learning Optimizations for Auto-regressive Decoder of Multilingual ASR systems

Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection

Master-ASR: Achieving Multilingual Scalability and Low-Resource Adaptation in ASR with Modular Learning

Learn and Don't Forget: Adding a New Language to ASR Foundation Models

Improving Automatic Speech Recognition for Non-Native English with Transfer Learning and Language Model Decoding

Configurable Multilingual ASR with Speech Summary Representations

Language-universal phonetic encoder for low-resource speech recognition

Improving Uyghur ASR systems with decoders using morpheme-based language models

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

Optimizing Byte-level Representation for End-to-end ASR

Scaling Up Deliberation for Multilingual ASR

Semantic Data Augmentation for End-to-End Mandarin Speech Recognition

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

Prompting Large Language Models with Speech Recognition Abilities

Transfer learning of language-independent end-to-end ASR with language model fusion

Data Augmentation for End-to-end Code-switching Speech Recognition