Abstract: This study explores the feasibility of constructing a small-scale speech recognition system capable of competing with larger, modern automated speech recognition (ASR) systems in both performance and word error rate (WER). Our central hypothesis is that a compact transformer-based ASR model can yield results comparable, specifically in terms of WER, to those of traditional ASR models while challenging contemporary ASR systems that are significantly larger computationally. The aim is to extend ASR capabilities to under-resourced languages with limited corpora, catering to scenarios where practitioners face constraints in both data availability and computational resources. The model, comprising a compact convolutional neural network (CNN) and transformer architecture with 2.214 million parameters, challenges the conventional wisdom that large-scale transformer-based ASR systems are essential for achieving high accuracy. In comparison, contemporary ASR systems often deploy over 300 million parameters. Trained on a modest dataset of approximately 3000 h – significantly less than the 50,000 h used in larger systems – the proposed model leverages the Common Voice and LibriSpeech datasets. Evaluation on the LibriSpeech test-clean and test-other datasets produced character error rates (CERs) of 6.40% and 16.73% and WERs of 16.03% and 35.51%, respectively. Comparisons with existing architectures showcase the efficiency of our model. A gated recurrent unit (GRU) architecture, albeit achieving lower error rates, incurred a computational cost 24 times larger than that of our proposed model. Large-scale transformer architectures, while achieving marginally lower WERs (2%–4% on LibriSpeech test-clean), require 200 times more parameters and 53,000 additional hours of training data. Modern large language models can be used to improve WERs, but they require large computational resources.
To further enhance performance, a small 4-gram language model was integrated into our end-to-end ASR model, resulting in improved WERs. The overarching goal of this work is to provide a practical solution for practitioners dealing with limited datasets and computational resources, particularly in the context of under-resourced languages.
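
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how WER and CER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance`, `wer`, and `cer` are illustrative and not taken from the paper. Tokenising on whitespace and counting spaces as characters in the CER are assumptions, and the authors' scoring setup may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and
    deletions needed to turn `hyp` into `ref`.
    """
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions needed when the hypothesis is empty
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenisation
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    # Assumption: spaces count as characters.
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER = {wer(ref, hyp):.3f}")  # 2 word errors over 6 reference words
    print(f"CER = {cer(ref, hyp):.3f}")  # 5 character errors over 22 reference characters
```

Because a single wrong character makes a whole word count as an error, the WER is usually higher than the CER for a character-based model, which matches the pattern in the figures reported above.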

End-to-end automated speech recognition using a character based small scale transformer architecture
