Abstract: Endangered languages are generally low-resource, and as intangible cultural resources they cannot be renewed once lost. Automatic speech recognition (ASR) is an effective means of protecting such languages. However, low-resource languages have few native speakers and insufficient labeled corpora. ASR therefore suffers from high speaker dependence and overfitting, which greatly reduce recognition accuracy. To address these deficiencies, this paper proposes an audiovisual speech recognition (AVSR) approach based on LSTM-Transformer. The approach introduces visual modality information, including lip movements, to reduce the acoustic model's dependence on speakers and on the quantity of data. Specifically, by fusing audio and visual information, the approach enriches the representation of the speakers' feature space, thereby achieving speaker adaptation that is difficult in a single modality. The paper also reports experiments on speaker dependence that evaluate to what extent audiovisual fusion depends on speakers. Experimental results show that the character error rate (CER) of AVSR is 16.9% lower than that of traditional models (in the optimal performance scenario) and 11.8% lower than that of lip reading. Phoneme recognition accuracy, especially for finals, improves substantially. For initials, accuracy improves for affricates and fricatives, where lip movements are obvious, and deteriorates for stops, where lip movements are not obvious. AVSR also generalizes better across speakers than either single modality, with the CER dropping by as much as 17.2%. AVSR is therefore of great significance for research on protecting and preserving endangered languages through AI.
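
The results above are reported as CER. As a reading aid, here is a minimal sketch of how a character error rate is commonly computed: the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are illustrative assumptions, not the authors' code. The abstract also does not state whether the reported reductions are absolute or relative, so the sketch covers only the metric itself.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (hypothetical helper)."""
    # prev[j] is the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("abcd", "abxd")` counts one substitution against a four-character reference. The same function works on lists of phoneme labels, which would be one way to score initials and finals separately.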

A General Procedure for Improving Language Models in Low-Resource Speech Recognition

Multilingual Transformer Language Model for Speech Recognition in Low-resource Languages

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Improvement of Acoustic Models Fused with Lip Visual Information for Low-Resource Speech

Initial Decoding with Minimally Augmented Language Model for Improved Lattice Rescoring in Low Resource ASR

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition

Improved Meta Learning for Low Resource Speech Recognition

LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition

Advanced Recurrent Network-Based Hybrid Acoustic Models for Low Resource Speech Recognition

Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model

Improved acoustic word embeddings for zero-resource languages using multilingual transfer

Acoustic Modeling Based on Deep Learning for Low-Resource Speech Recognition: An Overview

Improving noisy student training for low-resource languages in End-to-End ASR using CycleGAN and inter-domain losses

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

Semi-Supervised Transfer Learning for Language Expansion of End-to-End Speech Recognition Models to Low-Resource Languages

Multilingual Meta-Transfer Learning for Low-Resource Speech Recognition

Cross-Lingual and Ensemble MLPs Strategies for Low-Resource Speech Recognition

Exploiting foreign resources for DNN-based ASR

Regularization Advantages of Multilingual Neural Language Models for Low Resource Domains