Abstract: The variety and complexity of accents pose a major challenge to robust Automatic Speech Recognition (ASR). Previous work has attempted to address this problem; however, most current approaches either require prior knowledge about the target accent or cannot handle unseen accents and accent-unspecific standard speech. In this work, we aim to improve multi-accent speech recognition in the end-to-end (E2E) framework with a novel layer-wise adaptation architecture. First, we propose a robust deep accent representation learning architecture to obtain accurate accent embeddings, and we design several schemes to further improve their quality: phone posteriorgram (PPG) features and TTS-based data augmentation in the training stage, and test-time augmentation and multi-embedding fusion in the testing stage. Then, layer-wise adaptation with accent embeddings is developed for fast accent adaptation in ASR, with two types of adapter layers: the gated adapter layer and the multi-basis adapter layer. In contrast to the usual two-pass adaptation, these adapter layers are injected between the ASR encoder layers to encode accent information in ASR flexibly and to perform fast adaptation to the accent of the input speech. Experiments on the Accent AESRC corpus show that the proposed deep accent representation learning captures accurate accent knowledge and achieves high performance on accent classification. The new layer-wise adaptation architecture with the accurate accent embedding outperforms other traditional methods and obtains a consistent $\sim$15% relative word error rate (WER) reduction across all testing scenarios, including seen accents, unseen accents and accent-unspecific standard speech.
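The abstract names two adapter types inserted between encoder layers but does not give their equations. The following is a minimal NumPy sketch of one plausible reading, not the authors' formulation: the gated adapter is assumed to add an accent-dependent shift scaled by a sigmoid gate, and the multi-basis adapter is assumed to mix several basis projections with softmax weights predicted from the accent embedding. All function names, parameter names, and shapes (`gated_adapter`, `multi_basis_adapter`, `W_g`, `W_s`, `basis_W`, `W_a`) are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def gated_adapter(h, z, W_g, b_g, W_s, b_s):
    """Assumed gated adapter: residual shift scaled by a learned gate.

    h: (T, D) output of one encoder layer; z: (E,) utterance-level accent embedding.
    W_g: (D + E, D), b_g: (D,), W_s: (E, D), b_s: (D,).
    """
    T = h.shape[0]
    z_tiled = np.tile(z, (T, 1))                      # (T, E), same embedding per frame
    gate = sigmoid(np.concatenate([h, z_tiled], axis=1) @ W_g + b_g)  # (T, D)
    shift = z @ W_s + b_s                             # (D,), accent-dependent offset
    return h + gate * shift


def multi_basis_adapter(h, z, basis_W, basis_b, W_a, b_a):
    """Assumed multi-basis adapter: interpolate K basis projections.

    basis_W: (K, D, D), basis_b: (K, D), W_a: (E, K), b_a: (K,).
    """
    alpha = softmax(z @ W_a + b_a)                    # (K,), interpolation weights
    projected = np.einsum("td,kde->kte", h, basis_W) + basis_b[:, None, :]  # (K, T, D)
    return h + np.einsum("k,ktd->td", alpha, projected)


# Toy dimensions chosen only for illustration.
rng = np.random.default_rng(0)
T, D, E, K = 5, 8, 4, 3
h = rng.normal(size=(T, D))
z = rng.normal(size=E)

gated_out = gated_adapter(
    h, z, rng.normal(size=(D + E, D)), np.zeros(D), rng.normal(size=(E, D)), np.zeros(D)
)
basis_out = multi_basis_adapter(
    h, z, 0.1 * rng.normal(size=(K, D, D)), np.zeros((K, D)),
    rng.normal(size=(E, K)), np.zeros(K)
)
print(gated_out.shape, basis_out.shape)
```

Both functions keep the input shape, so under these assumptions either one could be placed between any two encoder layers without changing the rest of the model. With all adapter parameters set to zero, both reduce to the identity, which is one reason a residual form was chosen for this sketch.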

English Accent Accuracy Analysis in a State-of-the-Art Automatic Speech Recognition System

The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR

Foreign English Accent Adjustment by Learning Phonetic Patterns

CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice

Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

Global Performance Disparities Between English-Language Accents in Automatic Speech Recognition

Quantifying Bias in Automatic Speech Recognition

Learning Fast Adaptation on Cross-Accented Speech Recognition

Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models

Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech

Synthetic Cross-accent Data Augmentation for Automatic Speech Recognition

Improving Speech Recognition for African American English With Audio Classification

Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition

The Accented English Speech Recognition Challenge 2020: Open Datasets, Tracks, Baselines, Results and Methods

Investigating the Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes

Joint Modeling of Accents and Acoustics for Multi-Accent Speech Recognition

Improving Accented Speech Recognition with Multi-Domain Training

Linguistic-Acoustic Similarity Based Accent Shift for Accent Recognition

Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding by Provenance

End-to-End Multi-Accent Speech Recognition with Unsupervised Accent Modelling

Some voices are too common: Building fair speech recognition systems using the Common Voice dataset