Abstract: Automatic speech recognition (ASR) systems are often built using scene-related speech data because transmission channels and sampling rates vary widely across scenarios. In this study, we propose a general framework that establishes a unified model for diversified speech data with different sampling rates and channels. The framework jointly optimizes deep neural network (DNN)-based bandwidth expansion and acoustic modeling to exploit a large amount of diversified training data. First, we design two novel DNN architectures that map acoustic features from narrowband to wideband speech, one through direct mapping and one through progressive mapping. The learning targets of the direct mapping DNN (DNN-DM) are the acoustic features extracted from speech with the largest bandwidth, while the acoustic features from speech with all the other bandwidths are used as input. A progressive stacking network (PSN) gradually maps the features from the low sampling rates to the highest sampling rate through intermediate target layers trained with multitask learning. Then, in addition to these bandwidth expansion networks, we investigate several joint training strategies for DNN-based acoustic models. Our experiments on three diversified large-scale Mandarin speech datasets with different recording channels and sampling rates (6, 8, and 16 kHz) show that the proposed unified model using PSN for bandwidth expansion is not only more flexible and compact than the conventional design of multiple acoustic models, one per sampling rate, but also yields consistent and significant improvements over bandwidth-dependent models, with an average relative word error rate reduction of 6.2%. This indicates that the proposed model can fully utilize the diversified cross-channel speech data with multiple bandwidths.
Moreover, the proposed methods are verified to be robust in different realistic scenes and can be effectively extended to a long short-term memory framework.
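
The progressive mapping idea can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `ProgressiveStack`, the layer sizes, the weight `alpha`, the equal feature dimension across bands, and the availability of parallel features for the same frame at each sampling rate are all assumptions made for illustration. Two stacked stages map lowest-band features to a middle-band estimate (the intermediate target layer) and then to a widest-band estimate, and a multitask loss combines the errors at both target layers.

```python
import random

def linear(x, w, b):
    """Affine layer: one output per row of the weight matrix w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def mse(pred, target):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def init(n_out, n_in, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

class ProgressiveStack:
    """Two stacked stages: lowest band -> middle band -> widest band."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Each stage is a one-hidden-layer network (assumed size).
        self.stage1 = [init(hidden, dim, rng), init(dim, hidden, rng)]
        self.stage2 = [init(hidden, dim, rng), init(dim, hidden, rng)]

    @staticmethod
    def _run(stage, x):
        (w1, b1), (w2, b2) = stage
        return linear(relu(linear(x, w1, b1)), w2, b2)

    def forward(self, x_low):
        mid = self._run(self.stage1, x_low)  # intermediate target layer
        wide = self._run(self.stage2, mid)   # final target layer
        return mid, wide

def multitask_loss(model, x_low, y_mid, y_wide, alpha=0.5):
    """Weighted sum of the errors at the intermediate and final targets."""
    mid, wide = model.forward(x_low)
    return alpha * mse(mid, y_mid) + (1.0 - alpha) * mse(wide, y_wide)
```

In this sketch, `x_low`, `y_mid`, and `y_wide` would stand for features of the same frame at, for example, 6, 8, and 16 kHz. A direct mapping network would correspond to dropping the intermediate target (`alpha = 0`), so that only the widest-band error is minimized.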

AutoMode-ASR: Learning to Select ASR Systems for Better Quality and Cost

Master-ASR: Achieving Multilingual Scalability and Low-Resource Adaptation in ASR with Modular Learning

Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

Combining Frame-Synchronous and Label-Synchronous Systems for Speech Recognition

Optimizing Byte-level Representation for End-to-end ASR

Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR

Crossing language identification: Multilingual ASR framework based on semantic dataset creation & Wav2Vec 2.0

Consistency Based Unsupervised Self-training For ASR Personalisation

Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

Mixed-Bandwidth Cross-Channel Speech Recognition Via Joint Optimization of DNN-Based Bandwidth Expansion and Acoustic Modeling

Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance

Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability

Anatomy of Industrial Scale Multilingual ASR

Implicit Enhancement of Target Speaker in Speaker-Adaptive ASR Through Efficient Joint Optimization

Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models

Wav2vec‐MoE: an Unsupervised Pre‐training and Adaptation Method for Multi‐accent ASR

Teach an all-rounder with experts in different domains