Abstract: Automatic speech recognition (ASR) systems are often built using scene-related speech data because transmission channels and sampling rates vary widely across scenarios. In this study, we propose a general framework that establishes a unified model for diversified speech data with different sampling rates and channels. The framework jointly optimizes deep neural network (DNN)-based bandwidth expansion and acoustic modeling to exploit a large amount of diversified training data. First, we design two novel DNN architectures that map acoustic features from narrowband to wideband speech, one through direct mapping and one through progressive mapping. The learning targets of the direct mapping DNN (DNN-DM) are the acoustic features extracted from speech with the largest bandwidth, while the acoustic features from speech with all the other bandwidths are used as input. A progressive stacking network (PSN) gradually maps the features from the low sampling rates to the highest sampling rate through intermediate target layers trained via multitask learning. Then, in addition to these bandwidth expansion networks, we investigate several joint training strategies for DNN-based acoustic models. Our experiments were conducted on three diversified large-scale Mandarin speech datasets with different recording channels and sampling rates (6, 8, and 16 kHz). They show that the proposed unified model using PSN for bandwidth expansion is not only more flexible and compact than the conventional design of a separate acoustic model for each sampling rate, but also yields consistent and significant improvements over bandwidth-dependent models, with an average relative word error rate reduction of 6.2%. This indicates that the proposed model can fully utilize the diversified cross-channel speech data with multiple bandwidths. Moreover, the proposed methods are verified to be robust in different realistic scenes and can be effectively extended to a long short-term memory framework.
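
As an illustration of the progressive mapping idea, the following is a minimal sketch of a PSN-style forward pass with intermediate target layers and a weighted multitask loss. It is not the configuration used in the paper: the abstract does not specify layer sizes, activations, the loss function, or task weights, so the two-stage layout (6 kHz to 8 kHz to 16 kHz), the tanh hidden layers, the mean squared error, the shared feature dimension, and all names (`init_layer`, `psn_forward`, `multitask_mse`) are assumptions, and random arrays stand in for parallel acoustic features.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    # Small random weights and zero bias for one affine layer (assumed initialization).
    return {"W": rng.normal(0.0, 0.1, size=(n_in, n_out)), "b": np.zeros(n_out)}

def psn_forward(params, x):
    """Pass features through stacked stages and return every stage's output.

    Each stage is one tanh hidden layer followed by a linear target layer.
    The output of a stage is both an intermediate prediction and the input
    to the next stage.
    """
    outputs = []
    h = x
    for stage in params:
        hidden = np.tanh(h @ stage["hidden"]["W"] + stage["hidden"]["b"])
        h = hidden @ stage["target"]["W"] + stage["target"]["b"]
        outputs.append(h)
    return outputs

def multitask_mse(outputs, targets, weights):
    # Weighted sum of per-stage mean squared errors (assumed multitask objective).
    return sum(w * np.mean((o - t) ** 2) for o, t, w in zip(outputs, targets, weights))

rng = np.random.default_rng(0)
feat_dim, hidden_dim, n_frames = 40, 64, 8  # hypothetical sizes

# Two stages: 6 kHz features -> 8 kHz target -> 16 kHz target.
params = [
    {"hidden": init_layer(rng, feat_dim, hidden_dim),
     "target": init_layer(rng, hidden_dim, feat_dim)}
    for _ in range(2)
]

# Placeholder data standing in for time-aligned features at each sampling rate.
x_6k = rng.normal(size=(n_frames, feat_dim))
y_8k = rng.normal(size=(n_frames, feat_dim))
y_16k = rng.normal(size=(n_frames, feat_dim))

outputs = psn_forward(params, x_6k)
loss = multitask_mse(outputs, [y_8k, y_16k], weights=[0.5, 1.0])
print("stages:", len(outputs), "loss:", float(loss))
```

Under the same assumptions, a direct mapping variant would keep only the final target (the 16 kHz features) and drop the intermediate term from the loss. In a joint training setup, the last stage's output would feed the acoustic model rather than being an end in itself.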

Mixed-Bandwidth Cross-Channel Speech Recognition Via Joint Optimization of DNN-Based Bandwidth Expansion and Acoustic Modeling.
