Mixed-Bandwidth Cross-Channel Speech Recognition Via Joint Optimization of DNN-Based Bandwidth Expansion and Acoustic Modeling.

Jianqing Gao, Jun Du, Enhong Chen
DOI: https://doi.org/10.1109/taslp.2018.2886739
2018-01-01
IEEE/ACM Transactions on Audio Speech and Language Processing
Abstract: Automatic speech recognition (ASR) systems are often built using scene-related speech data due to large variations of transmission channels and sampling rates in different scenarios. In this study, we propose a general framework that establishes a unified model for diversified speech data with different sampling rates and channels. The framework is a joint optimization of deep neural network (DNN)-based bandwidth expansion and acoustic modeling to exploit a large amount of diversified training data. First, we design two novel DNN architectures to map the acoustic features from narrowband to wideband speech through direct mapping and progressive mapping. The learning targets of the direct mapping DNN (DNN-DM) are the acoustic features extracted from speech with the largest bandwidth, while the acoustic features from speech with all the other bandwidths are used as input. A progressive stacking network (PSN) gradually maps the features from the low sampling rates to the highest sampling rate through the design of intermediate target layers via multitask training. Then, in addition to these bandwidth expansion networks, we investigate several joint training strategies for DNN-based acoustic models. Our experiments conducted on three diversified large-scale Mandarin speech datasets with different recording channels and sampling rates (6, 8, and 16 kHz) show that the proposed unified model using PSN for bandwidth expansion is not only more flexible and compact than the conventional design of multiple acoustic models, one for each sampling rate, but also yields consistent and significant improvements over bandwidth-dependent models with an average relative word error rate reduction of 6.2%, indicating that the proposed model can fully utilize the diversified cross-channel speech data with multiple bandwidths.
Moreover, the proposed methods are verified to be robust on different realistic scenes and can be effectively extended to a long short-term memory framework.
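The progressive mapping idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the feature dimension, hidden width, single hidden layer per stage, tanh activation, mean-squared-error objective, and the weight `alpha` are all assumptions made for illustration, as are the names `psn_forward` and `multitask_loss`. The sketch also assumes that parallel 6, 8, and 16 kHz features exist for each frame. It shows only the structure the abstract describes: an intermediate target layer (8 kHz) between the lowest-rate input (6 kHz) and the highest-rate output (16 kHz), with both targets contributing to a multitask loss.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 40  # assumed per-frame feature dimension (hypothetical)
HIDDEN = 64    # assumed hidden-layer width (hypothetical)


def init_layer(n_in, n_out):
    """Return a small random weight matrix and a zero bias vector."""
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)


def hidden(x, layer):
    """Nonlinear hidden layer (tanh is an assumption)."""
    w, b = layer
    return np.tanh(x @ w + b)


def linear(x, layer):
    """Linear output layer used for feature regression."""
    w, b = layer
    return x @ w + b


# Stage 1 maps 6 kHz features to an 8 kHz estimate (intermediate target).
# Stage 2 maps that estimate to a 16 kHz estimate (final target).
params = {
    "h1": init_layer(FEAT_DIM, HIDDEN),
    "out_8k": init_layer(HIDDEN, FEAT_DIM),
    "h2": init_layer(FEAT_DIM, HIDDEN),
    "out_16k": init_layer(HIDDEN, FEAT_DIM),
}


def psn_forward(x_6k, params):
    """Progressively expand bandwidth: 6 kHz -> 8 kHz -> 16 kHz."""
    est_8k = linear(hidden(x_6k, params["h1"]), params["out_8k"])
    est_16k = linear(hidden(est_8k, params["h2"]), params["out_16k"])
    return est_8k, est_16k


def multitask_loss(est_8k, est_16k, tgt_8k, tgt_16k, alpha=0.5):
    """Final-target MSE plus a weighted intermediate-target MSE."""
    mse_8k = np.mean((est_8k - tgt_8k) ** 2)
    mse_16k = np.mean((est_16k - tgt_16k) ** 2)
    return alpha * mse_8k + mse_16k


if __name__ == "__main__":
    # Random stand-ins for parallel feature frames; not real speech data.
    x_6k = rng.normal(size=(8, FEAT_DIM))
    tgt_8k = rng.normal(size=(8, FEAT_DIM))
    tgt_16k = rng.normal(size=(8, FEAT_DIM))
    est_8k, est_16k = psn_forward(x_6k, params)
    print(multitask_loss(est_8k, est_16k, tgt_8k, tgt_16k))
```

Under the same assumptions, a direct mapping network would drop the intermediate output and regress the 16 kHz features from the narrowband input in a single stage. In the joint optimization described in the abstract, the expanded features would then feed an acoustic model trained together with the expansion network, which this sketch does not cover.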