Multi-scale Feature Based Convolutional Neural Networks for Large Vocabulary Speech Recognition

Tong Fu, Xihong Wu
DOI: https://doi.org/10.1109/icme.2017.8019385
2017-01-01
Abstract: Deep learning has brought a breakthrough in speech recognition performance. Speech recognition systems based on deep neural networks have achieved state-of-the-art performance on various speech recognition tasks. Almost all of these systems use Mel-frequency cepstral coefficients or Mel-scale log-filterbank coefficients, both of which are derived from the short-time Fourier transform. Although these features are designed around the characteristics of human hearing, spectral representations based on the short-time Fourier transform still suffer from an inherent tradeoff between temporal and frequency resolution. In this paper, we propose a multi-scale method to mitigate this tradeoff and a model architecture that analyzes speech at multiple scales. Experiments are conducted on the TIMIT and HKUST corpora. We compare the proposed multi-scale features with traditional features across a range of configurations. Experimental results show that the proposed model architecture yields significant performance improvements.
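The resolution tradeoff mentioned in the abstract can be illustrated with a minimal sketch: a short analysis window localizes events well in time but smears frequency, while a long window does the opposite, so computing spectrograms at several window lengths gives a model access to both views. The code below is an assumption-laden illustration, not the authors' method: the function names (`stft_mag`, `multi_scale_spectrograms`), the window lengths (10, 25, 50 ms), the 10 ms hop, and the 16 kHz sample rate are all hypothetical choices, and frames are aligned at their start rather than centered.

```python
import numpy as np

def stft_mag(signal, win_len, hop, n_fft):
    """Magnitude spectrogram using a Hann window; each frame is zero-padded to n_fft."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

def multi_scale_spectrograms(signal, sample_rate=16000,
                             win_ms=(10, 25, 50), hop_ms=10):
    """Stack spectrograms computed with several window lengths (hypothetical scales).

    A shared hop and a shared FFT size keep the frame and bin axes aligned,
    so the result has shape (n_scales, n_frames, n_bins).
    """
    hop = sample_rate * hop_ms // 1000
    win_lens = [sample_rate * ms // 1000 for ms in win_ms]
    # Next power of two at or above the longest window.
    n_fft = 1 << (max(win_lens) - 1).bit_length()
    specs = [stft_mag(signal, w, hop, n_fft) for w in win_lens]
    # Longer windows yield fewer frames; truncate all scales to the shortest.
    n_frames = min(s.shape[0] for s in specs)
    return np.stack([s[:n_frames] for s in specs])

# Usage: one second of noise at 16 kHz stands in for a speech signal.
signal = np.random.default_rng(0).standard_normal(16000)
features = multi_scale_spectrograms(signal)
print(features.shape)
```

In a real front end, each scale would typically pass through a Mel filterbank and a log before being fed to the network as a separate input channel; those steps are omitted here to keep the sketch focused on the window-length choice.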