Deep Learning Approaches for Understanding Simple Speech Commands

Roman A. Solovyev, Maxim Vakhrushev, Alexander Radionov, Irina I. Romanova, Aleksandr A. Amerikanov, Vladimir Aliev, Alexey A. Shvets
DOI: https://doi.org/10.1109/elnano50318.2020.9088863
2020-04-01
Abstract: Automatic classification of sound commands is becoming increasingly important, especially for embedded and mobile devices. Many of these devices contain both microphones and cameras, and the manufacturers that develop and produce them would like to use the same methodology for sound and image classification tasks. This can be achieved by representing sound commands as images and then using convolutional neural networks to classify images as well as sounds. In this research, we tried several approaches to the problem of sound classification, which we applied in the TensorFlow Speech Recognition Challenge organized by the Google Brain team on the Kaggle platform. Here we show different representations of sounds (wave frames, spectrograms, mel-spectrograms, MFCCs) and apply several 1D and 2D convolutional neural networks to get the best performance. As a novelty of our work, we developed and trained from scratch two 1D network architectures that are topologically similar to the 2D VGG and ResNet network types. These networks show performance similar to that of 2D networks when the sound signal is represented using mel-spectrograms. Our experiments identify an appropriate sound representation and corresponding convolutional neural networks. As a result, we achieved good classification accuracy (91.8%), which allowed us to finish the challenge in 8th place among 1315 teams.
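To make the idea of representing a sound command as an image more concrete, the following is a minimal sketch of a log mel-spectrogram computed with NumPy only. It is not the authors' pipeline: the sampling rate (16 kHz), FFT size, hop length, number of mel bands, and the helper names `mel_filterbank` and `log_mel_spectrogram` are all assumptions chosen for illustration, and the input is a synthetic tone rather than a recorded command.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style mel formula (one of several conventions).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose centers are evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, center, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Slice the 1D waveform into overlapping, Hann-windowed frames.
    n_frames = 1 + (len(wave) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)
    # Power spectrum per frame, then pool frequency bins into mel bands.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    # Return a 2D "image" of shape (n_mels, n_frames).
    return np.log(mel + 1e-10).T

# Synthetic one-second tones stand in for recorded commands (assumption).
sr = 16000
t = np.arange(sr) / sr
low = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr=sr)
high = log_mel_spectrogram(np.sin(2 * np.pi * 4000 * t), sr=sr)
print(low.shape, low.mean(axis=1).argmax(), high.mean(axis=1).argmax())
```

The resulting 2D array can be fed to a 2D convolutional network in the same way as a single-channel image, while the raw waveform or a sequence of per-frame feature vectors can be fed to a 1D network, which is the kind of comparison the abstract describes.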