Audio Recognition using Mel Spectrograms and Convolution Neural Networks

Jared Leitner, Samuel Thornton, Boyang Zhang
Abstract: Automatic sound recognition has received heightened research interest in recent years due to its many potential applications, including automatic labeling of video/audio content and real-time sound detection for robotics. While image classification is a heavily researched topic, sound identification is less mature. In this study, we take advantage of the robust machine learning techniques developed for image classification and apply them to the sound recognition problem. Raw audio data from the Freesound Dataset (FSD) provided by Kaggle is first converted to a spectrogram representation so that these image classification techniques can be applied. We test and compare two approaches using deep convolutional neural networks (CNNs): (1) our own CNN architecture and (2) transfer learning using the pre-trained VGG19 network. Using our self-developed architecture, we achieve a label-weighted label-ranking average precision (LWLARP) score of 0.813 and a top-5 accuracy of 88.9% when predicting 80 sound classes.
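The two reported metrics can be illustrated with a minimal pure-Python sketch. It is not the authors' evaluation code: the function names `lwlarp` and `top_k_accuracy` are hypothetical, ties in scores are broken by class index, and the top-k rule for clips with several labels (a clip counts as correct if any true label appears among the k highest-scoring classes) is an assumption, since the abstract does not define it.

```python
def lwlarp(y_true, y_score):
    """Label-weighted label-ranking average precision.

    y_true:  list of sets, each holding the true class indices of one clip.
    y_score: list of per-class score lists, one list per clip.
    Every (clip, true label) pair contributes one precision value, so classes
    are implicitly weighted by how often they occur as labels.
    """
    total_precision = 0.0
    total_labels = 0
    for true_labels, scores in zip(y_true, y_score):
        # Rank classes from highest to lowest score (ties: lower index first).
        ranking = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits = 0
        for rank, cls in enumerate(ranking, start=1):
            if cls in true_labels:
                hits += 1
                # Precision at the rank where this true label is retrieved.
                total_precision += hits / rank
        total_labels += len(true_labels)
    return total_precision / total_labels if total_labels else 0.0


def top_k_accuracy(y_true, y_score, k=5):
    """Fraction of clips with at least one true label among the top-k scores.

    Assumption: this "any true label" rule is one plausible reading of
    top-5 accuracy for multi-label clips.
    """
    correct = 0
    for true_labels, scores in zip(y_true, y_score):
        ranking = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if true_labels & set(ranking[:k]):
            correct += 1
    return correct / len(y_true) if y_true else 0.0


if __name__ == "__main__":
    # Toy example with 3 classes and 2 clips (made-up scores).
    y_true = [{0}, {1, 2}]
    y_score = [[0.9, 0.1, 0.0], [0.8, 0.7, 0.1]]
    print(lwlarp(y_true, y_score))
    print(top_k_accuracy(y_true, y_score, k=2))
```

In the toy example, the first clip's only label is ranked first (precision 1), while the second clip's two labels are retrieved at ranks 2 and 3 (precisions 1/2 and 2/3), so the score is the mean of those three values.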
Engineering, Computer Science