Using Deep Belief Network to Capture Temporal Information for Audio Event Classification.

Feng Guo, Deshun Yang, Xiaoou Chen
DOI: https://doi.org/10.1109/iih-msp.2015.46
2015-01-01
Abstract: Audio event classification plays an important role in surveillance systems. Because of the constraints of the short-time Fourier transform (STFT), extracting frequency-domain features, an essential step in audio event classification, remains difficult on large audio frames. The traditional approach of concatenating the feature vectors of the successive windows within one large frame is unsatisfactory because the low-level audio representations carry redundant information. Temporal information is nevertheless very important for audio event classification. In this paper, we try to capture the underlying temporal information in audio events using a Deep Belief Network (DBN), extracting features over a long time span. For clarity, we call this kind of audio block an audio unit rather than an audio frame. The paper makes two main contributions. First, we segment the audio into units of different sizes and evaluate the classification performance of different features, both traditional features and DBN features, across unit sizes. Second, we present a method that merges audio features learned from multi-scale audio units to train a support vector machine (SVM) classifier. The classifier based on the merged DBN features outperforms classifiers based only on the unmerged DBN features or only on the traditional features.
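The multi-scale idea in the abstract (cut the audio into units of several sizes, extract a feature per unit, merge the features across scales) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify the unit sizes, the hop, or how features are merged, so the sketch simply concatenates the features of units that start at the same sample, and it uses a hypothetical `toy_features` function (mean energy and zero-crossing rate) as a stand-in for the learned DBN features.

```python
import math
from typing import Callable, List, Sequence


def segment_units(signal: Sequence[float], unit_size: int, hop: int) -> List[List[float]]:
    """Split a signal into fixed-size units, one every `hop` samples.

    Samples at the tail that do not fill a whole unit are dropped.
    """
    return [list(signal[s:s + unit_size])
            for s in range(0, len(signal) - unit_size + 1, hop)]


def toy_features(unit: Sequence[float]) -> List[float]:
    """Hypothetical placeholder for a DBN feature (NOT the paper's method).

    Returns [mean energy, zero-crossing rate]; expects at least 2 samples.
    """
    energy = sum(x * x for x in unit) / len(unit)
    crossings = sum(1 for a, b in zip(unit, unit[1:]) if (a < 0) != (b < 0))
    return [energy, crossings / (len(unit) - 1)]


def merge_multiscale(
    signal: Sequence[float],
    unit_sizes: Sequence[int],
    hop: int,
    extract: Callable[[Sequence[float]], List[float]] = toy_features,
) -> List[List[float]]:
    """Concatenate features of units of several sizes starting at the same sample.

    Assumed merging rule: plain concatenation, one merged vector per start
    position at which the largest unit still fits inside the signal.
    """
    largest = max(unit_sizes)
    merged = []
    for start in range(0, len(signal) - largest + 1, hop):
        vec: List[float] = []
        for size in unit_sizes:
            vec.extend(extract(signal[start:start + size]))
        merged.append(vec)
    return merged


# Usage: a synthetic 64-sample sine wave, three assumed unit sizes.
signal = [math.sin(0.3 * n) for n in range(64)]
merged = merge_multiscale(signal, unit_sizes=[8, 16, 32], hop=8)
```

Each merged vector here has two values per scale. In a fuller pipeline, vectors like these would be the input to an SVM classifier, with the placeholder extractor replaced by features learned by a DBN on each unit size.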