Hybrid Constant-Q Transform Based CNN Ensemble for Acoustic Scene Classification

Mou Wang, Rui Wang, Xiao-Lei Zhang, Susanto Rahardja
DOI: https://doi.org/10.1109/apsipaasc47483.2019.9023236
2019-01-01
Abstract: Acoustic scene classification (ASC) has attracted much attention in recent years. In previous studies, the most common architecture is a convolutional neural network (CNN) fed by one of three main features: log-mel energies, harmonic-percussive source separation (HPSS), and the constant-Q transform (CQT). In this paper, we present a hybrid constant-Q transform (HCQT) based CNN system for ASC. Specifically, we first extract CQT and HCQT from each audio clip as acoustic features, along with several other features such as Mel-frequency cepstral coefficients, log-mel energies, and their HPSS. We then feed each feature separately into a 5-layer or 9-layer CNN with average pooling. Because different features carry complementary information, we further develop several methods to integrate the outputs of the CNNs, including averaging, weighted averaging, random forests, and extremely randomized trees. To the best of our knowledge, this is the first time an HCQT-based method has been used for ASC. Essentially, the method combines two CQTs with different resolutions to remedy the shortcomings of the traditional CQT in its high-frequency bins. In addition, we thoroughly investigate different ensemble strategies for the CNN models. We evaluated the proposed system in the DCASE 2019 challenge. Experimental results show that HCQT is more effective than the conventional CQT. Furthermore, our system achieves accuracies of 77.5% and 79.3% on the validation and leaderboard datasets, respectively, significantly outperforming the two comparison baselines.
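The two simplest integration methods named in the abstract, averaging and weighted averaging of per-model class probabilities, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy probabilities, the three-model setup, and the weights are all hypothetical and chosen only for illustration.

```python
import numpy as np


def average_fusion(probs):
    """Average class probabilities over models.

    probs has shape (n_models, n_clips, n_classes).
    """
    return np.mean(probs, axis=0)


def weighted_average_fusion(probs, weights):
    """Weighted average over models; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Contract the model axis of probs against the weight vector.
    return np.tensordot(w, probs, axes=1)


# Hypothetical softmax outputs of three CNNs (for example, one per feature)
# for two clips and three scene classes. The numbers are made up.
probs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],
    [[0.7, 0.2, 0.1], [0.2, 0.2, 0.6]],
])

avg = average_fusion(probs)
# Hypothetical weights that favor the third model.
weighted = weighted_average_fusion(probs, [1.0, 1.0, 2.0])

avg_labels = avg.argmax(axis=1)
weighted_labels = weighted.argmax(axis=1)
```

In practice the weights would typically be tuned on held-out validation data. The tree-based ensembles mentioned in the abstract would instead be trained on the concatenated CNN outputs and are not covered by this sketch.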