An Investigation on Multiscale Normalised Deep Scattering Spectrum with Deep Residual Network for Acoustic Scene Classification

Ye Li, C. Chin, Xing Yong Kek
DOI: https://doi.org/10.1109/SNPD51163.2021.9704888
2021-11-24
Abstract: This paper investigates how time scale affects the classification accuracy of log Mel-frequency coefficients (MFSC) and the deep scattering spectrum (DSS) for acoustic scene classification. Log Mel-frequency coefficients currently dominate most acoustic classification tasks, as observed in the DCASE challenge. However, they have two flaws. The first is the Heisenberg uncertainty property of the short-time Fourier transform: because the window size is fixed, high frequency resolution comes at the cost of poor time resolution, and vice versa. The second arises when mel filter banks are applied along the frequency axis, which loses information when the time scale exceeds 25 ms. To overcome this limitation, this paper explores the deep scattering spectrum with various window intervals. Following the current framework of combining log Mel-frequency coefficients with a convolutional neural network, we propose a two-stage convolutional neural network model. The two-stage model is designed to handle the large disparity in magnitude between the first- and second-order coefficients of the deep scattering spectrum. We also explore various feature normalization techniques applied directly to the input representation, thus allowing learning to occur. Finally, our experiments use the DCASE 2020 Task 1a dataset, which consists of acoustic recordings from various environments or scenes, and show that DSS has a slight advantage over MFSC, scoring 70.36% and 69.42%, respectively.
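The fixed-window trade-off mentioned in the abstract can be illustrated with a small arithmetic sketch. It is not taken from the paper: the function name `stft_resolution`, the 48 kHz sample rate, and the 100 ms and 500 ms window lengths are assumptions chosen only to give round numbers (the 25 ms value echoes the time scale named in the abstract). For a window of N samples, the spacing between STFT frequency bins is roughly the sample rate divided by N, so lengthening the window sharpens frequency resolution while coarsening time resolution.

```python
def stft_resolution(sample_rate_hz, window_ms):
    """Return (window_samples, time_resolution_ms, frequency_resolution_hz).

    Illustrative helper (hypothetical name, not from the paper).
    The frequency resolution is the spacing between STFT bins when the
    FFT size equals the window length.
    """
    window_samples = int(round(sample_rate_hz * window_ms / 1000))
    frequency_resolution_hz = sample_rate_hz / window_samples
    return window_samples, window_ms, frequency_resolution_hz


if __name__ == "__main__":
    # Assumed sample rate for illustration only.
    sample_rate_hz = 48000
    for window_ms in (25, 100, 500):
        n, dt_ms, df_hz = stft_resolution(sample_rate_hz, window_ms)
        print(f"window={dt_ms} ms ({n} samples) -> bin spacing {df_hz} Hz")
```

Under these assumptions, the product of window duration (in seconds) and bin spacing (in Hz) stays at 1, which is why a single fixed window cannot improve both resolutions at once.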
Computer Science, Engineering