Temporal Transformer Networks for Acoustic Scene Classification

Teng Zhang, Kailai Zhang, Ji Wu
DOI: https://doi.org/10.21437/interspeech.2018-1152
2018-01-01
Abstract: Neural networks have proven to be powerful models for acoustic scene classification, but they remain limited by their lack of temporal invariance to the audio data. In this paper, a novel temporal transformer module is proposed to allow the temporal manipulation of data in neural networks. The module is composed of a Fourier transform layer for feature maps and a learnable feature reduction layer, and can be inserted into existing convolutional neural network (CNN) and long short-term memory (LSTM) models. Experiments on the LITIS Rouen and DCASE2016 datasets show that the proposed method yields a significant improvement over existing neural networks. Our approach performs significantly better than the state-of-the-art result on the LITIS Rouen dataset, obtaining a 23.6% relative reduction in classification error.
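The abstract does not give the module's internals, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that the Fourier transform layer takes the DFT magnitude of each feature-map channel along the time axis, and that the feature reduction layer is a plain linear projection; the names `dft_magnitude`, `temporal_transformer`, `circular_shift`, and `weights` are hypothetical. It illustrates one well-known property that could motivate such a design: the DFT magnitude is unchanged by circular shifts in time.

```python
import cmath
import random


def dft_magnitude(signal):
    """Magnitude of the discrete Fourier transform of a 1-D sequence."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def temporal_transformer(feature_map, weights):
    """Hypothetical module: per-channel DFT magnitude, then a linear reduction.

    feature_map: list of channels, each a list of T values along time.
    weights: D rows of T coefficients, standing in for the learnable
             feature reduction layer (random here, not trained).
    """
    out = []
    for channel in feature_map:
        spectrum = dft_magnitude(channel)
        out.append([sum(w * s for w, s in zip(row, spectrum))
                    for row in weights])
    return out


def circular_shift(channel, k):
    """Shift a sequence k steps along time, wrapping around the end."""
    k %= len(channel)
    return channel[-k:] + channel[:-k]


random.seed(0)
T, D = 8, 3  # assumed sizes: 8 time steps reduced to 3 features
feature_map = [[random.random() for _ in range(T)] for _ in range(2)]
weights = [[random.gauss(0, 1) for _ in range(T)] for _ in range(D)]

shifted = [circular_shift(ch, 3) for ch in feature_map]
a = temporal_transformer(feature_map, weights)
b = temporal_transformer(shifted, weights)

# Largest difference between outputs for the original and shifted inputs.
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(max_diff)
```

The printed difference is at the level of floating-point rounding. Note that this invariance is exact only for circular shifts; how the paper handles non-circular shifts and where the module sits inside the CNN or LSTM are not stated in the abstract.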