Abstract: Environmental sound classification (ESC) is a challenging problem due to the unstructured spatial-temporal relations that exist in sound signals. Recently, many studies have focused on abstracting features with convolutional neural networks, while the learning of semantically relevant frames of sound signals has been overlooked. To this end, we present an end-to-end framework, namely the feature pyramid attention network (FPAM), which focuses on abstracting the semantically relevant features for ESC. We first extract feature maps from the preprocessed spectrogram of the sound waveform with a backbone network. Then, to build multi-scale hierarchical features of sound spectrograms, we construct a feature pyramid representation by aggregating the feature maps from multi-scale layers, where the temporal frames and spatial locations of semantically relevant frames are localized by FPAM. Specifically, the multiple features are first processed by a dimension alignment module. Afterward, the pyramid spatial attention (PSA) module is attached to localize the important frequency regions spatially with a spatial attention module (SAM). Last, the processed feature maps are refined by a pyramid channel attention (PCA) module to localize the important temporal frames. To justify the effectiveness of the proposed FPAM, we visualize the attention maps on the spectrograms. The visualization results show that FPAM focuses more on the semantically relevant regions while neglecting noise. The effectiveness of the proposed method is validated on two widely used ESC datasets: ESC-50 and ESC-10. The experimental results show that FPAM yields performance comparable to state-of-the-art methods and a substantial improvement over the baseline methods.
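The pipeline outlined in the abstract (dimension alignment, then spatial attention, then channel attention over fused multi-scale feature maps) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`align`, `spatial_attention`, `channel_attention`, `pyramid_attention`), the tensor shapes, the random weights, the summation used for fusion, and the parameter-free pooling that stands in for learned convolutions are all assumptions made for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def align(feat, proj, scale):
    """Dimension alignment (assumed form): project channels like a 1x1
    convolution, then upsample by nearest-neighbour repetition.

    feat: (C_in, F, T) feature map, proj: (C_out, C_in), scale: int factor.
    """
    projected = np.einsum("oc,cft->oft", proj, feat)
    return projected.repeat(scale, axis=1).repeat(scale, axis=2)


def spatial_attention(feat):
    """Weight each (frequency, time) location by a mask in (0, 1).

    Assumption: the mask is built from channel-wise average and max pooling;
    a learned convolution would normally combine the two pooled maps.
    """
    avg_map = feat.mean(axis=0)                 # (F, T)
    max_map = feat.max(axis=0)                  # (F, T)
    mask = sigmoid(0.5 * (avg_map + max_map))   # (F, T)
    return feat * mask, mask                    # mask broadcasts over channels


def channel_attention(feat, w_down, w_up):
    """Squeeze-and-excitation style channel weighting (assumed form)."""
    squeezed = feat.mean(axis=(1, 2))               # (C,) global average pool
    hidden = np.maximum(w_down @ squeezed, 0.0)     # bottleneck with ReLU
    weights = sigmoid(w_up @ hidden)                # (C,) weights in (0, 1)
    return feat * weights[:, None, None], weights


def pyramid_attention(levels, projs, scales, w_down, w_up):
    """Align every pyramid level, apply spatial attention, fuse by summation
    (an assumption), then refine the fused map with channel attention."""
    aligned = [align(f, p, s) for f, p, s in zip(levels, projs, scales)]
    attended = [spatial_attention(a)[0] for a in aligned]
    fused = np.sum(attended, axis=0)
    return channel_attention(fused, w_down, w_up)


# Toy example: two backbone levels with different channel counts and resolutions.
rng = np.random.default_rng(0)
levels = [rng.standard_normal((8, 16, 16)), rng.standard_normal((16, 8, 8))]
projs = [rng.standard_normal((4, 8)), rng.standard_normal((4, 16))]
scales = [1, 2]                        # bring both levels to 16 x 16
w_down = rng.standard_normal((2, 4))   # hypothetical bottleneck weights
w_up = rng.standard_normal((4, 2))
refined, channel_weights = pyramid_attention(levels, projs, scales, w_down, w_up)
```

In this sketch `refined` has the common aligned shape (4 channels, 16 frequency bins, 16 frames) and `channel_weights` holds one weight per channel. In a trained network the projection and bottleneck weights would be learned rather than drawn at random.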

Acoustic scene classification by feed forward neural network with class dependent attention mechanism

High-Resolution Attention Network with Acoustic Segment Model for Acoustic Scene Classification

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network.

Multi-Modal Attention Mechanisms In Lstm And Its Application To Acoustic Scene Classification

Spatio-Temporal Attention Pooling for Audio Scene Classification

Frequency-based CNN and attention module for acoustic scene classification

A Hybrid Approach to Acoustic Scene Classification Based on Universal Acoustic Models.

Low-Complexity Acoustic Scene Classification Using Parallel Attention-Convolution Network

Attention based Convolutional Recurrent Neural Network for Environmental Sound Classification

Investigation of acoustic and visual features for acoustic scene classification

A Simple Fusion of Deep and Shallow Learning for Acoustic Scene Classification

Constrained Learned Feature Extraction for Acoustic Scene Classification

Feature Pyramid Attention based Residual Neural Network for Environmental Sound Classification

Multi-stream Network With Temporal Attention For Environmental Sound Classification

A Deep Neural Network for Audio Classification with a Classifier Attention Mechanism

Deep Neural Decision Forest for Acoustic Scene Classification

An Investigation of High-Resolution Modeling Units of Deep Neural Networks for Acoustic Scene Classification

Ensemble Of Deep Neural Networks For Acoustic Scene Classification

Adaptive Memory-Controlled Self-Attention for Polyphonic Sound Event Detection

A convolutional neural network approach for acoustic scene classification

Audio-visual scene recognition using attention-based graph convolutional model