Abstract: In this paper, we propose lightweight deep neural networks for Acoustic Scene Classification (ASC) and a visualization method for presenting a sound scene context. To this end, we first propose an inception-based ASC model with a low memory footprint as the ASC baseline. The ASC baseline is then compared with benchmark and high-complexity network architectures. Next, we improve the ASC baseline by proposing a novel deep neural network architecture that leverages a residual-inception architecture and multiple kernels. Given the novel residual-inception (NRI) based model, we apply multiple model compression techniques to evaluate the trade-off between model complexity and model accuracy. Finally, we evaluate whether sound events detected in a sound scene recording can help to improve ASC accuracy and to present the sound scene context more comprehensively. We conduct extensive experiments on various ASC datasets, including the sound scene datasets proposed for the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2018 Task 1A and 1B, 2019 Task 1A and 1B, 2020 Task 1A, 2021 Task 1A, and 2022 Task 1. Our experimental results on these ASC challenges highlight two main achievements. First, based on the analysis of the trade-off between model performance and model complexity, we propose two low-complexity ASC models: the medium-size model (MM) has 4.96 M trainable parameters, 19.3 MB memory occupation, and 7.12 BFLOPs; the small-size model (SM) has a very low complexity of 120 K trainable parameters, 120 KB memory occupation, and 0.82 BFLOPs. These ASC systems are very competitive with state-of-the-art systems and suitable for real-life applications on a wide range of edge devices. Second, from the analysis of the role of sound events in a sound scene, we propose an effective visualization method for comprehensively presenting a sound scene context. By combining sound scene and sound event information, the visualization method not only indicates the predicted sound scene contexts with high probabilities but also provides statistics of the sound events occurring in these sound scene contexts.
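
As a rough illustration of how a trainable-parameter count relates to memory occupation, the sketch below estimates weight storage from the number of parameters and an assumed bit width per parameter. The bit widths used here (32 bits for MM, 8 bits for SM) are assumptions chosen for illustration and are not stated in the abstract, and the function name `model_size_bytes` is hypothetical. The estimate covers weights only and ignores any other storage overhead, so it need not match the reported memory figures exactly.

```python
def model_size_bytes(num_params: int, bits_per_param: int) -> float:
    """Estimate the storage needed for the weights alone, in bytes."""
    return num_params * bits_per_param / 8


# Parameter counts come from the abstract; bit widths are ASSUMED for illustration.
models = [
    ("MM", 4_960_000, 32),  # assumed 32-bit floating-point weights
    ("SM", 120_000, 8),     # assumed 8-bit quantized weights
]

for name, params, bits in models:
    size_kb = model_size_bytes(params, bits) / 1000
    print(f"{name}: {params} parameters at {bits} bits is about {size_kb:.0f} kB")
```

Under these assumptions, halving the bit width halves the weight storage, which is one reason compression techniques such as quantization are attractive on edge devices.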
