Abstract: Acoustic scene classification (ASC) aims to identify the type of scene (environment) in which a given audio signal was recorded. Log-mel features and convolutional neural networks (CNNs) have recently become the most popular time-frequency (TF) feature representation and classifier in ASC. An audio signal recorded in a scene may include various sounds that overlap in time and frequency. A previous study suggests that treating long-duration and short-duration sounds separately in the CNN may improve ASC accuracy. This study addresses the generalization ability of acoustic scene classifiers. In practice, the characteristics of acoustic scene signals may be affected by various factors, such as the choice of recording device and changes in recording location. When an established ASC system predicts scene classes for audio recorded in unseen scenarios, its accuracy may drop significantly. Long-duration sounds contain not only domain-independent acoustic scene information but also channel information determined by the recording conditions, to which a classifier is prone to over-fit. To obtain a more robust ASC system, we propose a robust feature learning (RFL) framework for training the CNN. The RFL framework down-weights CNN learning specifically on long-duration sounds. It does so by training an auxiliary classifier that takes only long-duration sound information as input. The auxiliary classifier is trained with an auxiliary loss function that assigns less learning weight to poorly classified examples than the standard cross-entropy loss does. Experimental results show that the proposed RFL framework yields an acoustic scene classifier that is more robust to unseen devices and cities.
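The abstract does not give the form of the auxiliary loss, so the following is only a minimal sketch of one way a loss could assign less learning weight to poorly classified examples than standard cross-entropy. The weighting `p_t ** gamma`, the parameter `gamma`, and all function names are assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy for one example: -log(p_t)."""
    return -math.log(softmax(logits)[target])


def example_weight(logits, target, gamma=1.0):
    """Hypothetical per-example weight p_t ** gamma.

    p_t is the probability given to the true class, so a poorly
    classified example (small p_t) receives a small weight.
    """
    return softmax(logits)[target] ** gamma


def down_weighted_loss(logits, target, gamma=1.0):
    """Assumed auxiliary loss: cross-entropy scaled by p_t ** gamma.

    With gamma = 0 the weight is 1 and this reduces to cross-entropy.
    """
    return example_weight(logits, target, gamma) * cross_entropy(logits, target)


# Two toy examples whose true class is index 0.
well_classified = [4.0, 0.0, 0.0]    # high score on the true class
poorly_classified = [0.0, 4.0, 0.0]  # high score on a wrong class

for name, logits in [("well", well_classified), ("poorly", poorly_classified)]:
    print(name,
          "weight:", example_weight(logits, 0),
          "CE:", cross_entropy(logits, 0),
          "weighted:", down_weighted_loss(logits, 0))
```

In this sketch the poorly classified example contributes far less than it would under plain cross-entropy, while the well-classified one is almost unchanged. How the auxiliary classifier's loss is combined with the main CNN's training objective is not stated in the abstract and is left out here.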

Temporal Transformer Networks for Acoustic Scene Classification

Constrained Learned Feature Extraction for Acoustic Scene Classification

Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification

Spatio-Temporal Attention Pooling for Audio Scene Classification

Temporal Transformer Networks with Self-Supervision for Action Recognition

Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions

Multi-modal Attention Mechanisms in LSTM and Its Application to Acoustic Scene Classification

Data Independent Sequence Augmentation Method for Acoustic Scene Classification

A Low-Compexity Deep Learning Framework For Acoustic Scene Classification

Multi-stream Network With Temporal Attention For Environmental Sound Classification

TSLANet: Rethinking Transformers for Time Series Representation Learning

A convolutional neural network approach for acoustic scene classification

Audio Time-Scale Modification with Temporal Compressing Networks

CT-SAT: Contextual Transformer for Sequential Audio Tagging

TempoFormer: A Transformer for Temporally-aware Representations in Change Detection

Ensemble Of Deep Neural Networks For Acoustic Scene Classification

Robust Feature Learning on Long-Duration Sounds for Acoustic Scene Classification

Explore Relative and Context Information with Transformer for Joint Acoustic Echo Cancellation and Speech Enhancement

Hierarchical learning for DNN-based acoustic scene classification

Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning

Long-term scalogram integrated with an iterative data augmentation scheme for acoustic scene classification