Abstract: Automatic audio content recognition has attracted increasing attention in the development of multimedia systems. The most popular approaches combine frame-based features with statistical models or discriminative classifiers. Existing methods are effective for clean, single-source event detection but may perform poorly on unstructured environmental sounds, which have a broad, noise-like flat spectrum and highly varied compositions. We present an automatic acoustic scene understanding framework that detects audio events through two hierarchies: acoustic scene recognition and audio event recognition. The former proceeds by following dominant audio sources and in turn helps infer non-dominant audio events within the same scene by modeling their occurrence correlations. On the scene recognition hierarchy, we perform adaptive segmentation and feature extraction for every input acoustic scene stream through Eigen-audiospace and an optimized feature subspace, respectively. After background filtering, scene streams are recognized by modeling the observation density of dominant features with a two-level hidden Markov model. On the audio event recognition hierarchy, scene knowledge is characterized by an audio context model that describes the occurrence correlations of dominant and non-dominant audio events within the scene. Monte Carlo integration and gradient descent are employed to maximize the likelihood and correctly tag each audio event. To the best of our knowledge, this is the first work that models event correlations as scene context for robust audio event detection in complex and noisy environments. According to a recent report, human listeners achieve a mean accuracy of only about 71% on the acoustic scene classification task for data collected in office environments from the DCASE dataset. None of the existing methods performs well on all scene categories, and the average of the best accuracies of 11 recent methods is 53.8%. The proposed method achieves an average accuracy of 62.3% on the same dataset. Additionally, we create a dataset, 10-CASE, by manually collecting 5,250 audio clips covering 10 scene types and 21 event categories. Experimental results on 10-CASE show that the proposed method achieves an improved average accuracy of 78.3%, and that the average accuracy of audio event recognition can be effectively improved by capturing dominant audio sources and inferring non-dominant events from the dominant ones through acoustic context modeling. Future work should explore the interactions between acoustic scene recognition and audio event detection and incorporate other modalities to further improve the accuracy of the proposed framework.
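
The idea of using occurrence correlations as scene context can be illustrated with a small sketch. It is not the authors' model: the abstract describes likelihood maximization with Monte Carlo integration and gradient descent, whereas the sketch below substitutes smoothed co-occurrence counts and a linear blend of scores. The function names `cooccurrence_probs` and `rescore`, the `smoothing` and `weight` parameters, and the event labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_probs(scenes, smoothing=1.0):
    """Estimate P(event b present | event a present) from labeled scenes.

    `scenes` is a list of event-label lists, one list per scene.
    Add-`smoothing` (Laplace-style) smoothing keeps unseen pairs non-zero.
    """
    events = sorted({e for scene in scenes for e in scene})
    single = Counter()  # number of scenes containing each event
    pair = Counter()    # number of scenes containing each ordered pair
    for scene in scenes:
        present = set(scene)
        single.update(present)
        pair.update(permutations(present, 2))
    probs = {}
    for a in events:
        for b in events:
            if a != b:
                probs[(a, b)] = (pair[(a, b)] + smoothing) / (
                    single[a] + 2.0 * smoothing
                )
    return probs


def rescore(acoustic_scores, dominant_event, probs, weight=0.5):
    """Blend per-event acoustic scores with context from the dominant event.

    Assumption: a simple linear blend stands in for the likelihood
    maximization described in the abstract.
    """
    rescored = {}
    for event, score in acoustic_scores.items():
        # Fall back to an uninformative 0.5 for pairs never estimated.
        context = probs.get((dominant_event, event), 0.5)
        rescored[event] = (1.0 - weight) * score + weight * context
    return rescored


# Hypothetical training scenes and acoustic scores for one test scene.
scenes = [
    ["car_engine", "horn"],
    ["car_engine", "horn"],
    ["car_engine", "footsteps"],
    ["keyboard", "footsteps"],
]
probs = cooccurrence_probs(scenes)
scores = rescore({"horn": 0.45, "footsteps": 0.50}, "car_engine", probs)
print(max(scores, key=scores.get))
```

In this toy setting the acoustic score alone slightly favors `footsteps`, but once the dominant `car_engine` source is taken into account the context term shifts the decision toward `horn`, which is the kind of effect the abstract attributes to acoustic context modeling.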
