Abstract: Environment Sound Classification has been a well-studied research problem in the field of signal processing and up till now more focus has been laid on fully supervised approaches. Over the last few years, focus has moved towards semi-supervised methods which concentrate on the utilization of unlabeled data, and self-supervised methods which learn the intermediate representation through pretext task or contrastive learning. However, both approaches require a vast amount of unlabelled data to improve performance. In this work, we propose a novel framework called Environmental Sound Classification with Hierarchical Ontology-guided semi-supervised Learning (ECHO) that utilizes label ontology-based hierarchy to learn semantic representation by defining a novel pretext task. In the pretext task, the model tries to predict coarse labels defined by the Large Language Model (LLM) based on ground truth label ontology. The trained model is further fine-tuned in a supervised way to predict the actual task. Our proposed novel semi-supervised framework achieves an accuracy improvement in the range of 1% to 8% over baseline systems across three datasets namely UrbanSound8K, ESC-10, and ESC-50.

What problem does this paper attempt to address?

This paper addresses the dependence on large-scale labeled data in Environmental Sound Classification (ESC). Traditional deep-learning methods perform well on ESC tasks, but they need a large amount of labeled data to reach optimal performance, and collecting that data is both time-consuming and labor-intensive. Existing semi-supervised and self-supervised learning methods can use unlabeled data to improve performance, but they in turn require a large amount of unlabeled data.

To address these problems, the authors propose a new framework, Environmental Sound Classification with Hierarchical Ontology-guided Semi-supervised Learning (ECHO). The main innovations of this framework are as follows:

1. **Utilizing the label ontology hierarchy**: By defining a new pretext task, the model learns meaningful representations from the implicit relationships between existing labels (such as semantic similarity and category similarity), without relying on additional unlabeled data.
2. **Automatically generating coarse-grained labels**: Prompting a large language model (LLM) with the ontology knowledge of the existing labels automatically produces coarse-grained labels, so no additional manual annotation is needed for the pretext task.
3. **Two-stage learning framework**: In the pre-training stage, high-level semantic representations are learned by predicting the coarse-grained labels. In the fine-tuning stage, the learned representations are transferred to the target classification task to improve the final classification performance.

With this approach, ECHO improves classification accuracy on three benchmark datasets (UrbanSound8K, ESC-10, and ESC-50) by 1% to 8% over the baseline systems.
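The two-stage setup described above can be illustrated with a minimal sketch of how the training targets for each stage might be derived. The class names are the ten UrbanSound8K classes, but the coarse grouping is hand-written for illustration only: in ECHO it is produced by an LLM, and the grouping, the group names, and the helper functions `make_index` and `build_targets` below are assumptions, not the paper's actual output or code.

```python
# Hypothetical fine-to-coarse grouping for the UrbanSound8K classes.
# In ECHO the grouping comes from an LLM prompt; this one is hand-written
# purely for illustration and is NOT taken from the paper.
FINE_TO_COARSE = {
    "air_conditioner": "mechanical",
    "drilling": "mechanical",
    "engine_idling": "mechanical",
    "jackhammer": "mechanical",
    "car_horn": "alert",
    "siren": "alert",
    "gun_shot": "alert",
    "children_playing": "human_or_animal",
    "dog_bark": "human_or_animal",
    "street_music": "music",
}


def make_index(names):
    """Map each distinct name to a stable integer id (sorted order)."""
    return {name: i for i, name in enumerate(sorted(set(names)))}


def build_targets(fine_labels, fine_to_coarse):
    """Return (coarse_ids, fine_ids) for the two training stages.

    coarse_ids are the pretext-task targets (stage 1, pre-training);
    fine_ids are the ground-truth targets (stage 2, fine-tuning).
    """
    fine_index = make_index(fine_to_coarse.keys())
    coarse_index = make_index(fine_to_coarse.values())
    coarse_ids = [coarse_index[fine_to_coarse[f]] for f in fine_labels]
    fine_ids = [fine_index[f] for f in fine_labels]
    return coarse_ids, fine_ids


# Ground-truth labels of four example clips.
clip_labels = ["siren", "dog_bark", "drilling", "car_horn"]
coarse_ids, fine_ids = build_targets(clip_labels, FINE_TO_COARSE)
print(coarse_ids)  # stage-1 targets: "siren" and "car_horn" share a class
print(fine_ids)    # stage-2 targets: every clip keeps its own class
```

The point of the sketch is that both target sets come from the same labeled clips: the coarse targets need no extra data, only a mapping over the existing label set.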
### Formula Summary

The loss function mentioned in the paper is the cross-entropy loss, which is used for multi-class classification problems:

\[ H(y, \hat{y}) = -\frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{C} y_{ij} \log(\hat{y}_{ij}) \]

where:

- \( H(y, \hat{y}) \) is the cross-entropy loss function,
- \( y_{ij} \) is a binary indicator variable, indicating whether sample \( i \) belongs to category \( j \),
- \( \hat{y}_{ij} \) is the probability that the model predicts sample \( i \) belongs to category \( j \),
- \( N \) is the number of samples,
- \( C \) is the number of categories.

This loss function measures the difference between the model prediction and the true label, thereby guiding the learning process of the model.
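As a worked illustration of the formula above, here is a small pure-Python sketch that evaluates the cross-entropy for one-hot labels. The function name `cross_entropy` and the `eps` clamp (which guards against `log(0)`) are assumptions added for the example; they are not part of the paper.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over N samples and C categories.

    y_true: N rows of one-hot indicators y_ij.
    y_pred: N rows of predicted probabilities y_hat_ij.
    eps:    assumed numerical floor so log(0) is never evaluated.
    """
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, y_pred):
        for y, p in zip(y_row, p_row):
            if y:  # only the true category contributes for one-hot labels
                total += y * math.log(max(p, eps))
    return -total / n


# Two samples, three categories.
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = cross_entropy(y_true, y_pred)
print(round(loss, 4))  # -(ln 0.7 + ln 0.8) / 2, roughly 0.29
```

Because the labels are one-hot, the inner sum reduces to the log-probability assigned to the correct category, so a confident correct prediction drives the loss toward zero.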
