ECHO: Environmental Sound Classification with Hierarchical Ontology-guided Semi-Supervised Learning

Pranav Gupta, Raunak Sharma, Rashmi Kumari, Sri Krishna Aditya, Shwetank Choudhary, Sumit Kumar, Kanchana M, Thilagavathy R
DOI: https://doi.org/10.1109/CONECCT62155.2024.10677303
2024-09-21
Abstract: Environment Sound Classification has been a well-studied research problem in the field of signal processing, and until now most of the focus has been on fully supervised approaches. Over the last few years, focus has moved towards semi-supervised methods, which concentrate on the utilization of unlabeled data, and self-supervised methods, which learn an intermediate representation through a pretext task or contrastive learning. However, both approaches require a vast amount of unlabeled data to improve performance. In this work, we propose a novel framework called Environmental Sound Classification with Hierarchical Ontology-guided semi-supervised Learning (ECHO) that utilizes a label ontology-based hierarchy to learn semantic representations by defining a novel pretext task. In the pretext task, the model tries to predict coarse labels defined by a Large Language Model (LLM) based on the ground truth label ontology. The trained model is further fine-tuned in a supervised way to predict the actual task. Our proposed novel semi-supervised framework achieves an accuracy improvement in the range of 1% to 8% over baseline systems across three datasets, namely UrbanSound8K, ESC-10, and ESC-50.
Sound, Computer Vision and Pattern Recognition, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the dependence on large-scale labeled data in Environmental Sound Classification (ESC). Traditional deep-learning methods perform well on ESC tasks, but they need a large amount of labeled data to reach their best performance, and collecting that data is both time-consuming and labor-intensive. Existing semi-supervised and self-supervised methods can use unlabeled data to improve performance, but they still require a large amount of it. To address these problems, the authors propose a new framework, Environmental Sound Classification with Hierarchical Ontology-guided Semi-Supervised Learning (ECHO). The main innovations of this framework are as follows:

1. **Utilizing the label ontology hierarchy**: By defining a new pretext task, the model learns meaningful representations from the implicit relationships between existing labels (such as semantic similarity and category similarity), without relying on additional unlabeled data.
2. **Automatically generating coarse-grained labels**: A large language model (LLM) is prompted to generate coarse-grained labels from the ontology knowledge of the existing labels, thereby reducing the dependence on large-scale labeled data.
3. **Two-stage learning framework**: In the pre-training stage, the model learns high-level semantic representations by predicting the coarse-grained labels. In the fine-tuning stage, the learned representations are transferred to the target classification task to improve the final classification performance.

With this method, the ECHO framework improves classification accuracy by 1% to 8% over the baseline systems on three benchmark datasets: UrbanSound8K, ESC-10, and ESC-50.
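The relationship between the two stages can be illustrated with a minimal Python sketch of how one set of fine labels yields targets for both the pretext task and the fine-tuning task. The ten fine labels are the UrbanSound8K classes. The coarse groups in `COARSE_OF` and the helpers `build_label_index` and `stage_targets` are hypothetical, chosen only for illustration; the paper obtains its coarse labels from an LLM prompt, and that grouping is not reproduced here.

```python
# Hypothetical coarse grouping of the ten UrbanSound8K classes.
# In the paper the grouping is produced by an LLM; this one is an
# illustrative assumption, not the authors' actual hierarchy.
COARSE_OF = {
    "air_conditioner": "mechanical",
    "engine_idling": "mechanical",
    "drilling": "construction",
    "jackhammer": "construction",
    "car_horn": "alert",
    "siren": "alert",
    "gun_shot": "alert",
    "dog_bark": "living",
    "children_playing": "living",
    "street_music": "music",
}


def build_label_index(names):
    """Map each distinct label name to an integer id (alphabetical order)."""
    return {name: i for i, name in enumerate(sorted(set(names)))}


FINE_INDEX = build_label_index(COARSE_OF.keys())
COARSE_INDEX = build_label_index(COARSE_OF.values())


def stage_targets(fine_labels):
    """Return (pretext_targets, finetune_targets) for a list of fine labels.

    Stage 1 (pretext) trains against the coarse ids.
    Stage 2 (fine-tuning) trains against the original fine ids.
    """
    pretext = [COARSE_INDEX[COARSE_OF[name]] for name in fine_labels]
    finetune = [FINE_INDEX[name] for name in fine_labels]
    return pretext, finetune


if __name__ == "__main__":
    coarse_ids, fine_ids = stage_targets(["siren", "dog_bark", "jackhammer"])
    print("pretext targets:  ", coarse_ids)
    print("fine-tune targets:", fine_ids)
```

Both target lists come from the same labeled clips, which is why the pretext stage needs no additional unlabeled data: only the label space changes between stages.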
### Formula Summary

The loss function mentioned in the paper is the cross-entropy loss, which is used for multi-class classification problems:

\[ H(y, \hat{y}) = -\frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{C} y_{ij} \log(\hat{y}_{ij}) \]

where:

- \( H(y, \hat{y}) \) is the cross-entropy loss function,
- \( y_{ij} \) is a binary indicator variable, indicating whether sample \( i \) belongs to category \( j \),
- \( \hat{y}_{ij} \) is the probability that the model predicts sample \( i \) belongs to category \( j \),
- \( N \) is the number of samples,
- \( C \) is the number of categories.

This loss measures the difference between the model's predicted distribution and the true label, and minimizing it guides the learning process of the model.
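As a worked instance of the cross-entropy formula, the following pure-Python sketch evaluates \( H(y, \hat{y}) \) for one-hot targets. The function name `cross_entropy` and the clipping constant `eps` are assumptions added for numerical safety and are not taken from the paper.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over N samples and C categories.

    y_true: N rows of one-hot indicators y_ij.
    y_pred: N rows of predicted probabilities y_hat_ij.
    eps:    lower clip on probabilities to avoid log(0) (an added assumption).
    """
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, y_pred):
        for y_ij, p_ij in zip(y_row, p_row):
            if y_ij:  # terms with y_ij == 0 contribute nothing to the sum
                total += y_ij * math.log(max(p_ij, eps))
    return -total / n


if __name__ == "__main__":
    y = [[1, 0, 0], [0, 1, 0]]
    y_hat = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
    print(cross_entropy(y, y_hat))
```

In this example each sample assigns probability 0.5 to its true class, so the loss equals \( \log 2 \). A perfect prediction (probability 1 on the true class) gives a loss of 0.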