Abstract: Most existing methods for audio classification assume that the vocabulary of audio classes is fixed. When novel (unseen) audio classes appear, such systems must be retrained with abundant labeled samples of all audio classes in order to recognize both base (initial) and novel classes. If novel audio classes continue to appear, this retraining becomes inefficient and eventually infeasible. In this work, we propose a method for few-shot class-incremental audio classification, which can continually recognize novel audio classes without forgetting old ones. The framework of our method consists mainly of two parts, an embedding extractor and a classifier, whose constructions are decoupled. The embedding extractor is the backbone of a ResNet-based network; it is trained using only samples of base audio classes and then frozen. The classifier, by contrast, consists of prototypes and is expanded in each incremental session by a prototype adaptation network using a few samples of novel audio classes. Both labeled support samples and unlabeled query samples are used to train the prototype adaptation network and update the classifier, since both are informative for audio classification. Three audio datasets, named NSynth-100, FSC-89 and LS-100, are built by choosing samples from the audio corpora NSynth, FSD-MIX-CLIP and LibriSpeech, respectively. Results show that our method outperforms baseline methods in both average accuracy and performance dropping rate. In addition, it is competitive with baseline methods in computational complexity and memory requirement. The code for our method is available at https://github.com/vinceasvp/FCAC.
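
The decoupled design described in the abstract (a frozen embedding extractor plus a prototype classifier that grows as novel classes arrive) can be illustrated with a minimal sketch. This is not the authors' implementation: the learned prototype adaptation network is replaced here by a plain mean of the support embeddings, unlabeled query samples are not used, and the embeddings are assumed to have already been produced by some frozen extractor. The class name `PrototypeClassifier` and its methods are hypothetical, chosen only for illustration.

```python
import numpy as np


class PrototypeClassifier:
    """Nearest-prototype classifier that can be expanded one class at a time.

    Assumption: each prototype is the plain mean of a class's support
    embeddings. The paper refines prototypes with a learned adaptation
    network instead, which is not reproduced here.
    """

    def __init__(self):
        # Maps class label -> L2-normalized prototype vector.
        self.prototypes = {}

    def add_class(self, label, support_embeddings):
        """Register a class from a few support embeddings (few-shot)."""
        proto = np.asarray(support_embeddings, dtype=float).mean(axis=0)
        self.prototypes[label] = proto / np.linalg.norm(proto)

    def predict(self, embedding):
        """Return the label whose prototype has the highest cosine similarity."""
        x = np.asarray(embedding, dtype=float)
        x = x / np.linalg.norm(x)
        labels = list(self.prototypes)
        scores = [float(self.prototypes[label] @ x) for label in labels]
        return labels[int(np.argmax(scores))]


# Base session: two classes, toy 2-D embeddings standing in for extractor output.
clf = PrototypeClassifier()
clf.add_class("base_a", [[1.0, 0.0], [0.9, 0.1]])
clf.add_class("base_b", [[0.0, 1.0], [0.1, 0.9]])

# Incremental session: one novel class is added from two support samples,
# without touching the existing prototypes or the (frozen) extractor.
clf.add_class("novel_c", [[-1.0, 0.0], [-0.9, -0.1]])
```

Because adding a class only appends a prototype, the base-class prototypes are left unchanged in an incremental session, which is the property that lets this kind of classifier avoid retraining on all classes.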

A Multimodal Prototypical Approach for Unsupervised Sound Classification

Robust Audio Sensing with Multi-Sound Classification

Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping

A sound description: Exploring prompt templates and class descriptions to enhance zero-shot audio classification

Hybrid Attention-Based Prototypical Networks for Few-Shot Sound Classification

Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

Few-shot Class-incremental Audio Classification Using Dynamically Expanded Classifier with Self-attention Modified Prototypes

Few-shot Class-incremental Audio Classification Using Adaptively-refined Prototypes

Multimodal Speech Recognition Using EEG and Audio Signals: A Novel Approach for Enhancing ASR Systems

Unsupervised Improvement of Audio-Text Cross-Modal Representations

Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos

SoundCollage: Automated Discovery of New Classes in Audio Datasets

Multimodal Urban Sound Tagging with Spatiotemporal Context

Exploring modality-agnostic representations for music classification

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Deep Multimodal Clustering for Unsupervised Audiovisual Learning

Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

A novel hybrid ensemble approach to enhance the acoustic event classification in environmental sound analysis