Abstract: Most existing methods for audio classification assume that the vocabulary of audio classes to be classified is fixed. When novel (unseen) audio classes appear, audio classification systems need to be retrained with abundant labeled samples of all audio classes to recognize both base (initial) and novel audio classes. If novel audio classes continue to appear, the existing methods for audio classification become inefficient or even infeasible. In this work, we propose a method for few-shot class-incremental audio classification, which can continually recognize novel audio classes without forgetting old ones. The framework of our method mainly consists of two parts, an embedding extractor and a classifier, whose constructions are decoupled. The embedding extractor is the backbone of a ResNet-based network; it is trained using only samples of base audio classes and is frozen afterwards. The classifier, by contrast, consists of prototypes and is expanded in incremental sessions by a prototype adaptation network using few samples of novel audio classes. Both labeled support samples and unlabeled query samples are used to train the prototype adaptation network and update the classifier, since both are informative for audio classification. Three audio datasets, named NSynth-100, FSC-89 and LS-100, are built by choosing samples from the audio corpora NSynth, FSD-MIX-CLIP and LibriSpeech, respectively. Results show that our method outperforms baseline methods in terms of average accuracy and performance dropping rate. In addition, it is competitive with baseline methods in computational complexity and memory requirement. The code for our method is available at https://github.com/vinceasvp/FCAC.
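The decoupled design described in the abstract (a frozen embedding extractor plus a prototype classifier that grows in each incremental session) can be illustrated with a minimal sketch. This is not the authors' implementation: the prototype adaptation network, which also uses unlabeled query samples, is replaced here by plain class-mean prototypes, and the embeddings are random toy vectors standing in for the output of a frozen extractor. The names `PrototypeClassifier`, `add_classes` and `predict` are hypothetical and do not come from the FCAC repository.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


class PrototypeClassifier:
    """Nearest-prototype classifier that is expanded session by session.

    Assumption: each prototype is simply the mean embedding of a class.
    The paper instead refines prototypes with a prototype adaptation network.
    """

    def __init__(self, dim):
        self.labels = []                       # class ids in order of arrival
        self.prototypes = np.empty((0, dim))   # one row per known class

    def add_classes(self, embeddings, labels):
        """Append one prototype per new class; old prototypes stay untouched."""
        embeddings = np.asarray(embeddings, dtype=float)
        labels = np.asarray(labels)
        for c in np.unique(labels).tolist():
            if c in self.labels:
                raise ValueError(f"class {c!r} already has a prototype")
            proto = embeddings[labels == c].mean(axis=0, keepdims=True)
            self.prototypes = np.vstack([self.prototypes, proto])
            self.labels.append(c)

    def predict(self, embeddings):
        """Return the label of the most cosine-similar prototype per sample."""
        queries = l2_normalize(np.asarray(embeddings, dtype=float))
        sims = queries @ l2_normalize(self.prototypes).T
        return np.asarray(self.labels)[sims.argmax(axis=1)]


# Toy data: three well-separated class centers in an 8-dimensional space.
rng = np.random.default_rng(0)
centers = np.eye(3, 8) * 5.0

# Base session: abundant samples of classes 0 and 1.
base_y = np.repeat([0, 1], 20)
base_x = centers[base_y] + 0.1 * rng.standard_normal((40, 8))

# Incremental session: only five shots of the novel class 2.
novel_y = np.full(5, 2)
novel_x = centers[novel_y] + 0.1 * rng.standard_normal((5, 8))

clf = PrototypeClassifier(dim=8)
clf.add_classes(base_x, base_y)
base_protos = clf.prototypes.copy()
clf.add_classes(novel_x, novel_y)

predictions = clf.predict(centers)
print(predictions)
```

Because adding a class only appends a row to the prototype matrix, the prototypes of earlier classes are left as they were, which is the property that lets such a classifier accept novel classes without retraining on the base classes.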

An Improved Audio Classification Method Based on Parameter-Free Attention Combined with Self-Supervision

Utterance-Based Audio Sentiment Analysis Learned by a Parallel Combination of CNN and LSTM

Audio Sentiment Analysis by Heterogeneous Signal Features Learned from Utterance-Based Parallel Neural Network

Improving Acoustic Scene Classification Via Self-Supervised and Semi-Supervised Learning with Efficient Audio Transformer

Exploring the Power of Pure Attention Mechanisms in Blind Room Parameter Estimation

CAT: Causal Audio Transformer for Audio Classification

Audio xLSTMs: Learning Self-Supervised Audio Representations with xLSTMs

Input-independent Attention Weights Are Expressive Enough: A Study of Attention in Self-supervised Audio Transformers

ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification

Improving Multimodal Speech Enhancement by Incorporating Self-Supervised and Curriculum Learning

Hand-crafted Attention is All You Need? A Study of Attention on Self-supervised Audio Transformer

Dual-Branch Attention-In-Attention Transformer for Single-Channel Speech Enhancement

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

Improving Sequence-to-Sequence Acoustic Modeling by Adding Text-Supervision

Audio Mamba: Pretrained Audio State Space Model For Audio Tagging

Understanding Self-Attention of Self-Supervised Audio Transformers

An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

Few-shot Class-incremental Audio Classification Using Dynamically Expanded Classifier with Self-attention Modified Prototypes

Progressive Multi-scale Self-supervised Learning for Speech Recognition