Abstract: Few-Shot Continual Learning (FSCL) aims to learn novel tasks incrementally from a few labeled samples while preserving previously acquired capabilities. Current FSCL methods, however, all target the class-incremental setting. Moreover, FSCL solutions are evaluated only by their cumulative performance on all encountered tasks, and no prior work explores their domain generalization ability. Domain generalization is a challenging yet practical task that aims to generalize beyond the training domains. In this paper, we set up a Generalized FSCL (GFSCL) protocol that covers both class- and domain-incremental situations together with a domain generalization assessment. First, we arrange two new benchmark datasets and protocols and provide detailed baselines for this unexplored configuration. We find that common continual learning methods generalize poorly to unseen domains and do not cope well with catastrophic forgetting in cross-incremental tasks. We therefore propose a rehearsal-free framework based on the Vision Transformer (ViT), named Contrastive Mixture of Adapters (CMoA). Because class increments and domain increments have different optimization targets, CMoA contains two parts: (1) For the class-incremental issue, a Mixture of Adapters (MoA) module is incorporated into ViT, and cosine similarity regularization and dynamic weighting are designed so that each adapter learns specific knowledge and concentrates on particular classes. (2) For the domain-related issues and domain-invariant representation learning, we alleviate the intra-class variation by prototype-calibrated contrastive learning. The code and protocols are available at https://github.com/yawencui/CMoA.
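
The abstract gives no formulas, so the following is only an illustrative sketch of the general idea behind part (1), not the authors' implementation. It assumes that "dynamic weighting" can be approximated by a softmax gate over adapter outputs, and that "cosine similarity regularization" penalizes the mean absolute pairwise cosine similarity between adapters so that they specialize. All function names (`mix_adapters`, `cosine_regularizer`) and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_adapters(adapter_outputs, gate_scores):
    """Weighted sum of adapter outputs (assumed stand-in for dynamic weighting)."""
    weights = softmax(gate_scores)
    dim = len(adapter_outputs[0])
    mixed = [sum(w * out[i] for w, out in zip(weights, adapter_outputs))
             for i in range(dim)]
    return mixed, weights

def cosine_regularizer(adapter_outputs):
    """Mean absolute pairwise cosine similarity; lower means more diverse adapters."""
    n = len(adapter_outputs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(abs(cosine(adapter_outputs[i], adapter_outputs[j]))
               for i, j in pairs) / len(pairs)

# Toy example: three hypothetical adapters producing 2-D features.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = mix_adapters(outputs, gate_scores=[0.0, 0.0, 0.0])
penalty = cosine_regularizer(outputs)
```

In a training loop, a penalty of this kind would typically be added to the classification loss with a small coefficient. The actual loss terms and weighting scheme used by CMoA are defined in the paper and the linked repository.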

Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation

Adaptive Parametric Prototype Learning for Cross-Domain Few-Shot Classification

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Few-shot Class-incremental Audio Classification Using Dynamically Expanded Classifier with Self-attention Modified Prototypes

Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language

Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

Few-shot Class-incremental Audio Classification Using Adaptively-refined Prototypes

Temporal–Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning

Unsupervised Prototype Adapter for Vision-Language Models

Generalized Few-Shot Continual Learning with Contrastive Mixture of Adapters

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Learning Embedding Adaptation for Few-Shot Learning

Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning

Few-Shot Learning via Embedding Adaptation With Set-to-Set Functions

CATNet: Cross-modal fusion for audio-visual speech recognition

Hybrid Consistency Training with Prototype Adaptation for Few-Shot Learning

AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

Mind the Gap Between Prototypes and Images in Cross-domain Finetuning

Improved prototypical network for active few-shot learning