Audio-Visual Generalized Few-Shot Learning with Prototype-Based Co-Adaptation

Yi-Kai Zhang, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan
DOI: https://doi.org/10.21437/interspeech.2022-652
2022-01-01
Abstract: Although deep learning-based audio-visual speech recognition (AVSR) systems recognize base closed-set categories well, extending them to additional novel categories with limited labeled training data is challenging because the model easily overfits. In this paper, we propose Prototype-based Co-Adaptation with Transformer (PROTO-CAT), a multi-modal generalized few-shot learning (GFSL) method for AVSR systems. PROTO-CAT learns to recognize multi-modal objects from novel classes given only few-shot training data, while retaining its ability on the base closed-set categories. The main idea is to transform the prototypes (i.e., class centers) by incorporating cross-modality complementary information and calibrating cross-category semantic differences. In particular, PROTO-CAT co-adapts the embeddings at both the audio-visual and category levels, so that it generalizes its predictions to all categories dynamically. PROTO-CAT achieves state-of-the-art performance on various AVSR-GFSL benchmarks. The code is available at https://github.com/ZhangYikaii/Proto-CAT.
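
To make the notion of prototypes (class centers) concrete, the following is a minimal sketch of plain nearest-prototype classification over base and novel classes together. It is not PROTO-CAT itself: the Transformer-based co-adaptation described in the abstract is omitted, the concatenation in `fuse` is a stand-in assumption for combining the audio and visual embeddings, and all function names, labels, and toy vectors are hypothetical.

```python
from math import dist


def fuse(audio_vec, visual_vec):
    """Hypothetical fusion step: concatenate audio and visual embeddings.

    This is an assumption for illustration, not the co-adaptation in the paper.
    """
    return list(audio_vec) + list(visual_vec)


def class_prototypes(embeddings, labels):
    """Average the embeddings of each class to obtain its prototype (class center)."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(query, prototypes):
    """Return the label whose prototype is nearest to the query (Euclidean distance)."""
    return min(prototypes, key=lambda label: dist(query, prototypes[label]))


# Toy data (made up): two base classes with two samples each, one novel class with one shot.
base_embeddings = [
    fuse([0.0, 0.1], [0.0, 0.0]),
    fuse([0.1, 0.0], [0.0, 0.1]),
    fuse([1.0, 0.9], [1.0, 1.0]),
    fuse([0.9, 1.0], [1.0, 0.9]),
]
base_labels = ["base_a", "base_a", "base_b", "base_b"]
novel_embeddings = [fuse([0.0, 1.0], [1.0, 0.0])]
novel_labels = ["novel_c"]

# Generalized few-shot setting: base and novel prototypes share one label space.
prototypes = class_prototypes(base_embeddings, base_labels)
prototypes.update(class_prototypes(novel_embeddings, novel_labels))

print(predict(fuse([0.05, 0.95], [0.9, 0.1]), prototypes))
print(predict(fuse([0.05, 0.05], [0.0, 0.0]), prototypes))
```

In this sketch the novel prototype is built from a single example and left untransformed; the abstract describes PROTO-CAT as instead transforming the prototypes using cross-modality and cross-category information.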