Abstract: Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to understand the intricate relations between audio and visual cues within videos. The overarching goal is to leverage insights from seen classes to identify instances from previously unseen ones. Prior approaches primarily utilized synchronized auto-encoders to reconstruct audio-visual attributes, which were informed by cross-attention transformers and projected text embeddings. However, these methods fell short of effectively capturing the intricate relationship between cross-modal features and class-label embeddings inherent in pre-trained language-aligned embeddings. To circumvent these bottlenecks, we introduce a simple yet effective framework for Easy Audio-Visual Generalized Zero-shot Learning, named EZ-AVGZL, that aligns audio-visual embeddings with transformed text representations. It utilizes a single supervised text audio-visual contrastive loss to learn an alignment between audio-visual and textual modalities, moving away from the conventional approach of reconstructing cross-modal features and text embeddings. Our key insight is that while class name embeddings are well aligned with language-based audio-visual features, they do not provide sufficient class separation to be useful for zero-shot learning. To address this, our method leverages differential optimization to transform class embeddings into a more discriminative space while preserving the semantic structure of language representations. We conduct extensive experiments on the VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks. Our results demonstrate that EZ-AVGZL achieves state-of-the-art performance in audio-visual generalized zero-shot learning.
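To make the alignment objective more concrete, the following is a minimal sketch of one plausible form of a supervised text audio-visual contrastive loss: a softmax cross-entropy over cosine similarities between each audio-visual embedding and every class text embedding. The abstract does not specify the exact formulation, so the function names, the temperature value, and the use of cosine similarity are all assumptions made for illustration, and the transformation of class embeddings via differential optimization is not modeled here.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def text_av_contrastive_loss(av_embeddings, class_text_embeddings, labels, temperature=0.1):
    """Hypothetical supervised contrastive loss (an assumption, not the paper's exact loss).

    Each audio-visual embedding is pulled toward the text embedding of its own
    class and pushed away from the text embeddings of all other classes.
    """
    text = [l2_normalize(t) for t in class_text_embeddings]
    total = 0.0
    for av, label in zip(av_embeddings, labels):
        av = l2_normalize(av)
        # Temperature-scaled cosine similarity to every class text embedding.
        logits = [sum(a * b for a, b in zip(av, t)) / temperature for t in text]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[label]
    return total / len(av_embeddings)


def predict(av_embedding, class_text_embeddings):
    """Classify by nearest class text embedding; works for seen and unseen class names alike."""
    av = l2_normalize(av_embedding)
    scores = [
        sum(a * b for a, b in zip(av, l2_normalize(t)))
        for t in class_text_embeddings
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

Under this sketch, zero-shot inference needs only the text embeddings of the candidate class names, which is why the separation between class embeddings matters: if two class embeddings are nearly parallel, `predict` cannot distinguish them reliably.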

Audio-visual Generalized Zero-shot Learning the Easy Way

Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language

Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models

Information Bottleneck Constrained Latent Bidirectional Embedding for Zero-Shot Learning

Out-Of-Distribution Detection for Audio-visual Generalized Zero-Shot Learning: A General Framework

On the Transferability of Visual Features in Generalized Zero-Shot Learning

Semantics Disentangling for Generalized Zero-Shot Learning

Zero-VAE-GAN: Generating Unseen Features for Generalized and Transductive Zero-Shot Learning

Multi-modal Generative Adversarial Network for Zero-Shot Learning

Visual-Semantic Aligned Bidirectional Network for Zero-Shot Learning

Estimation of Near-Instance-Level Attribute Bottleneck for Zero-Shot Learning

Learn More from Less: Generalized Zero-Shot Learning with Severely Limited Labeled Data

Improving generalized zero-shot learning via cluster-based semantic disentangling representation

Visual and Semantic Prototypes-Jointly Guided CNN for Generalized Zero-shot Learning

Investigating the Bilateral Connections in Generative Zero-Shot Learning

ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning

Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos