Abstract: Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to understand the intricate relations between audio and visual cues within videos. The overarching goal is to leverage insights from seen classes to identify instances from previously unseen ones. Prior approaches primarily utilized synchronized auto-encoders to reconstruct audio-visual attributes, which were informed by cross-attention transformers and projected text embeddings. However, these methods fell short of effectively capturing the intricate relationship between cross-modal features and class-label embeddings inherent in pre-trained language-aligned embeddings. To circumvent these bottlenecks, we introduce a simple yet effective framework for Easy Audio-Visual Generalized Zero-shot Learning, named EZ-AVGZL, that aligns audio-visual embeddings with transformed text representations. It utilizes a single supervised text audio-visual contrastive loss to learn an alignment between audio-visual and textual modalities, moving away from the conventional approach of reconstructing cross-modal features and text embeddings. Our key insight is that while class name embeddings are well aligned with language-based audio-visual features, they do not provide sufficient class separation to be useful for zero-shot learning. To address this, our method leverages differentiable optimization to transform class embeddings into a more discriminative space while preserving the semantic structure of language representations. We conduct extensive experiments on the VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks. Our results demonstrate that EZ-AVGZL achieves state-of-the-art performance in audio-visual generalized zero-shot learning.
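The abstract does not spell out the loss, so the following is only a minimal sketch of what aligning audio-visual embeddings with class text embeddings through a supervised contrastive objective could look like. It assumes a cross-entropy over temperature-scaled cosine similarities; the function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarities(av_embedding, class_text_embeddings):
    """Cosine similarity between one audio-visual embedding and every class embedding."""
    z = l2_normalize(av_embedding)
    return [
        sum(a * b for a, b in zip(z, l2_normalize(t)))
        for t in class_text_embeddings
    ]


def contrastive_alignment_loss(av_embeddings, labels, class_text_embeddings,
                               temperature=0.1):
    """Mean cross-entropy over temperature-scaled cosine similarities.

    Assumed form only: each audio-visual embedding is pulled toward the text
    embedding of its own class and pushed away from the other classes.
    """
    total = 0.0
    for embedding, label in zip(av_embeddings, labels):
        sims = cosine_similarities(embedding, class_text_embeddings)
        logits = [s / temperature for s in sims]
        # Numerically stable log-sum-exp.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_norm - logits[label]
    return total / len(av_embeddings)


def predict(av_embedding, class_text_embeddings):
    """Return the index of the class whose text embedding is most similar."""
    sims = cosine_similarities(av_embedding, class_text_embeddings)
    return max(range(len(sims)), key=sims.__getitem__)


# Toy data: two seen classes plus one class treated as unseen (hypothetical values).
seen_classes = [[1.0, 0.0], [0.0, 1.0]]
all_classes = seen_classes + [[1.0, 1.0]]
av_batch = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = contrastive_alignment_loss(av_batch, [0, 1], seen_classes)
misaligned_loss = contrastive_alignment_loss(av_batch, [1, 0], seen_classes)
unseen_prediction = predict([1.0, 1.1], all_classes)
```

In this toy setup the loss is small when each audio-visual embedding points at its own class embedding and large when the labels are swapped. At test time the nearest class embedding is chosen from the seen and unseen classes together, which is the generalized zero-shot setting the abstract describes.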

Indirect visual–semantic alignment for generalized zero-shot recognition

Joint Learning of Attended Zero-Shot Features and Visual-Semantic Mapping

Generating Manifold-Aligned Semantic Feature for Zero-Shot Learning

Contrastive Visual Feature Filtering for Generalized Zero-Shot Learning

Cluster-based Contrastive Disentangling for Generalized Zero-Shot Learning

Dual Collaborative Visual-Semantic Mapping for Multi-Label Zero-Shot Image Recognition

Zero-Shot Recognition Using Dual Visual-Semantic Mapping Paths

Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning

Manifold Embedding for Zero-Shot Recognition

Multi-modal Generative Adversarial Network for Zero-Shot Learning

Learning discriminative visual semantic embedding for zero-shot recognition

Manifold Regularized Cross-Modal Embedding for Zero-Shot Learning

Semantics Disentangling for Generalized Zero-Shot Learning

Semantic Softmax Loss for Zero-Shot Learning

An Inverse Mapping With Manifold Alignment For Zero-Shot Learning

Visual-Semantic Aligned Bidirectional Network for Zero-Shot Learning

VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning

Learning complementary semantic information for zero-shot recognition

Visual and Semantic Prototypes-Jointly Guided CNN for Generalized Zero-shot Learning

A Novel Perspective to Zero-shot Learning: Towards an Alignment of Manifold Structures via Semantic Feature Expansion

Audio-visual Generalized Zero-shot Learning the Easy Way