Abstract: Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to understand the intricate relations between audio and visual cues within videos. The overarching goal is to leverage insights from seen classes to identify instances from previously unseen ones. Prior approaches primarily utilized synchronized auto-encoders to reconstruct audio-visual attributes, which were informed by cross-attention transformers and projected text embeddings. However, these methods fell short of effectively capturing the intricate relationship between cross-modal features and class-label embeddings inherent in pre-trained language-aligned embeddings. To circumvent these bottlenecks, we introduce a simple yet effective framework for Easy Audio-Visual Generalized Zero-shot Learning, named EZ-AVGZL, that aligns audio-visual embeddings with transformed text representations. It utilizes a single supervised text audio-visual contrastive loss to learn an alignment between audio-visual and textual modalities, moving away from the conventional approach of reconstructing cross-modal features and text embeddings. Our key insight is that while class name embeddings are well aligned with language-based audio-visual features, they don't provide sufficient class separation to be useful for zero-shot learning. To address this, our method leverages differential optimization to transform class embeddings into a more discriminative space while preserving the semantic structure of language representations. We conduct extensive experiments on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks. Our results demonstrate that our EZ-AVGZL achieves state-of-the-art performance in audio-visual generalized zero-shot learning.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses two key issues in Audio-Visual Generalized Zero-shot Learning (AVGZSL):

1. **Relationship between Cross-modal Features and Class Label Embeddings**: Prior synchronized auto-encoder methods reconstruct audio-visual attributes using cross-attention transformers and projected text embeddings, but they struggle to capture the complex relationship between cross-modal features and the class-label embeddings of pre-trained language-aligned models.
2. **Class Separation in Zero-shot Learning**: Although class name embeddings align well with language-based audio-visual features, they do not separate classes sufficiently to be useful for zero-shot learning, which hurts prediction performance.

To address these issues, the authors propose a simple yet effective framework, Easy Audio-Visual Generalized Zero-shot Learning (EZ-AVGZL). The framework improves on existing methods in two ways:

- **Class Embedding Optimization**: Differential optimization transforms class embeddings into a more discriminative space while preserving the semantic structure of the language representations.
- **Supervised Contrastive Loss**: A single supervised text-audio-visual contrastive loss aligns audio-visual features with text representations, avoiding the reconstruction of cross-modal features and text embeddings used in prior methods.

According to the authors, experiments on the VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks show that the method achieves state-of-the-art performance in audio-visual generalized zero-shot learning.
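The supervised alignment between audio-visual features and class text embeddings described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the loss takes the common form of a cross-entropy over temperature-scaled cosine similarities between each audio-visual embedding and every class text embedding, and it omits the learned transformation of the class embeddings. All function names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(av_embedding, class_embeddings, temperature=0.1):
    """Cosine similarity of one audio-visual embedding to every class embedding,
    divided by a temperature (assumed hyperparameter)."""
    a = l2_normalize(av_embedding)
    return [
        sum(x * y for x, y in zip(a, l2_normalize(c))) / temperature
        for c in class_embeddings
    ]


def supervised_contrastive_loss(av_batch, labels, class_embeddings, temperature=0.1):
    """Mean cross-entropy that pulls each audio-visual embedding toward the text
    embedding of its own class and pushes it away from the other classes."""
    total = 0.0
    for av, label in zip(av_batch, labels):
        logits = cosine_logits(av, class_embeddings, temperature)
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[label]
    return total / len(av_batch)


def predict(av_embedding, class_embeddings):
    """Generalized zero-shot prediction: pick the nearest class embedding among
    all classes (seen and unseen share the same embedding space)."""
    logits = cosine_logits(av_embedding, class_embeddings, temperature=1.0)
    return max(range(len(logits)), key=logits.__getitem__)


# Toy 2-D data: classes 0 and 1 are "seen", class 2 stands in for an unseen class.
class_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
av_batch = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 1]

aligned_loss = supervised_contrastive_loss(av_batch, labels, class_embeddings)
swapped_loss = supervised_contrastive_loss(av_batch, [1, 0], class_embeddings)
```

In this toy setup the loss is lower when the labels match the nearest class embeddings than when they are swapped, which is the behavior the alignment objective relies on. The sketch also hints at why class separation matters: if two class embeddings are nearly parallel, their logits are nearly equal and the prediction becomes unreliable.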

Audio-visual Generalized Zero-shot Learning the Easy Way
