Abstract: Zero-shot detection (ZSD) aims to locate and classify unseen objects in images or videos using auxiliary semantic information, without additional training examples. Most existing ZSD methods are based on two-stage models, which detect unseen classes by aligning object region proposals with semantic embeddings. However, these methods have several limitations that can degrade overall performance: poor region proposals for unseen classes, no consideration of the semantic representations of unseen classes or their inter-class correlations, and domain bias towards seen classes. To address these issues, we propose Trans-ZSD, a transformer-based multi-scale contextual detection framework that explicitly exploits inter-class correlations between seen and unseen classes and optimizes the feature distribution to learn discriminative features. Trans-ZSD is a single-stage approach that skips proposal generation and performs detection directly, which allows it to encode long-range dependencies at multiple scales and learn contextual features while requiring fewer inductive biases. Trans-ZSD also introduces three components: a foreground-background separation branch to reduce confusion between unseen classes and the background, contrastive learning to capture inter-class uniqueness and reduce misclassification between similar classes, and explicit inter-class commonality learning to facilitate generalization between related classes. To address the domain bias problem in end-to-end generalized zero-shot detection (GZSD) models, Trans-ZSD uses a balance loss that maximizes response consistency between seen and unseen predictions, so that the model is not biased towards seen classes. The Trans-ZSD framework is evaluated on the PASCAL VOC and MS COCO datasets, where it demonstrates significant improvements over existing ZSD models.
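
The alignment idea mentioned in the abstract, matching a visual feature against class semantic embeddings so that a class with no training images can still be predicted, can be illustrated with a minimal sketch. This is not the Trans-ZSD implementation: the function names, the 3-dimensional embeddings, and the feature values are all hypothetical, and the visual feature is assumed to have already been projected into the semantic space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(feature, class_embeddings):
    """Return the class whose semantic embedding is closest to the feature."""
    scores = {name: cosine(feature, emb) for name, emb in class_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy semantic embeddings (made-up values for illustration only).
# "zebra" stands in for an unseen class: it has an embedding but no
# training images.
class_embeddings = {
    "horse": [0.9, 0.1, 0.0],  # seen class
    "car": [0.0, 0.1, 0.9],    # seen class
    "zebra": [0.8, 0.6, 0.0],  # unseen class
}

# Hypothetical visual feature, assumed to be already projected into the
# same space as the semantic embeddings.
projected_feature = [0.7, 0.7, 0.1]

label, scores = classify(projected_feature, class_embeddings)
print(label)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In this toy setting the unseen class can be selected purely because its embedding lies closest to the projected feature. The components the abstract describes (foreground-background separation, contrastive learning, and the balance loss) are not modeled here.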
