Abstract: Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with corresponding images to transfer knowledge. However, they disregard that semantic information is not equivalent between them, resulting in a suboptimal alignment. In this work, we propose a novel network to extract multi-view semantic concepts from documents and images and align the matching rather than entire concepts. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from visual and textual sides, providing the basic concepts for partial alignment. To alleviate the issue of information redundancy among embeddings, we propose the local-to-semantic variance loss to capture distinct local details and multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, we consistently outperform state-of-the-art methods under two document sources in three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns the interpretable partial association.

What problem does this paper attempt to address?

The paper addresses a limitation of document-based zero-shot learning (ZSL): existing methods transfer knowledge by aligning the entire semantics of a document with images, overlooking that documents and images are only partially relevant to each other. This global alignment yields suboptimal semantic alignment. Specifically, the problems include:

1. **Noisy documents**: Encyclopedia documents cover many views (such as shape, color, habitat, sound, and diet), but some of these views (e.g., sound and diet) carry no visual information, and these non-visual views harm knowledge transfer.
2. **Exhaustive descriptions**: Documents comprehensively describe the possible features of a category, but a single image usually captures only some of them. For example, an image may show only the horn shape, color, and habitat of an antelope and none of its other features.
3. **Visually diverse image content**: Because of changes in shooting angle, lighting, position, and state, images of the same category convey different semantic concepts. Aligning such diverse images with the same document semantics makes accurate alignment difficult.

To solve these problems, the authors propose a network named Embedding Decomposition and Partial Alignment (EmDepart), which extracts multi-view semantic concepts from documents and images and aligns them partially, based on semantic relevance. The main contributions of the paper are:

1. **Proposing a new network structure** that decomposes the concepts of documents and images into multi-view semantic embeddings and performs partial alignment based on semantic relevance. This addresses the suboptimal alignment caused by ignoring partial relevance and offers a new approach to partial visual-language semantic alignment.
2. **Introducing a semantic decomposition module**, together with a local-to-semantic variance loss that captures distinct local details and a multiple semantic diversity loss that enforces orthogonality among embeddings, to counter the information redundancy caused by feature collapse. These losses also improve the average performance of previous methods by 4.1%.
3. **Outperforming state-of-the-art methods on three standard benchmark datasets** when the document source is either Wiki or Wiki+LLM. Averaged over all metrics, performance improves by 6.0% and 5.8%, respectively. In addition, qualitative experiments show that the model learns interpretable partial semantic associations.

Together, these contributions improve knowledge transfer in document-based zero-shot learning.
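The two ideas in the summary above, keeping view embeddings from collapsing onto each other and scoring an image against only the document views it actually matches, can be illustrated with a small sketch. This is not the authors' formulation: the function names, the squared-cosine penalty, and the max-over-views scoring rule are all assumptions chosen for illustration, and plain Python lists stand in for learned embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_matrix(rows_a, rows_b):
    """Cosine similarity between every row of rows_a and every row of rows_b."""
    a = [l2_normalize(r) for r in rows_a]
    b = [l2_normalize(r) for r in rows_b]
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def diversity_penalty(views):
    """Mean squared off-diagonal cosine similarity among view embeddings.

    Zero when all views are mutually orthogonal, larger as views become
    redundant. Hypothetical stand-in for an orthogonality-style loss.
    """
    sim = cosine_matrix(views, views)
    k = len(views)
    off_diag = [sim[i][j] ** 2 for i in range(k) for j in range(k) if i != j]
    return sum(off_diag) / len(off_diag) if off_diag else 0.0


def partial_alignment_score(visual_views, text_views):
    """Score an image against a document using only best-matching views.

    Each visual view keeps its most similar textual view; the maxima are
    averaged, so textual views with no visual counterpart are ignored.
    This scoring rule is an assumption, not the paper's loss.
    """
    sim = cosine_matrix(visual_views, text_views)
    return sum(max(row) for row in sim) / len(sim)


# Toy example: two visual views, three textual views (the third view has
# no visual counterpart, like "diet" or "sound" in an encyclopedia entry).
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(diversity_penalty(visual))              # 0.0
print(partial_alignment_score(visual, text))  # 1.0
```

In this toy case the unmatched third textual view does not lower the score, which is the intuition behind partial rather than global alignment. A global score that averaged over all visual-textual view pairs would instead be pulled down by the non-visual view.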

Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning

Joint Learning of Attended Zero-Shot Features and Visual-Semantic Mapping

Dual Collaborative Visual-Semantic Mapping for Multi-Label Zero-Shot Image Recognition

Indirect visual–semantic alignment for generalized zero-shot recognition

Agree to Disagree: Exploring Partial Semantic Consistency against Visual Deviation for Compositional Zero-Shot Learning

Learning complementary semantic information for zero-shot recognition

Visual-Semantic Aligned Bidirectional Network for Zero-Shot Learning

A Novel Perspective to Zero-shot Learning: Towards an Alignment of Manifold Structures via Semantic Feature Expansion

VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning

Visually Aligned Word Embeddings for Improving Zero-shot Learning

Multi-modal Generative Adversarial Network for Zero-Shot Learning

Semantic Softmax Loss for Zero-Shot Learning

Semantics Disentangling for Generalized Zero-Shot Learning

Leveraging Self-Distillation and Disentanglement Network to Enhance Visual–Semantic Feature Consistency in Generalized Zero-Shot Learning

Multi-level Fusion of Multi-modal Semantic Embeddings for Zero Shot Learning

Semantic-visual shared knowledge graph for zero-shot learning

What Remains of Visual Semantic Embeddings

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning

Zero-Shot Recognition Using Dual Visual-Semantic Mapping Paths

Semi-Supervised Low-Rank Semantics Grouping for Zero-Shot Learning