Xiangyan Qu, Jing Yu, Keke Gai, Jiamin Zhuang, Yuanmin Tang, Gang Xiong, Gaopeng Gou, Qi Wu
Abstract: Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with corresponding images to transfer knowledge. However, they disregard that semantic information is not equivalent between them, resulting in a suboptimal alignment. In this work, we propose a novel network to extract multi-view semantic concepts from documents and images and align the matching rather than entire concepts. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from visual and textual sides, providing the basic concepts for partial alignment. To alleviate the issue of information redundancy among embeddings, we propose the local-to-semantic variance loss to capture distinct local details and multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, we consistently outperform state-of-the-art methods under two document sources in three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns the interpretable partial association.
What problem does this paper attempt to address?
The paper addresses a weakness of document-based zero-shot learning (ZSL): existing methods transfer knowledge by aligning the entire semantics of a document with images, overlooking the fact that documents and images are only partially relevant to each other. This global alignment leads to suboptimal semantic alignment. Specifically, the problems include:
1. **Noisy documents**: Documents in encyclopedias cover many views (such as shape, color, habitat, sound, and diet), but some of these views may not contain visual information (e.g., sound and diet), and these non-visual views are harmful to knowledge transfer.
2. **Exhaustive descriptions**: Documents comprehensively describe the possible features of a category, but a single image usually only captures a part of these features. For example, an image may only show the horn shape, color, and habitat of an antelope, while ignoring other features.
3. **Visually diverse image content**: Due to changes in shooting angles, lighting, positions, and states, images of the same category convey different semantic concepts. Aligning diverse images with the same document semantics makes accurate semantic alignment difficult.
To solve these problems, the authors propose a new network named Embedding Decomposition and Partial Alignment (EmDepart), which extracts multi-view semantic concepts from documents and images and aligns them partially, based on semantic relevance. The main contributions of the paper are:
1. **Proposing a new network structure** that decomposes the concepts of documents and images into multi-view semantic embeddings and performs partial alignment based on semantic relevance. This addresses the suboptimal alignment caused by ignoring partial relevance and offers a new approach to partial semantic alignment between vision and language.
2. **Introducing a semantic decomposition module**, together with a local-to-semantic variance loss that captures distinct local details and a multiple semantic diversity loss that enforces orthogonality among embeddings, to address the information redundancy caused by feature collapse. These losses also improve the average performance of previous methods by 4.1%.
3. **On three standard benchmark datasets**, the model consistently outperforms state-of-the-art methods with both document sources, Wiki and Wiki+LLM. Averaged over all metrics, performance improves by 6.0% and 5.8%, respectively. In addition, qualitative experiments show that the model learns interpretable partial semantic associations.
Through these improvements, the paper significantly improves knowledge transfer in document-based zero-shot learning.
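To make the two central ideas more concrete, the following is a minimal sketch in plain Python. It does not reproduce the paper's actual loss formulations, which are not given in this summary. It assumes that "enforcing orthogonality among embeddings" can be approximated by penalizing pairwise cosine similarity, and that "partial alignment" can be approximated by scoring each visual view against only its best-matching textual view. All function names (`diversity_penalty`, `partial_alignment_score`, `global_alignment_score`) and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversity_penalty(embeddings):
    """Mean squared cosine similarity over all distinct embedding pairs.

    Assumed stand-in for a diversity loss: it is 0 when the embeddings are
    mutually orthogonal and 1 when they have all collapsed onto one direction.
    Requires at least two embeddings.
    """
    k = len(embeddings)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    total = sum(cosine(embeddings[i], embeddings[j]) ** 2 for i, j in pairs)
    return total / len(pairs)


def partial_alignment_score(visual_views, text_views):
    """Average, over visual views, of the similarity to the best-matching text view.

    Text views that match no visual view (e.g. a non-visual "sound" view)
    do not lower the score.
    """
    best = [max(cosine(v, t) for t in text_views) for v in visual_views]
    return sum(best) / len(best)


def global_alignment_score(visual_views, text_views):
    """Baseline for comparison: average similarity over every visual-text pair."""
    sims = [cosine(v, t) for v in visual_views for t in text_views]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy data: the document has three views, the image shows only two of them.
    text_views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    visual_views = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

    print(diversity_penalty(text_views))
    print(diversity_penalty([[1.0, 0.0, 0.0]] * 3))
    print(partial_alignment_score(visual_views, text_views))
    print(global_alignment_score(visual_views, text_views))
```

In this toy setup the third textual view has no counterpart in the image. The global score is pulled down by that unmatched view, while the partial score is not, which is the intuition behind aligning matching concepts rather than entire documents.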