Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning

Ziming Liu, Jingcai Guo, Song Guo, Xiaocheng Lu
2024-08-25
Abstract: This paper investigates a challenging problem of zero-shot learning in the multi-label scenario (MLZSL), wherein the model is trained to recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge, e.g., semantic information. Existing methods usually resort to analyzing the relationship of various seen classes residing in a sample from the dimension of spatial or semantic characteristics and transferring the learned model to unseen ones. However, they neglect the integrity of local and global features. Although the use of the attention structure will accurately locate local features, especially objects, it will significantly lose its integrity, and the relationship between classes will also be affected. Rough processing of global features will also directly affect comprehensiveness. This neglect will make the model lose its grasp of the main components of the image. Relying only on the local existence of seen classes during the inference stage introduces unavoidable bias. In this paper, we propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, to fully make use of such properties and enable a more accurate and robust visual-semantic projection. In terms of spatial information, we achieve effective refinement by group aggregating image features into several semantic prompts. It can aggregate semantic information rather than class information, preserving the correlation between semantics. In terms of global semantics, we use global forward propagation to collect as much information as possible to ensure that semantics are not omitted. Experiments on large-scale MLZSL benchmark datasets NUS-Wide and Open-Images-v4 demonstrate that the proposed Epsilon outperforms other state-of-the-art methods with large margins.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to address the challenges in multi-label zero-shot learning (MLZSL). Specifically, the paper focuses on the identification problem of classes not seen during the training phase, that is, how to use the information of known classes and auxiliary knowledge (such as semantic information) to identify new, unseen classes. Existing methods usually analyze the spatial or semantic feature relationships of various known classes in samples and transfer the learned models to unknown classes. However, these methods often overlook the integrity of local and global features, resulting in problems when the model identifies the main image components.

To overcome these problems, the authors propose a new comprehensive visual-semantic framework called Epsilon. This framework aims to fully utilize the relationship between local and global features to achieve more accurate and robust visual-semantic projection. The specific contributions are as follows:

1. **Visual Prompt Learning**: Grouping and aggregating features through visual prompt learning solves the problem of detail loss caused by the use of spatial attention in traditional methods and greatly ensures the integrity of semantics.
2. **Global Forward Propagation Module**: Using the global forward propagation module (Global Forward Propagation, GFP) greatly enriches the diversity of global features and improves the richness of global information.
3. **Fusion Feature Processing**: The above two modules work together to process the fused features. Extensive experiments show that this method outperforms other state-of-the-art MLZSL models on the NUS-Wide and Open-Images-v4 datasets.

### Main Technical Details

#### 1. Group Prompts Aggregation Module

- **Input Image Feature Extraction**: Use the pre-trained ViT-B/16 model to extract the feature \(F_i\) of the input image \(I_i\).
- **Group Prompt Design**: Design multiple updatable group prompts \(GT_i\), with the number being \(M\).
- **Feature Aggregation**: Input the image feature \(F_i\) and group prompt \(GT_i\) into the Transformer encoder for aggregation to obtain the aggregated image feature \(GTQ_i\).
- **Fine-grained Feature Recombination**: Recombine the updated group prompt \(GTQ_i\) with the original feature \(F_i\) to further refine the semantic information of each group and complete the local visual-semantic projection.

#### 2. Global Forward Propagation Module

- **Feature Block Partitioning**: Partition the feature \(F\) into \(M\) feature blocks \(FG1_i, FG2_i,\ldots, FGM_i\).
- **Multi-layer Perceptron Re-representation**: Use a multi-layer perceptron (MLP) to re-represent each feature block to obtain the weight representation \(WM_i\).
- **Weight Normalization**: Normalize the weights through the softmax function to obtain the weight representation \(AM_i\) for each feature point.
- **Global Semantic Generation**: Perform a dot-product operation between the normalized weights and the original feature blocks to obtain the global semantic \(SM_i\).
- **Feature Fusion**: Concatenate the local and global semantic features to obtain the final feature representation \(GS_i\).

#### 3. Loss Function

- **Rank Loss**: Use the RankNet loss function \(L_r\) to maximize the distance between the classes present in the input image and those not present.
- **Regularization Loss**: Introduce the regularization loss function \(L_{\text{reg}}\) to construct the correlation between input semantic vectors.
- **Total Loss Function**: Combine the rank loss and regularization loss to obtain the final loss function \(L\).

### Experimental Results

- **Dataset**: Experiments were carried out on the NUS-Wide and Open-Images-v4 datasets.
- **Evaluation Metric**: Use the mean average precision (mAP)
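
The Global Forward Propagation steps listed above (block partitioning, MLP scoring, softmax normalization, weighted sum) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation. Several details are assumptions made only for the example: the features are treated as a list of token vectors, the partition splits them into contiguous equal-sized blocks along the token axis, and each block has its own small two-layer MLP that produces one scalar score per token. The names `mlp_score`, `global_forward_propagation`, and `init_params` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_score(token, w1, b1, w2, b2):
    """Hypothetical two-layer MLP (ReLU hidden layer) mapping a token to a scalar."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, token)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def global_forward_propagation(features, num_groups, params):
    """Pool token features into `num_groups` global semantic vectors.

    features: list of N token vectors, each a list of D floats.
    params:   one (w1, b1, w2, b2) tuple per group (assumed layout).
    """
    n = len(features)
    if n % num_groups != 0:
        raise ValueError("token count must be divisible by num_groups")
    block_size = n // num_groups
    dim = len(features[0])

    semantics = []
    for g in range(num_groups):
        # Feature block partitioning (assumed: contiguous token blocks).
        block = features[g * block_size:(g + 1) * block_size]
        # MLP re-representation: one raw weight per feature point.
        raw = [mlp_score(tok, *params[g]) for tok in block]
        # Weight normalization over the feature points of the block.
        weights = softmax(raw)
        # Weighted sum of the original block features: the global semantic.
        pooled = [
            sum(a * tok[d] for a, tok in zip(weights, block))
            for d in range(dim)
        ]
        semantics.append(pooled)
    return semantics


def init_params(dim, hidden, rng):
    """Random MLP parameters for the demo below (illustrative only)."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
    return w1, b1, w2, 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [[rng.random() for _ in range(8)] for _ in range(12)]
    groups = 3
    params = [init_params(8, 4, rng) for _ in range(groups)]
    out = global_forward_propagation(tokens, groups, params)
    print(len(out), "global semantic vectors of dimension", len(out[0]))
```

Because the softmax weights in each block sum to one, every global semantic vector is a convex combination of the tokens in its block; with constant MLP scores the operation reduces to average pooling. The final fusion step would concatenate these vectors with the local group-prompt features, which this sketch does not cover.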