Dynamic Soft Labeling for Visual Semantic Embedding

Jiaao Yu, Yunlai Ding, Junyu Dong, Yuezun Li
DOI: https://doi.org/10.1145/3652583.3658068
2024-01-01
Abstract: Visual Semantic Embedding (VSE) is a prominent approach in image-text retrieval, aiming to learn a deep embedding space that aligns visual data with semantic text labels. However, current VSE methods oversimplify the retrieval task, treating it as a binary classification problem with triplet loss constraints. This formulation ignores the semantic correlation between mismatched sample pairs and fails to capture the similarity gradient between samples. In addition, hard constraints on negative samples with high semantic relevance can be detrimental to the model's representational capabilities. To address these limitations, we propose a novel training strategy that introduces dynamic soft labels without additional annotations. These labels capture the correlation between positive and negative sample pairs and guide feature representation learning through the Soft Negative Alignment Loss (SNAL). SNAL accounts for the influence of similar negative samples, enhancing the representation of cross-modal data. We also propose the Stepwise Negative Decoupling Loss (SNDL) to increase the distance between positive and negative samples. Stepwise decoupling adaptively pushes negative samples away according to their semantic relevance to the anchor, resulting in a wider distribution of sample features in the common space. Experiments on the Flickr30K and MS-COCO datasets validate the effectiveness of our dynamic soft labeling (DSL) methods, demonstrating the importance of considering complex relationships between sample pairs and the limitations of rigid negative sample categorization based on subjective annotations.
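For context, the binary triplet-loss formulation that the abstract criticizes can be sketched as below. This is a minimal illustration of a common hinge-based loss with in-batch hardest negatives, not the paper's SNAL or SNDL, whose formulas are not given in the abstract. The function name, the cosine similarity, and the margin of 0.2 are assumptions made for the example.

```python
import numpy as np


def triplet_loss_hardest_negative(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet loss over a batch of matched image-text pairs.

    Assumption: row i of img_emb and row i of txt_emb form a matched pair,
    and every other row in the batch is treated as a binary negative.
    """
    # L2-normalise so the dot product equals cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    sim = img @ txt.T      # sim[i, j] = similarity(image_i, text_j)
    pos = np.diag(sim)     # similarities of the matched pairs

    # Mask the diagonal so a positive is never chosen as its own negative.
    masked = sim.copy()
    np.fill_diagonal(masked, -np.inf)
    hardest_txt = masked.max(axis=1)   # hardest negative text per image
    hardest_img = masked.max(axis=0)   # hardest negative image per text

    # The same fixed margin applies to every negative, however related it is.
    loss_i2t = np.maximum(0.0, margin + hardest_txt - pos)
    loss_t2i = np.maximum(0.0, margin + hardest_img - pos)
    return float((loss_i2t + loss_t2i).sum())


# Usage: two orthogonal pairs that are correctly aligned, then swapped.
aligned = np.array([[1.0, 0.0], [0.0, 1.0]])
swapped = np.array([[0.0, 1.0], [1.0, 0.0]])
loss_aligned = triplet_loss_hardest_negative(aligned, aligned)
loss_swapped = triplet_loss_hardest_negative(aligned, swapped)
```

Because the margin is identical for all negatives, a caption that nearly describes the image is penalised exactly like an unrelated one. That is the rigidity the abstract attributes to hard labels and that soft labels are meant to relax.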