Semantic Distillation from Neighborhood for Composed Image Retrieval

Yifan Wang,Wuliang Huang,Lei Li,Chun Yuan
DOI: https://doi.org/10.1145/3664647.3681493
2024-01-01
Abstract:The challenging task composed image retrieval targets at identifying the matched image from the multi-modal query with a reference image and a textual modifier. Most existing methods are devoted to composing the unified query representations from the query images and texts, yet the distribution gaps between the hybrid-modal query representations and visual target representations are neglected. However, directly incorporating target features on the query may cause ambiguous rankings and poor robustness due to the insufficient exploration of the distinguishments and overfitting issues. To address the above concerns, we propose a novel framework termed SemAntic Distillation from Neighborhood (SADN) for composed image retrieval. For mitigating the distribution divergences, we construct neighborhood sampling from the target domain for each query and aggregate neighborhood features with adaptive weights to restructure the query representations. Specifically, the adaptive weights are determined by the collaboration of two individual modules, as correspondence-induced adaption and divergence-based correction. Correspondence-induced adaption accounts for capturing the correlation alignments from neighbor features under the guidance of the positive representations, and the divergence-based correction regulates the weights based on the embedding distances between hard negatives and the query in the latent space. Extensive results and ablation studies on CIRR and FashionIQ validate that the proposed semantic distillation from neighborhood significantly outperforms baseline methods.
What problem does this paper attempt to address?