Orthogonal Vector-Decomposed Disentanglement Network of Interactive Image Retrieval for Fashion Outfit Recommendation

Chen, Jie Guo, Bin Song, Tong Zhang
DOI: https://doi.org/10.1145/3552468.3555362
2022-01-01
Abstract: Interactive image retrieval for fashion outfit recommendation is a challenging task that aims to retrieve the desired target image according to a multi-modal query (a reference image and a modification text). Previous studies focus on exploring effective feature composition methods to achieve similarity matching between different modalities. However, feature redundancy and semantic inconsistency between modalities introduce much task-irrelevant information. It becomes intractable to correctly identify the particular information to be modified, and the resulting noise inevitably leads to suboptimal performance. To this end, we present a novel Orthogonal Vector-Decomposed Disentanglement Network (OVDDN) for image retrieval. OVDDN leverages the disentangled parts to learn a controllable denoising embedding space. First, we design an orthogonal disentanglement module. It is applied to both image and text features to decouple each into two independent components (invariant and specific) through orthogonal constraints. A similarity metric loss ensures the semantic consistency of paired images. Then, an attention network composes the invariant part of the reference image with the task-related part of the text to match the target image. Finally, a differential feature alignment module maintains cross-modal semantic consistency. Extensive experiments conducted on three benchmark datasets show that OVDDN achieves consistently superior performance. Ablation analyses further verify the effectiveness of the proposed model.
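To make the idea of decoupling a feature into two components under an orthogonal constraint more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the loss or the decomposition, so the function names `split_by_projection` and `orthogonality_penalty`, the projection-based split, and the squared-cosine penalty are all assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_by_projection(feature, direction):
    """Hypothetical decomposition: split `feature` into the part lying along
    `direction` and the residual, which is orthogonal to it by construction."""
    scale = dot(feature, direction) / dot(direction, direction)
    parallel = [scale * d for d in direction]
    residual = [f - p for f, p in zip(feature, parallel)]
    return parallel, residual


def orthogonality_penalty(invariant, specific):
    """Assumed form of an orthogonal constraint: mean squared cosine
    similarity between paired component vectors (0 when orthogonal)."""
    total = 0.0
    for u, v in zip(invariant, specific):
        norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
        cos = dot(u, v) / norm if norm > 0 else 0.0
        total += cos * cos
    return total / len(invariant)


# Toy feature split along an arbitrary axis (values are illustrative only).
invariant_part, specific_part = split_by_projection([3.0, 4.0], [1.0, 0.0])
penalty = orthogonality_penalty([invariant_part], [specific_part])
```

In a learned model the two components would typically come from separate trainable projections, and a penalty of this kind would be added to the training loss to push them toward orthogonality. The projection-based split above simply shows the geometric target of such a constraint.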