Abstract: With the growth of the multimedia internet, visual characteristics have an increasingly significant impact on whether users click in online retail. Incorporating visual features is therefore a promising direction for further improving click-through rate (CTR) prediction. However, experiments on our production system revealed that simply injecting image embeddings trained with established pre-training methods yields only marginal improvements. We believe that the main advantage of existing image feature pre-training methods lies in their effectiveness for cross-modal prediction, which differs significantly from the task of CTR prediction in recommendation systems. In recommendation systems, other modalities of information (such as text) can be used directly as features in downstream models, so even when cross-modal prediction performance is excellent, the resulting image embeddings provide little additional information gain for the downstream models. We argue that a visual feature pre-training method tailored for recommendation is necessary for further improvements beyond existing modality features. To this end, we propose an effective user intention reconstruction module that mines visual features related to user interests from behavior histories and constructs a many-to-one correspondence. We further propose a contrastive training method to learn the user intentions and prevent the collapse of embedding vectors. We conduct extensive experimental evaluations on public datasets and our production system to verify that our method can learn users' visual interests. Our method achieves a $0.46\%$ improvement in offline AUC and a $0.88\%$ improvement in Taobao GMV (Gross Merchandise Volume) with p-value $< 0.01$.
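
The abstract gives no architectural details, so the following is only a minimal sketch of how a many-to-one intention reconstruction with a contrastive objective might look. Everything in it is an assumption made for illustration: the dot-product attention over history embeddings, the InfoNCE-style loss with in-batch negatives, and the names `reconstruct_intention`, `contrastive_loss`, and `temperature` do not come from the paper.

```python
import numpy as np


def reconstruct_intention(history, target, temperature=1.0):
    """Rebuild one target embedding from many history embeddings (assumed form).

    history: (n, d) image embeddings of items the user clicked earlier.
    target:  (d,) image embedding of the item the user clicked next.
    """
    scores = history @ target / temperature           # similarity to each past item
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the history
    return weights @ history                          # weighted sum, shape (d,)


def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def contrastive_loss(reconstructions, targets, temperature=0.1):
    """InfoNCE-style loss with in-batch negatives (an assumed choice).

    Row i of `reconstructions` should match row i of `targets` and differ
    from every other row, which discourages all embeddings from collapsing
    to the same vector.
    """
    r = l2_normalize(reconstructions)
    t = l2_normalize(targets)
    logits = r @ t.T / temperature                    # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, n_history, dim = 4, 5, 8
    histories = rng.normal(size=(batch, n_history, dim))
    targets = rng.normal(size=(batch, dim))
    recs = np.stack([reconstruct_intention(h, t) for h, t in zip(histories, targets)])
    print("loss:", contrastive_loss(recs, targets))
```

In this sketch the history items play the "many" role and the next clicked item plays the "one"; in a real pre-training setup the embeddings would come from a trainable image encoder and the loss would be minimized by gradient descent, neither of which is shown here.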

Orthogonal Vector-Decomposed Disentanglement Network of Interactive Image Retrieval for Fashion Outfit Recommendation

Learning Disentangled Representation for Cross-Modal Retrieval with Deep Mutual Information Estimation

Fashion Recommendation on Street Images

Heterogeneous Hashing Network for Face Retrieval Across Image and Video Domains

FashionERN: Enhance-and-Refine Network for Composed Fashion Image Retrieval

Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking

Decompose Semantic Shifts for Composed Image Retrieval

OD-Net: Orthogonal descriptor network for multiview image keypoint matching

Toward Accurate and Realistic Outfits Visualization with Attention to Details

Personalized Fashion Recommendation with Visual Explanations Based on Multimodal Attention Network

COURIER: Contrastive User Intention Reconstruction for Large-Scale Visual Recommendation

Exploiting Distribution Constraints for Scalable and Efficient Image Retrieval

Viewpoint Disentangling and Generation for Unsupervised Object Re-ID

Semantic Distillation from Neighborhood for Composed Image Retrieval

Scale-Semantic Joint Decoupling Network for Image-text Retrieval in Remote Sensing

Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval

Fashion Recommendation and Compatibility Prediction Using Relational Network

DP-VTON: Toward Detail-Preserving Image-Based Virtual Try-on Network

High Fidelity Virtual Try-on Network Via Semantic Adaptation and Distributed Componentization

Discriminative Multi-View Interactive Image Re-Ranking

Retrieval-based Disentangled Representation Learning with Natural Language Supervision