Abstract: We hypothesize that a user's visual history, consisting of images that reflect their daily life, offers valuable insights into their interests and preferences and can be leveraged for personalization. Among the many challenges in achieving this goal, the foremost is the diversity and noise in the visual history, which contains images that are not necessarily related to a recommendation task, do not necessarily reflect the user's interest, or are not even preference-relevant. Existing recommendation systems either rely on task-specific user interaction logs, such as online shopping history for shopping recommendations, or focus on text signals. We propose a novel approach, VisualLens, that extracts, filters, and refines image representations, and leverages these signals for personalization. We created two new benchmarks with task-agnostic visual histories, and show that our method improves over state-of-the-art recommendations by 5-10% on Hit@3, and improves over GPT-4o by 2-5%. Our approach paves the way for personalized recommendations in scenarios where traditional methods fail.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to improve the accuracy of personalized recommendations by using a user's visual history, that is, the photos the user has taken or shared. It proposes a method named **VisualLens**, which extracts information from users' everyday photos to better understand their interests and preferences and to provide more personalized recommendations.

### Background and challenges

1. **Diversity and noise**: A user's visual history usually contains many images that are not directly related to a specific recommendation task, and these images may not reflect the user's actual interests or preferences.
2. **Hardware limitations**: Recording a user's visual history has to work within hardware limits such as battery life, thermal constraints, and storage capacity.
3. **Real-time response**: Many recommendation tasks require immediate responses, so low-latency solutions are needed.

### Solutions

1. **Offline history enhancement**:
   - **Image encoding**: Use the CLIP ViT-L/14@336px model to encode each image and generate visual embeddings.
   - **Image description generation**: Use the frozen LLaVA-v1.6 8B model to generate image descriptions, limited to 30 words to reduce hallucinations.
   - **Aspect word generation**: Extract the key feature words (aspect words) of each image, such as dome, balcony, or plant, which capture important details about the image.
2. **Runtime recommendation generation**:
   - **History retrieval**: Given the query q and the user's visual history H_u, retrieve the images related to q, selecting at most w images to reduce the amount of processing.
   - **Preference profile generation**: Use the retrieved images, their descriptions, and their aspect words to generate the user's preference profile.
   - **Candidate matching**: Match the user's preference profile against each candidate item to produce a confidence score per candidate, which is then used for ranking.
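
The runtime steps above (history retrieval, preference profile generation, candidate matching) can be illustrated with a minimal sketch. This is not the paper's implementation: cosine similarity over toy vectors stands in for retrieval over CLIP embeddings, and a simple aspect-word overlap stands in for the model-based candidate matching. All names here (`HistoryImage`, `retrieve`, `build_profile`, `score_candidates`) are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class HistoryImage:
    embedding: list[float]  # stand-in for a visual embedding (assumption)
    caption: str            # stand-in for a short generated description
    aspects: list[str]      # stand-in for extracted aspect words


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_emb: list[float], history: list[HistoryImage], w: int) -> list[HistoryImage]:
    """History retrieval: keep at most w images most similar to the query."""
    ranked = sorted(history, key=lambda img: cosine(query_emb, img.embedding), reverse=True)
    return ranked[:w]


def build_profile(images: list[HistoryImage]) -> Counter:
    """Preference profile (simplified assumption): aspect-word frequencies."""
    profile: Counter = Counter()
    for img in images:
        profile.update(img.aspects)
    return profile


def score_candidates(profile: Counter, candidates: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Candidate matching (simplified assumption): share of the profile's
    aspect mass covered by each candidate, sorted from best to worst."""
    total = sum(profile.values())
    scores = {}
    for name, aspects in candidates.items():
        overlap = sum(profile[a] for a in set(aspects))
        scores[name] = overlap / total if total else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy usage with made-up data.
history = [
    HistoryImage([1.0, 0.0], "a cafe with plants", ["coffee", "plant"]),
    HistoryImage([0.9, 0.1], "a latte on a table", ["coffee"]),
    HistoryImage([0.0, 1.0], "a parked car", ["car"]),
]
retrieved = retrieve([1.0, 0.0], history, w=2)
profile = build_profile(retrieved)
ranking = score_candidates(profile, {"Cafe A": ["coffee", "plant"], "Garage B": ["car"]})
```

Capping retrieval at `w` images mirrors the summary's point about reducing the amount of processing at runtime; the scoring rule itself is only a placeholder for whatever matching model is actually used.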

### Experimental results

- **Benchmark datasets**: Two new benchmark datasets, Google Review-V and Yelp-V, were created; both contain task-agnostic visual histories.
- **Performance improvement**: VisualLens improves over the best existing method by 5-10% on the Hit@3 metric, and improves over GPT-4o by 1.6% and 4.6% on the Google Review-V and Yelp-V benchmarks, respectively.

### Conclusion

VisualLens improves the accuracy of personalized recommendations by leveraging users' visual history. The method opens up new ways to provide personalized recommendations in scenarios where traditional methods fail.
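
For readers unfamiliar with the Hit@3 metric mentioned in the results, the sketch below shows one way to compute Hit@k. It assumes the common definition (a query counts as a hit if at least one relevant item appears among the top k ranked candidates); the paper's exact evaluation protocol may differ. The function names `hit_at_k` and `mean_hit_at_k` are hypothetical.

```python
def hit_at_k(ranked_items: list[str], relevant_items: set[str], k: int = 3) -> float:
    """Return 1.0 if any relevant item is in the top-k of the ranking, else 0.0."""
    return 1.0 if any(item in relevant_items for item in ranked_items[:k]) else 0.0


def mean_hit_at_k(rankings: list[list[str]], relevants: list[set[str]], k: int = 3) -> float:
    """Average Hit@k over a set of queries (one ranking and one relevant set each)."""
    if not rankings:
        return 0.0
    hits = [hit_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(hits) / len(hits)


# Toy usage with made-up rankings: the first query is a hit, the second is not.
score = mean_hit_at_k(
    [["a", "b", "c", "d"], ["a", "b", "c", "d"]],
    [{"c"}, {"d"}],
    k=3,
)
```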

VisualLens: Personalization through Visual History
