Abstract: Large Language Models are applied to recommendation tasks such as suggesting items to buy and news articles to read. Point-of-Interest (POI) recommendation is a relatively new application area for sequential recommendation based on language representations of multimodal datasets. As a first step toward proving our concept, we focused on restaurant recommendation based on each user's past visit history. When choosing the next restaurant to visit, a user would consider the genre and location of the venue and, if available, pictures of the dishes served there. We created a pseudo restaurant check-in history dataset from the Foursquare dataset and the FoodX-251 dataset by converting pictures into text descriptions with a multimodal model called LLaVA, and used a language-based sequential recommendation framework named Recformer, proposed in 2023. A model trained on this semi-multimodal dataset outperformed another model trained on the same dataset without picture descriptions. This suggests that the semi-multimodal model reflects actual human behaviour and that our path toward a multimodal recommendation model is heading in the right direction.
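The idea of feeding restaurant attributes and dish descriptions to a language-based recommender can be illustrated with a minimal sketch. It assumes each restaurant is a dictionary of text attributes that gets flattened into one string, and that the model without picture descriptions simply omits one field. The field names (`genre`, `location`, `dish_description`), the `key: value | key: value` separator, and both function names are hypothetical choices for illustration; they are not taken from the paper or from the Recformer code.

```python
from typing import Dict, List

# Hypothetical field name holding the LLaVA-generated caption.
DESCRIPTION_KEY = "dish_description"


def item_to_text(attributes: Dict[str, str], include_description: bool = True) -> str:
    """Flatten one restaurant's key-value attributes into a single string.

    Empty values are skipped. When include_description is False, the
    caption field is dropped, giving the text-only variant of the item.
    """
    parts = []
    for key, value in attributes.items():
        if key == DESCRIPTION_KEY and not include_description:
            continue
        if value and value.strip():
            parts.append(f"{key}: {value.strip()}")
    return " | ".join(parts)


def history_to_text(
    history: List[Dict[str, str]],
    include_description: bool = True,
    max_items: int = 50,
) -> List[str]:
    """Convert the most recent check-ins into a list of item strings."""
    recent = history[-max_items:]
    return [item_to_text(item, include_description) for item in recent]


# Made-up example data, not drawn from Foursquare or FoodX-251.
example_history = [
    {"genre": "Ramen", "location": "Shibuya",
     "dish_description": "a bowl of noodles in broth topped with sliced pork"},
    {"genre": "Sushi", "location": "Ginza", "dish_description": ""},
]
with_pictures = history_to_text(example_history, include_description=True)
without_pictures = history_to_text(example_history, include_description=False)
```

Building both variants from the same records keeps the comparison limited to the presence of the picture descriptions, which is the contrast the abstract describes.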

What problem does this paper attempt to address?

The paper attempts to address the problem of how to utilize multimodal data (including text, images, and geographic information) to improve the performance of Point of Interest (POI) recommendation systems. Specifically, the authors focus on the task of restaurant recommendation, aiming to more accurately recommend the next restaurant by combining users' past visit records, the geographic location of restaurants, and descriptions of dish images.

### Main Issues

1. **How to apply multimodal data to POI recommendation**: Traditional POI recommendation systems are mainly based on ID sequences. This paper attempts to use language models to process multimodal data (such as text and images) to better capture user preferences.
2. **How to handle geographic constraints**: POI recommendation needs to consider not only user interests but also geographic factors to ensure that the recommended locations are within a reasonable range for the user.
3. **How to generate and utilize descriptions of dish images**: By converting dish images into text descriptions and incorporating them into the recommendation model, the accuracy of recommendations can be improved.

### Solutions

1. **Data Preparation**: Extract user visit records and dish images from the Foursquare and FoodX-251 datasets, and use a multimodal model (LLaVA) to convert images into text descriptions.
2. **Model Training**: Use the Recformer framework to train two models, one with dish descriptions and one without. Comparing the performance of the two tests the effectiveness of the multimodal data.
3. **Experimental Evaluation**: Evaluate model performance using metrics such as nDCG, Recall, MRR, and AUC. The results show that the model with dish descriptions outperforms the one without descriptions.

### Contributions

1. **Proposed a new multimodal sequential recommendation method**: Combining text, images, and geographic information, applied to the POI recommendation task.
2. **Introduced geographic key information**: Addressed the geographic constraint problem in POI recommendation through geographic indexing and location information.
3. **Presented experimental results**: Validated the effectiveness of multimodal data in POI recommendation through experiments, providing a reference for future research.

Overall, the paper improves the performance of POI recommendation by combining multimodal data, especially visual information converted into text, and provides new ideas and methods for research in this field.
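The ranking metrics named under Experimental Evaluation can be sketched for the next-restaurant setting. This is a minimal illustration that assumes exactly one ground-truth restaurant per test sequence, so the ideal DCG is 1 and nDCG@k reduces to 1 / log2(rank + 1). The function names and the cutoff `k` are illustrative assumptions rather than the paper's evaluation code, and AUC is omitted.

```python
import math
from typing import Dict, List, Sequence


def rank_of_target(ranked_items: Sequence[str], target: str) -> int:
    """Return the 1-based rank of the target, or 0 if it is not in the list."""
    for position, item in enumerate(ranked_items, start=1):
        if item == target:
            return position
    return 0


def recall_at_k(rank: int, k: int) -> float:
    """1.0 if the single relevant item appears in the top k, else 0.0."""
    return 1.0 if 0 < rank <= k else 0.0


def ndcg_at_k(rank: int, k: int) -> float:
    """nDCG@k with one relevant item: DCG = 1 / log2(rank + 1), ideal DCG = 1."""
    return 1.0 / math.log2(rank + 1) if 0 < rank <= k else 0.0


def reciprocal_rank(rank: int) -> float:
    """1 / rank, or 0.0 when the target was not retrieved."""
    return 1.0 / rank if rank > 0 else 0.0


def evaluate(predictions: List[Sequence[str]], targets: List[str], k: int = 10) -> Dict[str, float]:
    """Average Recall@k, nDCG@k, and MRR over all test sequences."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and of equal length")
    ranks = [rank_of_target(p, t) for p, t in zip(predictions, targets)]
    n = len(ranks)
    return {
        f"recall@{k}": sum(recall_at_k(r, k) for r in ranks) / n,
        f"ndcg@{k}": sum(ndcg_at_k(r, k) for r in ranks) / n,
        "mrr": sum(reciprocal_rank(r) for r in ranks) / n,
    }


# Toy rankings with made-up venue IDs: the target is ranked 2nd, then missing.
scores = evaluate([["v1", "v2", "v3"], ["v4", "v5", "v6"]], ["v2", "v9"], k=3)
```

Running `evaluate` once on the rankings of the model with dish descriptions and once on those of the model without gives directly comparable numbers, which is the comparison the summary describes.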

Multimodal Point-of-Interest Recommendation
