LaMP: When Large Language Models Meet Personalization

Alireza Salemi, Sheshera Mysore, Michael Bendersky, Hamed Zamani
2024-06-05
Abstract: This paper highlights the importance of personalization in large language models and introduces the LaMP benchmark -- a novel benchmark for training and evaluating language models for producing personalized outputs. LaMP offers a comprehensive evaluation framework with diverse language tasks and multiple entries for each user profile. It consists of seven personalized tasks, spanning three text classification and four text generation tasks. We additionally propose two retrieval augmentation approaches that retrieve personal items from each user profile for personalizing language model outputs. To this aim, we study various retrieval models, including term matching, semantic matching, and time-aware methods. Extensive experiments on LaMP for zero-shot and fine-tuned language models demonstrate the efficacy of the proposed retrieval augmentation approach and highlight the impact of personalization in various natural language tasks.
Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to address the shortcomings of large language models (LLMs) in generating personalized outputs. Although large language models such as GPT-4 have significantly advanced natural language processing (NLP) applications, they typically adopt a "one-size-fits-all" approach to modeling and evaluation and do not cater to the specific needs and preferences of different users. The paper therefore emphasizes the importance of personalization in shaping the future of NLP systems and introduces a new benchmark, LaMP (Language Model Personalization), for training and evaluating language models capable of generating personalized outputs.

### Main Contributions

1. **LaMP Benchmark**:
   - Provides a comprehensive evaluation framework that includes various language tasks and multiple entries for each user profile.
   - Includes 7 personalized tasks: 3 text classification tasks and 4 text generation tasks.
2. **Retrieval-Enhanced Methods**:
   - Proposes two retrieval-enhanced methods that personalize the output of language models by retrieving personal items from each user profile.
   - Investigates various retrieval models, including term matching, semantic matching, and time-aware methods.
3. **Experimental Results**:
   - Conducts extensive experiments on the LaMP benchmark, demonstrating the effectiveness of the proposed retrieval-enhanced methods.
   - Highlights the impact of personalization on various natural language tasks.

### Personalization Tasks

1. **Personalized Citation Identification**:
   - Task Type: Binary Classification
   - Description: Given a paper written by the user, predict which of two candidate papers the user will cite.
2. **Personalized Movie Tagging**:
   - Task Type: Multi-Class Classification
   - Description: Given a movie description and the user's tagging history, predict the tag the user will assign to the movie.
3. **Personalized Product Rating**:
   - Task Type: Ordinal Multi-Class Classification
   - Description: Given the user's historical reviews and ratings along with an input review, predict the user's rating for the product.
4. **Personalized News Headline Generation**:
   - Task Type: Text Generation
   - Description: Given a news article and the author's historical article-headline pairs, generate a news headline that matches the author's style.
5. **Personalized Academic Title Generation**:
   - Task Type: Text Generation
   - Description: Given a research paper and the author's historical paper-title pairs, generate a paper title that matches the author's style.
6. **Personalized Email Subject Generation**:
   - Task Type: Text Generation
   - Description: Given an email and the user's historical email-subject pairs, generate an email subject that matches the user's style.
7. **Personalized Tweet Rewriting**:
   - Task Type: Text Generation
   - Description: Given a tweet and the user's historical tweets, generate a tweet that matches the user's style.

### Experimental Setup

- **Dataset Splits**:
  - User-Based Split: Ensures that the training, validation, and test sets share no users, which tests personalization for new users.
  - Time-Based Split: Splits each user's items chronologically. The most recent items are used to create input-output pairs and the older items form the user profile, which tests future interactions of existing users.
- **Evaluation Metrics**:
  - Classification Tasks: Accuracy, F1 Score, Mean Absolute Error (MAE), Root Mean Square Error (RMSE).
  - Generation Tasks: ROUGE-1, ROUGE-L.

### Retrieval-Enhanced Methods

- **In-Prompt Augmentation (IPA)**:
  - Directly adds retrieved user profile items to the input prompt.
- **Fusion-in-Decoder (FiD)**:
  - Encodes multiple retrieved inputs separately and then fuses these encodings in the decoder.
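
The in-prompt augmentation idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the scoring function is plain token overlap, used here as a simplified stand-in for a term-matching retriever such as BM25, and the function names (`tokenize`, `overlap_score`, `retrieve`, `build_prompt`) and the prompt template are hypothetical choices made for illustration only.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query: str, item: str) -> int:
    """Count tokens shared by query and item.

    Assumption: a crude term-matching score, not BM25 itself.
    """
    q = Counter(tokenize(query))
    d = Counter(tokenize(item))
    return sum(min(q[t], d[t]) for t in q)


def retrieve(query: str, profile: list[str], k: int = 2) -> list[str]:
    """Return the k profile items that score highest against the query."""
    ranked = sorted(profile, key=lambda item: overlap_score(query, item), reverse=True)
    return ranked[:k]


def build_prompt(query: str, profile: list[str], k: int = 2) -> str:
    """Prepend retrieved profile items to the task input (hypothetical template)."""
    context = "\n".join(f"- {item}" for item in retrieve(query, profile, k))
    return f"User history:\n{context}\n\nInput: {query}\nOutput:"


if __name__ == "__main__":
    # Toy user profile; the items are invented for this example.
    profile = [
        "Neural ranking models for web search",
        "A recipe for sourdough bread",
        "Dense retrieval with contrastive learning",
    ]
    print(build_prompt("dense retrieval models for personalized search", profile))
```

The resulting prompt would then be passed to a language model. A time-aware variant could replace `overlap_score` with a sort on item timestamps, and a semantic variant could score items with embedding similarity instead of token overlap.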

### Experimental Results

- **Fine-Tuned Models**:
  - Using retrieval models like Contriever and BM25, results show that personalization significantly improves performance across all tasks.
  - Increase