Abstract: With the rise of Large Language Models (LLMs) such as ChatGPT, researchers have been exploring how to use LLMs for better recommendations. However, although LLMs exhibit black-box and probabilistic characteristics (meaning their internal workings are not visible), the evaluation frameworks used to assess LLM-based recommender systems (RS) are the same as those used for traditional recommender systems. To address this gap, we introduce metamorphic testing for the evaluation of GPT-based RS. This testing technique involves defining metamorphic relations (MRs) between inputs and checking whether each relation is satisfied in the outputs. Specifically, we examine MRs from both the RS and the LLM perspectives, including rating multiplication/shifting in RS and adding spaces/randomness to the LLM prompt via prompt perturbation. Similarity metrics (e.g., Kendall $\tau$ and Rank-Biased Overlap (RBO)) are used to measure whether the relation is satisfied in the outputs of the MRs. Experimental results on the MovieLens dataset with GPT-3.5 show that low similarity is obtained in terms of Kendall $\tau$ and RBO, which indicates the need for a comprehensive evaluation of LLM-based RS in addition to the existing evaluation metrics used for traditional recommender systems.

What problem does this paper attempt to address?

The problem this paper addresses is that the evaluation frameworks currently used to assess LLM-based recommender systems (RS) are the same as those used for traditional recommender systems. This ignores the black-box and probabilistic characteristics of LLMs and can lead to inaccurate evaluation results. Specifically, the internal workings of an LLM are not visible, and the same input may produce different outputs, so traditional evaluation methods cannot fully measure the performance of an LLM-based RS.

To address this issue, the authors introduce Metamorphic Testing (MT) to evaluate GPT-based recommender systems. Metamorphic Testing avoids the test oracle problem by defining Metamorphic Relations (MRs) between inputs and checking whether these relations are satisfied in the outputs. The paper examines four Metamorphic Relations:

1. **Rating Multiplication**: Multiply all ratings by a constant.
2. **Rating Shifting**: Increase or decrease all ratings by a constant.
3. **Adding Spaces**: Insert spaces in the prompt.
4. **Adding Random Words**: Insert random words in the prompt.

These Metamorphic Relations help researchers assess the stability and consistency of GPT-based recommender systems. The experimental results show that the output of the GPT-based recommender system differs significantly across the Metamorphic Relations. In particular, the recommendations change more noticeably when the language structure of the prompt changes. This indicates that evaluation frameworks designed specifically for LLM-based RS are needed, rather than simply reusing traditional evaluation methods.

In summary, this paper explores a new evaluation method, Metamorphic Testing, to evaluate the performance of GPT-based recommender systems more comprehensively and to make up for the deficiencies of existing evaluation frameworks.
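
As a rough illustration of the four relations and the two similarity metrics, the following Python sketch shows one way the checks could be set up. It is not the authors' implementation: all function names, the `p=0.9` persistence value, and the choice of a truncated, normalized RBO are assumptions made for this example, and the call to the LLM itself is left out. Ranked lists are assumed to contain no duplicate items.

```python
import random
from itertools import combinations


def multiply_ratings(ratings, k):
    """MR 1: multiply every rating by a constant k."""
    return {item: r * k for item, r in ratings.items()}


def shift_ratings(ratings, c):
    """MR 2: add a constant c (possibly negative) to every rating."""
    return {item: r + c for item, r in ratings.items()}


def add_spaces(prompt, n, rng):
    """MR 3: insert n extra spaces at random positions in the prompt."""
    chars = list(prompt)
    for _ in range(n):
        chars.insert(rng.randrange(len(chars) + 1), " ")
    return "".join(chars)


def add_random_words(prompt, words, n, rng):
    """MR 4: insert n words drawn from `words` at random token positions."""
    tokens = prompt.split()
    for _ in range(n):
        tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(words))
    return " ".join(tokens)


def kendall_tau(a, b):
    """Kendall tau over the items that appear in both ranked lists (no ties)."""
    pos_b = {item: i for i, item in enumerate(b)}
    common = [item for item in a if item in pos_b]
    if len(common) < 2:
        raise ValueError("need at least two shared items")
    concordant = discordant = 0
    # x precedes y in `a` by construction; check the order in `b`.
    for x, y in combinations(common, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def rbo_truncated(a, b, p=0.9):
    """Truncated RBO, normalized so that identical lists score 1.0.

    This is a simplified variant chosen for the sketch; the variant used
    in the paper is not specified here.
    """
    depth = min(len(a), len(b))
    if depth == 0:
        raise ValueError("both lists must be non-empty")
    seen_a, seen_b = set(), set()
    score = norm = 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        weight = p ** (d - 1)
        agreement = len(seen_a & seen_b) / d  # overlap of the two top-d prefixes
        score += weight * agreement
        norm += weight
    return score / norm
```

In use, the source ranking would come from the recommender given the original ratings and prompt, and the follow-up ranking from the same recommender given the transformed input; a relation is treated as satisfied when `kendall_tau` and `rbo_truncated` between the two rankings are close to 1.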

Metamorphic Evaluation of ChatGPT as a Recommender System

Evaluating ChatGPT as a Recommender System: A Rigorous Approach

Uncovering ChatGPT's Capabilities in Recommender Systems

Is ChatGPT a Good Recommender? A Preliminary Study

Rethinking the Evaluation for Conversational Recommendation in the Era of Large Language Models

Chat-REC: Towards Interactive and Explainable LLMs-Augmented Recommender System

BookGPT: A General Framework for Book Recommendation Empowered by Large Language Model

Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks

ChatGPT for Conversational Recommendation: Refining Recommendations by Reprompting with Feedback

LLMRec: Benchmarking Large Language Models on Recommendation Task

Exploring Recommendation Capabilities of GPT-4V(ision): A Preliminary Case Study

Can ChatGPT advance software testing intelligence? An experience report on metamorphic testing

Sparks of Artificial General Recommender (AGR): Early Experiments with ChatGPT

Sparks of Artificial General Recommender (AGR): Experiments with ChatGPT

Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents

Is ChatGPT a Good NLG Evaluator? A Preliminary Study

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation

Navigating User Experience of ChatGPT-based Conversational Recommender Systems: The Effects of Prompt Guidance and Recommendation Domain