Metamorphic Evaluation of ChatGPT as a Recommender System

Madhurima Khirbat, Yongli Ren, Pablo Castells, Mark Sanderson
2024-11-19
Abstract: With the rise of Large Language Models (LLMs) such as ChatGPT, researchers have been working on how to use LLMs for better recommendations. However, although LLMs exhibit black-box and probabilistic characteristics (meaning their internal workings are not visible), the evaluation frameworks used to assess LLM-based recommender systems (RS) are the same as those used for traditional recommender systems. To address this gap, we introduce metamorphic testing for the evaluation of GPT-based RS. This testing technique involves defining metamorphic relations (MRs) between inputs and checking whether each relation is satisfied in the outputs. Specifically, we examine MRs from both the RS and LLM perspectives, including rating multiplication/shifting in RS and adding spaces/randomness to the LLM prompt via prompt perturbation. Similarity metrics (e.g., Kendall $\tau$ and Rank-Biased Overlap (RBO)) are used to measure whether the relations are satisfied in the outputs. Experimental results on the MovieLens dataset with GPT-3.5 show low similarity in terms of Kendall $\tau$ and RBO, which indicates a need for a comprehensive evaluation of LLM-based RS in addition to the existing evaluation metrics used for traditional recommender systems.
Information Retrieval
What problem does this paper attempt to address?
The problem this paper addresses is that the evaluation frameworks currently used to assess LLM-based recommender systems (RS) are the same as those used for traditional recommender systems. This ignores the black-box and probabilistic characteristics of LLMs and can lead to inaccurate evaluation results. Specifically, the internal workings of an LLM are not visible, and the same input may produce different outputs, so traditional evaluation methods cannot fully measure the performance of an LLM-based RS.

To address this issue, the authors introduce Metamorphic Testing (MT) to evaluate GPT-based recommender systems. Metamorphic Testing avoids the test oracle problem by defining Metamorphic Relations (MRs) between inputs and checking whether these relations are satisfied in the outputs. The paper examines four Metamorphic Relations:

1. **Rating Multiplication**: Multiply all ratings by a constant.
2. **Rating Shifting**: Increase or decrease all ratings by a constant.
3. **Adding Spaces**: Insert spaces in the prompt.
4. **Adding Random Words**: Insert random words in the prompt.

These Metamorphic Relations help researchers assess the stability and consistency of GPT-based recommender systems. The experimental results show that the behavior of the GPT-based recommender system differs significantly across Metamorphic Relations. In particular, the recommendation results change more noticeably when the language structure of the prompt changes. This indicates that evaluation frameworks need to be developed specifically for LLM-based RS, rather than simply reusing traditional evaluation methods.
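The four Metamorphic Relations above can be sketched as input transformations. This is a minimal illustration, not the authors' implementation: the function names, the example ratings, the prompt text, the filler vocabulary, and the constants are all hypothetical, and the expectation that rating multiplication and shifting leave the user's preference order unchanged is an assumption about how such relations are typically checked.

```python
import random

def multiply_ratings(ratings, k):
    """Rating multiplication MR: scale every rating by the constant k."""
    return {item: r * k for item, r in ratings.items()}

def shift_ratings(ratings, c):
    """Rating shifting MR: add the constant c to every rating."""
    return {item: r + c for item, r in ratings.items()}

def add_spaces(prompt, n, rng):
    """Adding spaces MR: insert n extra spaces between words."""
    words = prompt.split()
    for _ in range(n):
        # An empty string between two words becomes a doubled space on join.
        words.insert(rng.randrange(1, len(words)), "")
    return " ".join(words)

def add_random_words(prompt, n, vocab, rng):
    """Adding random words MR: insert n words from vocab at random positions."""
    words = prompt.split()
    for _ in range(n):
        words.insert(rng.randrange(len(words) + 1), rng.choice(vocab))
    return " ".join(words)

def preference_order(ratings):
    """Items sorted from highest to lowest rating (ties broken by name)."""
    return sorted(ratings, key=lambda item: (-ratings[item], item))

# Hypothetical inputs, for illustration only.
rng = random.Random(0)
ratings = {"Toy Story": 5, "Heat": 3, "Casino": 4, "Jumanji": 1}
prompt = "Recommend five movies for a user who rated these films"

doubled = multiply_ratings(ratings, 2)
shifted = shift_ratings(ratings, 1)
spaced = add_spaces(prompt, 3, rng)
noisy = add_random_words(prompt, 2, ["lorem", "ipsum", "dolor"], rng)

# Multiplying by a positive constant or shifting keeps the preference order.
print(preference_order(ratings) == preference_order(doubled))
print(preference_order(ratings) == preference_order(shifted))
```

In a metamorphic test, the original and the transformed input would each be sent to the recommender, and the two returned rankings would then be compared. The model call itself is omitted here because the prompt format used in the paper is not given in this summary.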
In summary, this paper explores a new evaluation method, Metamorphic Testing, to evaluate GPT-based recommender systems more comprehensively and to remedy the deficiencies of existing evaluation frameworks.
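The abstract names Kendall $\tau$ and Rank-Biased Overlap (RBO) as the similarity metrics used to check whether a relation holds between two output rankings. The sketch below shows one plain-Python way to compute both. It assumes Kendall tau-a over two rankings of the same items without ties, and a truncated RBO (the prefix sum up to the shorter list's length, with no extrapolation). The persistence parameter `p = 0.9`, the function names, and the example rankings are illustrative assumptions; the variant and settings used in the paper are not stated in this summary.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall tau-a between two rankings of the same items (no ties)."""
    pos_b = {item: i for i, item in enumerate(b)}
    concordant = discordant = 0
    for x, y in combinations(a, 2):  # x is ranked above y in a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

def rbo_truncated(a, b, p=0.9):
    """Truncated RBO: (1 - p) * sum over depths d of p**(d-1) * overlap_d / d."""
    depth = min(len(a), len(b))
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        agreement = len(seen_a & seen_b) / d  # shared share of the top-d prefixes
        total += p ** (d - 1) * agreement
    return (1 - p) * total

# Hypothetical rankings returned for an original and a perturbed input.
original = ["A", "B", "C", "D"]
perturbed = ["A", "C", "B", "D"]

print(kendall_tau(original, perturbed))
print(rbo_truncated(original, perturbed))
```

Because the sum is truncated, two identical lists of length k score 1 - p^k rather than 1, so scores from short lists are best compared against that ceiling. Kendall $\tau$ as written requires both rankings to contain the same items, whereas RBO also handles lists that only partly overlap, which is one reason the two metrics are often reported together.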