From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications

Yongqiang Ma, Lizhi Qing, Jiawei Liu, Yangyang Kang, Yue Zhang, Wei Lu, Xiaozhong Liu, Qikai Cheng
2024-04-11
Abstract: Evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications. Conventional evaluation methods, typically designed primarily for LLM development, yield numerical scores that ignore the user experience. Therefore, our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications. Our proposed metric, termed "Revision Distance," utilizes LLMs to suggest revision edits that mimic the human writing process. It is determined by counting the revision edits generated by LLMs. Benefiting from the generated revision edit details, our metric can provide a self-explained text evaluation result in a human-understandable manner beyond the context-independent score. Our results show that for the easy-writing task, "Revision Distance" is consistent with established metrics (ROUGE, Bert-score, and GPT-score), but offers more insightful, detailed feedback and better distinguishes between texts. Moreover, in the context of challenging academic writing tasks, our metric still delivers reliable evaluations where other metrics tend to struggle. Furthermore, our metric also holds significant potential for scenarios lacking reference texts.
Computation and Language, Information Retrieval
What problem does this paper attempt to address?
This paper asks how the evaluation of applications built on large language models (LLMs) can shift from a model-centered to a human-centered perspective. Traditional evaluation methods mainly serve model development and produce context-free numerical scores that ignore the user experience. The paper therefore proposes a new evaluation metric, "Revision Distance", which evaluates text quality by simulating the revision edits of the human writing process and so yields detailed, self-explanatory results rather than a bare score.

### Main Contributions:

1. **Emphasize the Perspective of End-Users**: Highlight the perspective of end-users in the text evaluation of LLM-based writing-assistance applications.
2. **Propose a Human-Centered Evaluation Metric**: The proposed "Revision Distance" metric is consistent with actual human editing behavior and provides self-explanatory, fine-grained insights for developers and end-users.
3. **Extensive Experimental Verification**: Verify the effectiveness and practicality of the proposed human-centered evaluation metric on a variety of test tasks.

### Method Overview:

- **Revision Distance**: Quantify text quality by counting how many revisions an LLM-generated text needs to reach a predefined quality threshold. The revision edits are generated by another LLM acting as a user agent, which simulates the editing behavior of real users.
- **Reference-Text and Non-Reference-Text Settings**: When a reference text is available, human-written text or ChatGPT output serves as the standard. When no reference text is available, the model is asked to improve the given text toward an ideal state.
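
The counting step described in the method overview can be sketched as follows. This is a minimal illustration, not the paper's implementation: `RevisionEdit`, `revision_distance`, and `toy_reviser` are hypothetical names, and `toy_reviser` is a deterministic stand-in for the LLM user agent, assumed here only so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RevisionEdit:
    """One suggested edit. Field names are illustrative, not from the paper."""
    operation: str  # e.g. "insert", "delete", "replace"
    target: str     # the text span the edit concerns
    rationale: str  # human-readable reason, which makes the score self-explaining


# A reviser maps (draft, optional reference) to a list of suggested edits.
# In the paper this role is played by an LLM; here it is any callable.
Reviser = Callable[[str, Optional[str]], List[RevisionEdit]]


def revision_distance(draft: str, reviser: Reviser,
                      reference: Optional[str] = None) -> int:
    """Count the edits the reviser proposes; fewer edits means a better draft."""
    return len(reviser(draft, reference))


def toy_reviser(draft: str, reference: Optional[str]) -> List[RevisionEdit]:
    """Stand-in for the LLM agent: flags reference sentences absent from the draft."""
    if reference is None:
        return []  # the toy agent has nothing to compare against
    edits = []
    for sentence in reference.split(". "):
        sentence = sentence.strip(". ")
        if sentence and sentence not in draft:
            edits.append(RevisionEdit("insert", sentence, "missing from draft"))
    return edits


reference = "We propose a metric. It counts edits. It is explainable."
draft_a = "We propose a metric. It counts edits."
draft_b = "We propose a metric."
print(revision_distance(draft_a, toy_reviser, reference))  # 1
print(revision_distance(draft_b, toy_reviser, reference))  # 2
```

Keeping the reviser as a plain callable means the same counting function covers both settings: pass a reference when one exists, or omit it and let the agent revise toward its own notion of an ideal text.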
### Experimental Results:

- **Reference-Text Settings**:
  - For simple writing tasks (such as email, letter, and article generation), Revision Distance is consistent with the other benchmark metrics (ROUGE, BERT-Score, GPT-Score) but provides more detailed feedback.
  - For complex academic writing tasks, Revision Distance still provides stable and reliable evaluation results, while other metrics may perform poorly.
- **Non-Reference-Text Settings**: On the UltraFeedback dataset, Revision Distance agrees with human judgment in approximately 76% of cases, indicating that the human-preferred responses usually require fewer revisions.

### Conclusion:

By introducing the "Revision Distance" metric, this paper shifts text evaluation from a model-centered to a human-centered view. The metric provides more detailed and transparent evaluation results as well as useful feedback for future model improvement. Its main limitations are high computational and financial costs; smaller specialized models could be explored in the future to reduce cost and improve efficiency.
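
An agreement figure such as the one reported for the non-reference-text setting could be computed along these lines. This is a sketch under assumptions: `agreement_rate` is a hypothetical helper, the input pairs are made-up numbers rather than UltraFeedback results, and counting ties as disagreement is a choice made here, not something the source specifies.

```python
from typing import Iterable, Tuple


def agreement_rate(pairs: Iterable[Tuple[int, int]]) -> float:
    """Fraction of pairs where the human-preferred response needs fewer edits.

    Each pair is (distance_of_preferred, distance_of_rejected).
    Ties count as disagreement (an assumption of this sketch).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    agreeing = sum(1 for preferred, rejected in pairs if preferred < rejected)
    return agreeing / len(pairs)


# Made-up distances for four preference pairs.
example_pairs = [(1, 3), (2, 2), (0, 4), (5, 1)]
print(agreement_rate(example_pairs))  # 0.5
```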