Abstract: Evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications. Conventional evaluation methods, typically designed primarily for LLM development, yield numerical scores that ignore the user experience. Therefore, our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications. Our proposed metric, termed "Revision Distance," utilizes LLMs to suggest revision edits that mimic the human writing process. It is determined by counting the revision edits generated by LLMs. Benefiting from the generated revision edit details, our metric can provide a self-explained text evaluation result in a human-understandable manner beyond the context-independent score. Our results show that for the easy-writing task, "Revision Distance" is consistent with established metrics (ROUGE, BERTScore, and GPT-score), but offers more insightful, detailed feedback and better distinguishes between texts. Moreover, in the context of challenging academic writing tasks, our metric still delivers reliable evaluations where other metrics tend to struggle. Furthermore, our metric also holds significant potential for scenarios lacking reference texts.

What problem does this paper attempt to address?

This paper addresses how to shift the evaluation of large language model (LLM) applications from a model-centered to a human-centered perspective. Traditional evaluation methods focus mainly on model development and produce context-independent numerical scores that ignore the user experience. The paper therefore proposes a new evaluation metric, "Revision Distance", which evaluates text quality by simulating the revisions and edits made in the human writing process, yielding more detailed and more explanatory results than a single score.

### Main Contributions:

1. **Emphasize the Perspective of End-Users**: Highlight the perspective of end-users in the text evaluation of LLM-based writing-assistance applications.
2. **Propose a Human-Centered Evaluation Metric**: The proposed "Revision Distance" metric is consistent with actual human editing behavior and provides self-explaining, fine-grained insights for developers and end-users.
3. **Extensive Experimental Verification**: Verify the effectiveness and practicality of the proposed human-centered evaluation metric on a variety of test tasks.

### Method Overview:

- **Revision Distance**: Quantify text quality by counting how many revisions the text generated by a large language model needs to reach a predefined quality threshold. These revision edits are generated by another large language model (acting as a user agent) that simulates the editing behavior of real users.
- **Reference-Text and Non-Reference-Text Settings**: When a reference text is available, human-written text or ChatGPT output serves as the standard; when no reference text is available, the model is asked to improve the given text toward an ideal state.

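
The counting step described in the method overview can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `RevisionEdit` fields, the `Reviser` callable, and `toy_reviser` are hypothetical names introduced here, and the toy reviser is a rule-based stand-in for the LLM user agent that only covers the reference-text setting.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RevisionEdit:
    """One suggested edit. The fields are hypothetical, chosen for illustration."""
    operation: str  # e.g. "insert", "delete", "replace"
    target: str     # the text the edit concerns
    rationale: str  # why the edit is suggested (the self-explaining part)


# A reviser maps (draft, optional reference) to a list of suggested edits.
# In the paper this role is played by an LLM; here it is just a callable.
Reviser = Callable[[str, Optional[str]], List[RevisionEdit]]


def revision_distance(draft: str, reviser: Reviser,
                      reference: Optional[str] = None) -> int:
    """Count the edits the reviser suggests; fewer edits means a better draft."""
    return len(reviser(draft, reference))


def toy_reviser(draft: str, reference: Optional[str]) -> List[RevisionEdit]:
    """Assumed stand-in for an LLM call: flag reference sentences missing from the draft."""
    if reference is None:
        return []  # the toy rule has nothing to compare against
    edits = []
    for sentence in reference.split(". "):
        cleaned = sentence.strip().rstrip(".")
        if cleaned and cleaned not in draft:
            edits.append(RevisionEdit("insert", cleaned, "missing from draft"))
    return edits


reference = "We propose a metric. It counts edits. It is self-explaining."
draft = "We propose a metric."
print(revision_distance(draft, toy_reviser, reference))  # prints 2
```

Keeping the reviser behind a plain callable means the same counting function would serve both settings: a reviser that receives a reference compares against it, while one that receives `None` would have to judge the draft on its own, which this toy rule does not attempt.
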
### Experimental Results:

- **Reference-Text Settings**:
  - For simple writing tasks (such as email, letter, and article generation), Revision Distance is consistent with other benchmark metrics (such as ROUGE, BERTScore, and GPT-score) but provides more detailed feedback.
  - For complex academic writing tasks, Revision Distance still provides stable and reliable evaluation results, while other metrics may perform poorly.
- **Non-Reference-Text Settings**: On the UltraFeedback dataset, Revision Distance agrees with human judgment in approximately 76% of cases, meaning that the response preferred by humans usually requires fewer revisions.

### Conclusion:

The paper shifts text evaluation from a model-centered to a human-centered view by introducing the "Revision Distance" metric, which provides more detailed and transparent evaluation results as well as feedback that can guide future model improvement. The method also has limitations, notably high computational and financial costs. Future work could explore smaller, specialized models to reduce cost and improve efficiency.
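
The agreement figure reported for the non-reference-text setting can be read as a pairwise comparison: for each pair of responses, check whether the human-preferred one has the smaller Revision Distance. The sketch below shows that calculation under stated assumptions. The function name `agreement_rate` is hypothetical, the distances are made-up toy values (not data from the paper), and counting ties as disagreement is an assumption of this sketch.

```python
from typing import Iterable, Tuple


def agreement_rate(pairs: Iterable[Tuple[int, int]]) -> float:
    """Fraction of pairs where the human-chosen response needs fewer revisions.

    Each pair is (distance of chosen response, distance of rejected response).
    Ties count as disagreement; that choice is an assumption of this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    agree = sum(1 for chosen, rejected in pairs if chosen < rejected)
    return agree / len(pairs)


# Toy values for illustration only, not results from the paper.
toy_pairs = [(1, 3), (2, 2), (4, 1), (0, 5)]
print(agreement_rate(toy_pairs))  # prints 0.5
```
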

From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications
