SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text

Reshmi Ghosh, Tianyi Yao, Lizzy Chen, Sadid Hasan, Tianwei Chen, Dario Bernal, Huitian Jiao, H M Sajjad Hossain
2024-11-25
Abstract: Large Language Model (LLM) integrations into applications like the Microsoft 365 suite and Google Workspace for creating/processing documents, emails, presentations, etc. have led to considerable enhancements in productivity and time savings. But as these integrations become more complex, it is paramount to ensure that the output from LLM-integrated applications is relevant and appropriate for use. Identifying the need to develop robust evaluation approaches for natural language generation where references/ground-truth labels don't exist or aren't amply available, this paper introduces a novel framework called "SAGEval", which utilizes a critiquing Agent to provide feedback on scores generated by LLM evaluators. We show that the critiquing Agent is able to rectify scores from LLM evaluators in the absence of references/ground-truth labels, thereby reducing the need for labeled data even for complex NLG evaluation scenarios, like the generation of JSON-structured forms/surveys with responses in different styles such as multiple choice, Likert ratings, single choice questions, etc.
Computation and Language, Multiagent Systems
What problem does this paper attempt to address?
This paper addresses how to evaluate open-ended text in natural language generation (NLG) tasks when no reference texts or annotated data are available. As large language models (LLMs) are integrated into applications such as the Microsoft 365 suite and Google Workspace for creating/processing documents, emails, presentations, etc., the quality of their output becomes crucial. For complex NLG scenarios that lack reference texts or annotated data, however, developing effective evaluation methods is a challenge. The paper introduces a framework named SAGEval, in which a critiquing Agent reviews the scores that LLM evaluators assign to LLM-generated open-ended text and adjusts them in the absence of reference texts, thereby reducing the dependence on annotated data.

### Main contributions of the paper:

1. **Proposing a new framework**: SAGEval is a role-based LLM agent evaluation framework for open-ended, reference-free natural language generation content. Compared with existing LLM evaluation methods, SAGEval is more in line with human preferences.
2. **Demonstrating capabilities**: The framework demonstrates that an LLM evaluator can assume roles and critique scores without reference documents, making up for the deficiencies of existing popular LLM evaluation methods (such as G-Eval).
3. **Releasing a dataset**: To facilitate reproducibility, the paper also releases a dataset and related human annotations.
4. **Expanding evaluation dimensions**: In addition to scoring natural language texts, SAGEval can also propose new evaluation aspects to broaden the coverage of the evaluation.

### Main findings:

1. **Critiquing agent modifies scores**: Once the critiquing agent (the SAGE Agent) is introduced, the score distribution shifts from the higher scores of 4 and 5 toward the lower scores of 3 and 2.
2. **Consistency with human judgment**: The scores modified by the critiquing agent are more consistent with those of human annotators, especially for Accuracy, Audience Understandability, and Audience Engagement. The score consistency of SAGEval is about 20% higher than that of other methods.
3. **Identifying gaps in evaluation criteria**: The SAGE Agent suggests adding new evaluation aspects, such as a Creativity Score and a Content Quality Score, to make the evaluation more comprehensive.

### Conclusion:

SAGEval is presented as the first framework to comprehensively study the evaluation of open-ended, reference-free text. It proposes an evaluation method built around a critiquing agent, which can evaluate LLM-generated open-ended text without labels. This method reduces the dependence on labels or reference texts and opens new avenues for integrating LLMs into products.
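
The evaluate-then-critique flow summarized above can be illustrated with a minimal Python sketch. This is not the paper's implementation: the names `Critique` and `sage_style_evaluate`, the toy evaluator and critic, and the 1 to 5 score range are all assumptions made for illustration, and in practice both callables would wrap LLM calls rather than return fixed values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, int]  # aspect name -> score (assumed 1-5 scale)


@dataclass
class Critique:
    """Hypothetical container for the critiquing agent's feedback."""
    corrections: Scores            # aspect -> corrected score, only where the critic disagrees
    suggested_aspects: List[str]   # new evaluation aspects proposed by the critic


def sage_style_evaluate(
    text: str,
    evaluator: Callable[[str], Scores],
    critic: Callable[[str, Scores], Critique],
) -> Tuple[Scores, List[str]]:
    """Score `text` with an evaluator, then let a critic rectify the scores.

    No reference text or ground-truth label is used at any point.
    """
    initial = evaluator(text)
    critique = critic(text, initial)

    final = dict(initial)
    for aspect, score in critique.corrections.items():
        if aspect in final:
            # Clamp corrected scores to the assumed 1-5 range.
            final[aspect] = max(1, min(5, score))
    return final, critique.suggested_aspects


# Toy stand-ins for the LLM evaluator and the critiquing agent (assumptions).
def toy_evaluator(text: str) -> Scores:
    return {"Accuracy": 5, "Audience Engagement": 4}


def toy_critic(text: str, scores: Scores) -> Critique:
    # The critic judges the Accuracy score too generous and proposes a new aspect.
    return Critique(corrections={"Accuracy": 3},
                    suggested_aspects=["Creativity Score"])


final_scores, new_aspects = sage_style_evaluate(
    "Q1: How satisfied are you with the onboarding process?",
    toy_evaluator,
    toy_critic,
)
print(final_scores)  # {'Accuracy': 3, 'Audience Engagement': 4}
print(new_aspects)   # ['Creativity Score']
```

Keeping the evaluator and critic as plain callables is a design choice of this sketch: it lets the rectification step be tested with fixed stubs before any model is plugged in.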