HumanELY: Human evaluation of LLM yield, using a novel web-based evaluation tool

Raghav Awasthi, Shreya Mishra, Dwarikanath Mahapatra, Ashish Khanna, Kamal Maheshwari, Jacek Cywinski, Frank Papay, Piyush Mathur
DOI: https://doi.org/10.1101/2023.12.22.23300458
2024-12-14
Abstract: Large language models (LLMs) have caught the imagination of researchers, developers and the general public the world over with their potential for transformation. Vast amounts of research and development resources are being devoted to implementing these models in all facets of life. These models are trained with billions of parameters, and various measures of their accuracy and performance have been proposed and used in recent times. While many of the automated natural language assessment parameters measure LLM output performance for use of language, contextual outputs are still hard to measure and quantify. Hence, human evaluation is still an important measure of LLM performance, even though it has been applied variably and inconsistently due to lack of guidance and resource limitations. To provide a structured way to perform comprehensive human evaluation of LLM output, we propose the first guidance and tool, called HumanELY. Our approach and tool, built using prior knowledge, help perform evaluation of LLM outputs in a comprehensive, consistent, measurable and comparable manner. HumanELY comprises five key evaluation metrics: relevance, coverage, coherence, harm and comparison. Additional submetrics within these five key metrics provide for Likert scale based human evaluation of LLM outputs. Our related webtool uses this HumanELY guidance to enable LLM evaluation and provide data for comparison across different users performing human evaluation. While not all metrics may be relevant and pertinent to all outputs, it is important to assess and address their use. Lastly, we demonstrate a comparison of the metrics used in HumanELY against some of the recent publications in the healthcare domain. We focused on the healthcare domain due to the need to demonstrate the highest levels of accuracy and lowest levels of harm in a comprehensive manner. We anticipate our guidance and tool to be used for any domain where LLMs find a use case.
What problem does this paper attempt to address?
This paper addresses deficiencies in current evaluation methods for large language models (LLMs), in particular the lack of consistency and systematic structure in human evaluation. Specifically:

1. **Limitations of existing evaluation methods**:
   - Many existing evaluation methods rely mainly on automated technical metrics. These metrics can measure a model's use of language, but contextual outputs remain difficult to measure and quantify.
   - Human evaluation is regarded as the gold standard for evaluating LLM performance, yet it is applied in varied and inconsistent ways because of a lack of guidance and limited resources.
2. **Challenges of human evaluation**:
   - Human evaluation is subjective, varies among raters, and is labor-intensive, which makes results from different experiments difficult to compare.
   - There is no systematic method for ensuring that evaluations are consistent and repeatable, especially across different tasks and fields.
3. **Cross-domain requirements**:
   - In fields such as healthcare, content generated by LLMs must have the highest accuracy and the lowest harmfulness, which places higher demands on evaluation methods.

To solve these problems, the paper proposes a new framework and tool, **HumanELY**, which aims to provide a structured way to conduct comprehensive, consistent, measurable, and comparable human evaluation. HumanELY includes five key evaluation metrics: Relevance, Coverage, Coherence, Harm, and Comparison. Each metric also contains specific submetrics that human evaluators rate on a Likert scale. The paper also introduces a related web tool. Based on the HumanELY guidelines, this tool enables users to upload reference texts and human-generated texts, so that they can conveniently conduct evaluations and obtain data for comparison among different users.
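The Likert-based scoring described above could be sketched as follows. This is a minimal illustration, not the HumanELY tool's implementation: the five metric names come from the text, while the 5-point scale, the data layout, the function names `metric_means` and `rater_spread`, and the example ratings are all assumptions made here for illustration.

```python
from statistics import mean

# The five key metrics named in the text.
METRICS = ("relevance", "coverage", "coherence", "harm", "comparison")

# Assumption: a 5-point Likert scale; the text does not state the number of points.
LIKERT_MIN, LIKERT_MAX = 1, 5


def metric_means(ratings):
    """Average one rater's submetric scores into one score per key metric.

    `ratings` maps a metric name to a list of Likert scores, one per submetric.
    Metrics with no scores are skipped, since not every metric applies to
    every output.
    """
    result = {}
    for metric, scores in ratings.items():
        if metric not in METRICS:
            raise ValueError(f"unknown metric: {metric}")
        if not scores:
            continue
        for score in scores:
            if not LIKERT_MIN <= score <= LIKERT_MAX:
                raise ValueError(f"score {score} outside Likert range")
        result[metric] = mean(scores)
    return result


def rater_spread(all_ratings):
    """Per metric, return max minus min of the per-rater means.

    A spread of 0 means the raters agree; metrics rated by fewer than two
    raters are omitted because there is nothing to compare.
    """
    per_rater = {rater: metric_means(r) for rater, r in all_ratings.items()}
    spread = {}
    for metric in METRICS:
        values = [m[metric] for m in per_rater.values() if metric in m]
        if len(values) >= 2:
            spread[metric] = max(values) - min(values)
    return spread


# Hypothetical ratings from two raters for a single LLM output.
example = {
    "rater_a": {"relevance": [5, 4], "harm": [1, 1], "coherence": [4]},
    "rater_b": {"relevance": [3, 4], "harm": [1, 2], "coherence": []},
}
print(metric_means(example["rater_a"]))
print(rater_spread(example))
```

The max-minus-min spread is only a simple stand-in for comparing raters; a formal inter-rater agreement statistic could be substituted where a study requires one.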
In summary, the paper's main goal is to enable comprehensive and consistent human evaluation of LLM outputs through the HumanELY framework and tool, thereby better supporting the use of LLMs in various fields, especially those that demand high accuracy and safety, such as healthcare.