Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation

SeongYeub Chu, JongWoo Kim, MunYong Yi

2024-09-11

Abstract: This study introduces InteractEval, a framework that integrates human expertise and Large Language Models (LLMs) using the Think-Aloud (TA) method to generate attributes for checklist-based text evaluation. By combining human flexibility and reasoning with LLM consistency, InteractEval outperforms traditional non-LLM-based and LLM-based baselines across four dimensions: Coherence, Fluency, Consistency, and Relevance. The experiment also investigates the effectiveness of the TA method, showing that it promotes divergent thinking in both humans and LLMs, which leads to a wider range of relevant attributes and enhances text evaluation performance. Comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), whereas LLMs perform better at attributes related to external alignment (Consistency and Relevance). Consequently, leveraging humans and LLMs together produces the best evaluation outcomes. In other words, this study emphasizes the necessity of effectively combining humans and LLMs in an automated checklist-based text evaluation framework. The code is available at https://github.com/BBeeChu/InteractEval.git.

Computation and Language

What problem does this paper attempt to address?

The paper aims to address several key issues in text evaluation. Specifically:

1. **Improving Text Evaluation Methods**: The paper proposes a new framework called InteractEval, which combines the insights of human experts and large language models (LLMs) through the "Think-Aloud" (TA) method to generate attributes for checklist-based text evaluation. This approach aims to overcome the limitations of relying solely on humans or solely on machines, reduce bias, and improve evaluation quality.
2. **Enhancing Evaluation Performance**: By combining the strengths of humans in internal quality (coherence and fluency) with those of LLMs in external alignment (consistency and relevance), InteractEval outperforms traditional non-LLM-based baselines and LLM-based baselines on four dimensions (coherence, fluency, consistency, and relevance).
3. **Validating the Effectiveness of the TA Method**: The experiments show that the TA method promotes divergent thinking in both humans and LLMs, which yields a broader and more relevant set of attributes and in turn improves text evaluation performance.

Overall, the study emphasizes the importance of effectively combining humans and LLMs in an automated checklist-based text evaluation framework.
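To make the idea of checklist-based evaluation with combined attribute sources more concrete, the following is a minimal Python sketch. It is not the authors' implementation: the function names (`merge_attributes`, `build_checklist`, `checklist_score`), the question template, and the "fraction of yes answers" scoring rule are all assumptions made for illustration. In a real pipeline, the yes/no answers would come from an LLM or a human rater.

```python
from typing import Iterable, List


def merge_attributes(human: Iterable[str], llm: Iterable[str]) -> List[str]:
    """Combine two attribute lists, dropping case-insensitive duplicates.

    Hypothetical helper: order is preserved, human attributes come first.
    """
    seen = set()
    merged = []
    for attr in [*human, *llm]:
        cleaned = attr.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return merged


def build_checklist(attributes: Iterable[str]) -> List[str]:
    """Turn each attribute into a yes/no question (assumed template)."""
    return [f"Does the text satisfy this attribute: {a}?" for a in attributes]


def checklist_score(answers: List[bool]) -> float:
    """Score a text as the fraction of checklist questions answered 'yes'.

    This scoring rule is an assumption for the sketch, not the paper's formula.
    """
    if not answers:
        raise ValueError("checklist must contain at least one answer")
    return sum(answers) / len(answers)


# Example with made-up attributes for a single dimension (coherence).
human_attrs = ["Logical flow", "Clear topic sentences"]
llm_attrs = ["logical flow", "Covers key facts"]

attributes = merge_attributes(human_attrs, llm_attrs)
checklist = build_checklist(attributes)
score = checklist_score([True, False, True])  # placeholder answers
```

In this sketch the duplicate attribute "logical flow" is kept only once, so the checklist contains three questions, and the score is the share of those questions answered "yes".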
