Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation

SeongYeub Chu, JongWoo Kim, MunYong Yi
2024-09-11
Abstract: This study introduces InteractEval, a framework that integrates human expertise and Large Language Models (LLMs) using the Think-Aloud (TA) method to generate attributes for checklist-based text evaluation. By combining human flexibility and reasoning with LLM consistency, InteractEval outperforms traditional non-LLM-based and LLM-based baselines across four dimensions: Coherence, Fluency, Consistency, and Relevance. The experiment also investigates the effectiveness of the TA method, showing that it promotes divergent thinking in both humans and LLMs, which yields a wider range of relevant attributes and enhances text evaluation performance. Comparative analysis reveals that humans excel at identifying attributes related to internal quality (Coherence and Fluency), whereas LLMs perform better at attributes related to external alignment (Consistency and Relevance). Consequently, leveraging both humans and LLMs together produces the best evaluation outcomes. In other words, this study emphasizes the necessity of effectively combining humans and LLMs in an automated checklist-based text evaluation framework. The code is available at https://github.com/BBeeChu/InteractEval.git.
Computation and Language
What problem does this paper attempt to address?
The paper addresses three related issues in text evaluation:

1. **Improving Text Evaluation Methods**: The paper proposes a new framework, InteractEval, which combines the insights of human experts and large language models (LLMs) through the "Think-Aloud" (TA) method to generate attributes for checklist-based text evaluation. The approach aims to overcome the limitations of relying solely on humans or solely on machines, reduce bias, and improve evaluation quality.
2. **Enhancing Evaluation Performance**: By combining the strengths of humans on internal quality (coherence and fluency) with those of LLMs on external alignment (consistency and relevance), InteractEval outperforms traditional non-LLM-based and LLM-based baselines on all four dimensions (coherence, fluency, consistency, and relevance).
3. **Validating the Effectiveness of the TA Method**: The experiments show that the TA method promotes divergent thinking in both humans and LLMs, producing a broader and more relevant set of attributes that improves text evaluation performance.

Overall, the study emphasizes the importance of effectively combining humans and LLMs in an automated checklist-based text evaluation framework.
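
To make the idea of checklist-based evaluation from combined attributes concrete, here is a minimal Python sketch. It is not the authors' implementation (that lives in the linked repository). The function names `merge_attributes`, `build_checklist`, and `score_text`, the question template, and the fraction-of-passed-items score are all assumptions made for illustration, and the `judge` callable stands in for whatever model or person answers each checklist question.

```python
from typing import Callable, List


def merge_attributes(human: List[str], llm: List[str]) -> List[str]:
    """Union two attribute lists, dropping case-insensitive duplicates, keeping order."""
    seen = set()
    merged = []
    for attr in human + llm:
        key = attr.strip().lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(attr.strip())
    return merged


def build_checklist(attributes: List[str]) -> List[str]:
    """Turn each attribute into a yes/no question (hypothetical template)."""
    return [f"Does the text satisfy: {attr}?" for attr in attributes]


def score_text(text: str, checklist: List[str],
               judge: Callable[[str, str], bool]) -> float:
    """Return the fraction of checklist questions the judge answers 'yes' to."""
    if not checklist:
        return 0.0
    passed = sum(1 for question in checklist if judge(text, question))
    return passed / len(checklist)


# Toy inputs: attributes as they might come out of human and LLM think-aloud sessions.
human_attrs = ["Logical flow", "Clear transitions"]
llm_attrs = ["logical flow ", "Matches source facts"]

attributes = merge_attributes(human_attrs, llm_attrs)
checklist = build_checklist(attributes)


def toy_judge(text: str, question: str) -> bool:
    # Placeholder judge: fails any question mentioning "source"; a real judge
    # would be an LLM call or a human rating.
    return "source" not in question


score = score_text("An example summary.", checklist, toy_judge)
```

In this toy run the duplicate "logical flow" entry is collapsed, so the checklist has three questions and the placeholder judge passes two of them. Keeping the judge as a parameter lets the same checklist be scored by an LLM, a human, or a test stub.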