HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, Junran Peng, Zhaoxiang Zhang, Songyang Zhang, Kai Chen

2024-09-24

Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities are not well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark to evaluate LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. In addition, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with human evaluation. We have conducted extensive experiments across around 30 mainstream LLMs and observed that current LLMs lack long text generation capabilities. Specifically, first, regardless of whether the instructions include explicit or implicit length constraints, we observe that most LLMs cannot generate text that is longer than 4000 words. Second, we observe that while some LLMs can generate longer text, many issues exist (e.g., severe repetition and quality degradation). Third, to demonstrate the effectiveness of HelloEval, we compare HelloEval with traditional metrics (e.g., ROUGE and BLEU) and LLM-as-a-Judge methods, and the results show that HelloEval has the highest correlation with human evaluation. We release our code at https://github.com/Quehry/HelloBench.
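The two failure modes named in the abstract (output shorter than 4000 words, and severe repetition in longer output) can be checked with very simple statistics. The following is a minimal sketch, not the HelloBench implementation: the whitespace-based word count, the repeated n-gram ratio, and the names `word_count`, `repeated_ngram_ratio`, and `WORD_LIMIT` are all assumptions made for illustration.

```python
from collections import Counter

# Threshold taken from the abstract's observation; the counting rule
# (whitespace-separated tokens) is an assumption of this sketch.
WORD_LIMIT = 4000


def word_count(text: str) -> int:
    """Count whitespace-separated tokens in the text."""
    return len(text.split())


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Return the fraction of n-grams that repeat an earlier n-gram.

    0.0 means every n-gram is unique; values near 1.0 indicate heavy
    repetition. This is a hypothetical proxy, not a metric from the paper.
    """
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def exceeds_limit(text: str, limit: int = WORD_LIMIT) -> bool:
    """True when the text is longer than the given word limit."""
    return word_count(text) > limit
```

A generated response could be passed to `exceeds_limit` to see whether it clears the length threshold, and to `repeated_ngram_ratio` to flag outputs that reach the length only by looping over the same phrases.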

Computation and Language

What problem does this paper attempt to address?

The paper addresses a gap in how large language models (LLMs) are evaluated: their ability to generate long text. Existing research focuses mainly on how well LLMs understand, retrieve, and process long inputs, while the generation of long outputs has received comparatively little attention. To address this gap, the authors propose HelloBench, a comprehensive benchmark for evaluating LLMs on long text generation tasks. HelloBench is based on Bloom's Taxonomy of Educational Objectives and divides long text generation into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. The paper also introduces HelloEval, a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with human judgments. In experiments across around 30 mainstream LLMs, the authors find that most models cannot generate text longer than 4000 words, whether the length constraint in the instruction is explicit or implicit. Some models can produce longer text, but their output suffers from severe repetition and quality degradation. Finally, the authors compare HelloEval with traditional metrics (such as ROUGE and BLEU) and with LLM-as-a-Judge methods, and report that HelloEval has the highest correlation with human evaluation. In short, the paper fills a gap in the evaluation of long text generation and provides an effective evaluation tool.
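The claim that one evaluation method "has the highest correlation with human evaluation" comes down to comparing correlation coefficients between each method's scores and human scores on the same outputs. The sketch below computes a Spearman rank correlation in plain Python. It is an illustration only: the text above does not say which coefficient the authors use, and the function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))
```

To compare candidate metrics under this sketch, one would call `spearman(metric_scores, human_scores)` once per metric on the same set of generated texts and rank the metrics by the resulting coefficient.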

LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models

LOT: A Story-Centric Benchmark for Evaluating Chinese Long Text Understanding and Generation

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

LongGenBench: Long-context Generation Benchmark

ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models

LongLaMP: A Benchmark for Personalized Long-form Text Generation

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

AlignBench: Benchmarking Chinese Alignment of Large Language Models

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

NLPBench: Evaluating Large Language Models on Solving NLP Problems

ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

DHP Benchmark: Are LLMs Good NLG Evaluators?

Language Models can Self-Lengthen to Generate Long Texts