Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities have not been well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark for evaluating LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. In addition, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with it. We have conducted extensive experiments on around 30 mainstream LLMs and observed that current LLMs lack long text generation capabilities. First, regardless of whether the instructions include explicit or implicit length constraints, most LLMs cannot generate text longer than 4000 words. Second, while some LLMs can generate longer text, the output has many issues (e.g., severe repetition and quality degradation). Third, to demonstrate the effectiveness of HelloEval, we compare it with traditional metrics (e.g., ROUGE and BLEU) and LLM-as-a-Judge methods, and the results show that HelloEval has the highest correlation with human evaluation. We release our code at https://github.com/Quehry/HelloBench.

Analysing Data-To-Text Generation Benchmarks

Do Text-to-Vis Benchmarks Test Real Use of Visualisations?

Efficacy of Synthetic Data as a Benchmark

Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework

Automatic Construction of Evaluation Suites for Natural Language Generation Datasets

Evaluating Language Models as Synthetic Data Generators

Data-driven Natural Language Generation: Paving the Road to Success

Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems

TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation

BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation

Striking Gold in Advertising: Standardization and Exploration of Ad Text Generation

Investigating a Benchmark for Training-set free Evaluation of Linguistic Capabilities in Machine Reading Comprehension

Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks

On the Effectiveness of Automated Metrics for Text Generation Systems

HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation

CEval: A Benchmark for Evaluating Counterfactual Text Generation

DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks