NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism

Miao Li, Ming-Bin Chen, Bo Tang, Shengbin Hou, Pengyu Wang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Keming Mao, Peng Cheng, Yi Luo

2024-06-04

Abstract: We present NewsBench, a novel evaluation framework for systematically assessing the editorial capabilities of Large Language Models (LLMs) in Chinese journalism. Our benchmark dataset covers four facets of writing proficiency and six facets of safety adherence. It comprises 1,267 manually and carefully designed test samples, in the form of multiple choice questions and short answer questions, for five editorial tasks across 24 news domains. To measure performance, we propose separate GPT-4 based automatic evaluation protocols that assess LLM generations for short answer questions in terms of writing proficiency and safety adherence; both are validated by high correlations with human evaluations. Using this framework, we conduct a comprehensive analysis of ten popular LLMs that can handle Chinese. The experimental results highlight GPT-4 and ERNIE Bot as top performers, yet reveal a relative deficiency in journalistic safety adherence in creative writing tasks. Our findings also underscore the need for enhanced ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper proposes NewsBench, an evaluation framework that systematically measures how well large language models (LLMs) perform Chinese news editing tasks, focusing on writing proficiency and adherence to safety guidelines. The study constructs a benchmark dataset of multiple choice questions and short answer questions covering four aspects of writing proficiency (language fluency, logical coherence, style alignment, and instruction completion) and six aspects of safety (civic language, bias and discrimination, personal privacy, social harm, news ethics, and illegal activities). Using automated evaluation protocols based on GPT-4, the paper evaluates ten popular LLMs that can handle Chinese and finds that GPT-4 and ERNIE Bot perform best, although the results still show a relative deficiency in journalistic safety adherence in creative writing tasks. The study highlights the need for stronger ethical guidance in machine-generated news content and calls for better alignment of LLMs with journalistic standards and safety considerations. The framework and experimental results give a clearer picture of the editorial capabilities of LLMs and are meant to support their further development for journalism.
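The abstract states that the GPT-4 based protocols are validated by high correlations with human evaluations, but it does not say which correlation statistic was used. As a minimal sketch of what such a check could look like, the Python below computes a Spearman rank correlation between automatic and human scores for the same set of answers. The choice of Spearman, the function names, and the score values are all assumptions made for illustration; none of them come from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for six short-answer generations (not data from the paper).
judge_scores = [3, 1, 2, 3, 2, 1]  # scores assigned by the automatic judge
human_scores = [3, 1, 2, 2, 3, 1]  # scores assigned by human annotators

print(round(spearman(judge_scores, human_scores), 3))  # prints 0.75
```

A value close to 1 would indicate that the automatic judge ranks answers in nearly the same order as the human annotators. In practice such a check would be run per facet (for example, once for a writing proficiency score and once for a safety score) on a much larger annotated sample.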
