F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods

Yu Sun,Keyu Chen,Shujie Wang,Peiji Li,Qipeng Guo,Hang Yan,Xipeng Qiu,Xuanjing Huang,Dahua Lin

2024-08-20

Abstract: Large language models (LLMs) have garnered significant attention for their unprecedented performance, leading to a growing number of studies evaluating LLMs. However, these evaluation benchmarks are limited to assessing instruction-following capabilities, overlooking the fundamental abilities that emerge during the pre-training stage. Previous subjective evaluation methods mainly rely on scoring by API models. However, in the absence of references, large models have shown limited ability to discern subtle differences. To bridge the gap, we propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic. The tasks in F-Eval include multi-choice objective tasks, open-ended objective tasks, reference-based subjective tasks and reference-free subjective tasks. For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models. We conduct evaluations on 13 advanced LLMs. Results show that our evaluation methods achieve higher correlation coefficients and larger distinction than other evaluators. Additionally, we discuss the influence of different model sizes, dimensions, and normalization methods. We anticipate that F-Eval will facilitate the study of LLMs' fundamental abilities.

Computation and Language

What problem does this paper attempt to address?

The paper addresses a gap in current large language model (LLM) evaluation benchmarks: they neglect the models' fundamental capabilities. Specifically, existing benchmarks mainly focus on instruction-following and conversational ability, while ignoring the basic capabilities that have already emerged during the pre-training phase. The paper proposes F-Eval, a bilingual evaluation benchmark designed to assess the fundamental capabilities of LLMs, including expression, commonsense, and logic. F-Eval covers four task types: multiple-choice objective tasks, open-ended objective tasks, reference-based subjective tasks, and reference-free subjective tasks. For the reference-free subjective tasks, it introduces new evaluation methods as alternatives to scoring by API models, aiming to improve the consistency and discriminative power of the evaluations. Experimental results on 13 LLMs show that these methods outperform other evaluators on reference-free subjective tasks, both in correlation with human judgment and in discriminative power. Additionally, the paper discusses the influence of different model sizes, dimensions, and normalization methods.
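To make the two comparison criteria above more concrete, here is a minimal Python sketch of how one might check an automatic evaluator against human ratings. It is not the paper's procedure: the choice of Spearman rank correlation, the use of standard deviation as a rough proxy for "distinction", and all names and sample scores (`human_scores`, `evaluator_scores`, `score_spread`) are assumptions made purely for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def score_spread(scores):
    """Population standard deviation, used here as a crude distinction proxy."""
    mean = sum(scores) / len(scores)
    return sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))


# Hypothetical scores for five model outputs (illustrative data only).
human_scores = [1, 2, 3, 4, 5]
evaluator_scores = [3, 3, 3, 3, 4]  # an evaluator that barely separates outputs

correlation = spearman(human_scores, evaluator_scores)
spread = score_spread(evaluator_scores)
print(f"Spearman correlation with human scores: {correlation:.3f}")
print(f"Score spread of the evaluator: {spread:.3f}")
```

Under these assumptions, an evaluator that assigns nearly identical scores to outputs that humans rate differently would show both a lower rank correlation and a smaller spread, which is the kind of weakness the summary attributes to reference-free scoring by API models.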

C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

What is the best model? Application-driven Evaluation for Large Language Models

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

Evaluating Large Language Models with fmeval

A Survey on Evaluation of Large Language Models

FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models

LalaEval: A Holistic Human Evaluation Framework for Domain-Specific Large Language Models

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

CJEval: A Benchmark for Assessing Large Language Models Using Chinese Junior High School Exam Data

Evaluating Large Language Models at Evaluating Instruction Following

S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models