Abstract: Modern Large Language Models (LLMs) have showcased remarkable prowess in various tasks necessitating sophisticated cognitive behaviors. Nevertheless, a paradoxical performance discrepancy is observed: these models underperform in seemingly elementary tasks like relation extraction and event extraction. This is due to two issues in conventional evaluation: (1) the imprecision of existing evaluation metrics, which struggle to gauge semantic consistency between model outputs and ground truth, and (2) the inherent incompleteness of evaluation benchmarks, primarily due to restrictive human annotation schemas, which results in underestimated LLM performance. Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score. This method utilizes LLMs, fine-tuned on subjective question correction data, to refine the matching between model outputs and golden labels. Additionally, by incorporating a Natural Language Inference (NLI) model, SQC-Score enriches golden labels, addressing benchmark incompleteness by acknowledging correct yet previously omitted answers. Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics. Utilizing SQC-Score, we conduct a comprehensive evaluation of state-of-the-art LLMs and provide insights for future research on information extraction. The dataset and associated code can be accessed at

What problem does this paper attempt to address?

This paper examines why modern large language models (LLMs) appear to perform poorly on information extraction tasks despite their strong performance on tasks requiring complex cognitive behaviors. The authors attribute the gap to two issues with existing evaluation methods. First, the evaluation metrics are imprecise: they fail to measure the semantic consistency between model outputs and ground-truth answers. Second, the evaluation benchmarks are incomplete: restrictive human annotation schemas leave out correct answers, so LLM performance is underestimated.

To address these issues, the paper proposes a new evaluation method called SQC-Score. It improves the matching between model outputs and gold labels by using LLMs fine-tuned on subjective question correction data. It also uses a natural language inference (NLI) model to enrich the gold labels with correct answers that the annotators omitted, which compensates for the incompleteness of the benchmarks.

Experimental results on three information extraction tasks show that human annotators prefer SQC-Score over baseline metrics, and the authors use it to evaluate state-of-the-art LLMs more comprehensively. The paper also points out that while LLMs show potential in some shallow information extraction tasks, challenges remain in tasks requiring strict structured information extraction. In summary, the paper aims to provide a more accurate and comprehensive evaluation method, SQC-Score, to support future work on information extraction.
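The two ideas described above, graded matching and gold-label enrichment, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `enrich_gold` and `soft_f1` are hypothetical, the soft-F1 aggregation is an assumption made for illustration, and the two toy helpers (`toy_match`, `toy_entails`) are simple token-overlap stand-ins for the fine-tuned LLM matcher and the NLI model, neither of which is specified in the text above.

```python
import re
from typing import Callable, List


def enrich_gold(predictions: List[str], gold: List[str], source_text: str,
                entails: Callable[[str, str], bool]) -> List[str]:
    """Add predictions that the source text supports but the annotation omitted."""
    enriched = list(gold)
    for pred in predictions:
        if pred not in enriched and entails(source_text, pred):
            enriched.append(pred)
    return enriched


def soft_f1(predictions: List[str], gold: List[str],
            match_score: Callable[[str, str], float]) -> float:
    """Soft F1: every item is credited with its best graded match in [0, 1]."""
    if not predictions or not gold:
        return 0.0
    precision = sum(max(match_score(p, g) for g in gold)
                    for p in predictions) / len(predictions)
    recall = sum(max(match_score(p, g) for p in predictions)
                 for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def toy_match(pred: str, gold_item: str) -> float:
    """Toy stand-in for the fine-tuned LLM matcher: Jaccard token overlap."""
    a, b = _tokens(pred), _tokens(gold_item)
    return len(a & b) / len(a | b) if a | b else 0.0


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Toy stand-in for an NLI model: every hypothesis token occurs in the premise."""
    return _tokens(hypothesis) <= _tokens(premise)


source = "Marie Curie was born in Warsaw and worked in Paris."
gold = ["Marie Curie born in Warsaw"]  # incomplete annotation
preds = ["Marie Curie born in Warsaw", "Marie Curie worked in Paris"]

baseline = soft_f1(preds, gold, toy_match)              # about 0.83
enriched_gold = enrich_gold(preds, gold, source, toy_entails)
adjusted = soft_f1(preds, enriched_gold, toy_match)     # 1.0
```

In this toy case the second prediction is correct but missing from the annotation, so it lowers precision under the original gold labels and stops doing so once the enrichment step adds it. A real setup would replace both toy helpers with model calls.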

Evaluating Generative Language Models in Information Extraction as Subjective Question Correction

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

A Survey on Evaluation of Large Language Models

Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation

Rethinking the Roles of Large Language Models in Chinese Grammatical Error Correction

Leveraging Large Language Models for NLG Evaluation: Advances and Challenges

Towards Reliable and Fluent Large Language Models: Incorporating Feedback Learning Loops in QA Systems

QUILL: Quotation Generation Enhancement of Large Language Models

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction

Large Language Models for Generative Information Extraction: A Survey

DHP Benchmark: Are LLMs Good NLG Evaluators?

Large Language Models Are Active Critics in NLG Evaluation

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions

Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models

PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language Models