Evaluating Generative Language Models in Information Extraction as Subjective Question Correction

Yuchen Fan, Yantao Liu, Zijun Yao, Jifan Yu, Lei Hou, Juanzi Li
2024-04-04
Abstract: Modern Large Language Models (LLMs) have showcased remarkable prowess in various tasks necessitating sophisticated cognitive behaviors. Nevertheless, a paradoxical performance discrepancy is observed, where these models underperform in seemingly elementary tasks like relation extraction and event extraction due to two issues in conventional evaluation: (1) the imprecision of existing evaluation metrics, which struggle to effectively gauge semantic consistency between model outputs and ground truth, and (2) the inherent incompleteness of evaluation benchmarks, primarily due to restrictive human annotation schemas, resulting in underestimated LLM performance. Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score. This method utilizes LLMs, fine-tuned on subjective question correction data, to refine the matching between model outputs and golden labels. Additionally, by incorporating a Natural Language Inference (NLI) model, SQC-Score enriches golden labels, addressing benchmark incompleteness by acknowledging correct yet previously omitted answers. Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics. Utilizing SQC-Score, we conduct a comprehensive evaluation of state-of-the-art LLMs and provide insights for future research in information extraction. Dataset and associated codes can be accessed at
Computation and Language
What problem does this paper attempt to address?
This paper examines why modern large language models (LLMs) appear to perform poorly on information extraction tasks even though they excel at tasks requiring complex cognitive behaviors. The authors attribute much of the gap to two flaws in conventional evaluation: imprecise metrics, which fail to measure the semantic consistency between model outputs and ground-truth answers, and incomplete benchmarks, whose restrictive human annotation schemas lead to underestimated LLM performance. To address these issues, the paper proposes a new evaluation method called SQC-Score. It uses LLMs fine-tuned on subjective question correction data to improve the matching between model outputs and gold labels. It also uses a natural language inference (NLI) model to enrich the gold labels with correct answers that the annotators omitted, compensating for benchmark incompleteness. Experiments on three information extraction tasks show that human annotators prefer SQC-Score over baseline metrics, and the authors use it to evaluate state-of-the-art LLMs more comprehensively. The paper also notes that while LLMs show potential in some shallow information extraction tasks, challenges remain in tasks that require strictly structured extraction. In summary, the paper aims to provide a more accurate and comprehensive evaluation method, SQC-Score, to support future work in information extraction.
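The two ideas summarized above (soft matching of outputs against gold labels, and enriching gold labels with correct but unannotated answers) can be illustrated with a toy sketch. This is not the authors' implementation: `enrich_gold`, `soft_f1`, `toy_match`, and `toy_entails` are hypothetical names, the token-overlap matcher merely stands in for the fine-tuned LLM scorer, and the token-containment check stands in for the NLI model. The way the scores are aggregated into an F1 value is also an assumption.

```python
from typing import Callable, List


def enrich_gold(predictions: List[str], gold: List[str], source_text: str,
                entails: Callable[[str, str], bool]) -> List[str]:
    """Add predictions that the source text supports but the annotators omitted."""
    enriched = list(gold)
    for pred in predictions:
        if pred not in enriched and entails(source_text, pred):
            enriched.append(pred)
    return enriched


def soft_f1(predictions: List[str], gold: List[str],
            match_score: Callable[[str, str], float]) -> float:
    """F1 in which each item is credited with its best partial-match score in [0, 1]."""
    if not predictions or not gold:
        return 0.0
    precision = sum(max(match_score(p, g) for g in gold) for p in predictions) / len(predictions)
    recall = sum(max(match_score(p, g) for p in predictions) for g in gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def toy_match(pred: str, gold: str) -> float:
    """Hypothetical stand-in for the fine-tuned LLM scorer: token Jaccard overlap."""
    a, b = set(pred.lower().split()), set(gold.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for the NLI model: every hypothesis token occurs in the premise."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


source_text = "Marie Curie was born in Warsaw and worked in Paris"
gold = ["Marie Curie born in Warsaw"]
predictions = ["Marie Curie born in Warsaw", "Marie Curie worked in Paris"]

baseline = soft_f1(predictions, gold, toy_match)
enriched_gold = enrich_gold(predictions, gold, source_text, toy_entails)
enriched = soft_f1(predictions, enriched_gold, toy_match)
print(f"score against original gold: {baseline:.3f}")
print(f"score against enriched gold: {enriched:.3f}")
```

In this toy case the second prediction is supported by the source text but missing from the gold labels, so it is penalized under the original labels and credited once the labels are enriched. A real setup would replace both stand-in functions with model calls.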