Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness

Bo Li, Gexiang Fang, Yang Yang, Quansen Wang, Wei Ye, Wen Zhao, Shikun Zhang
2023-04-23
Abstract: The capability of Large Language Models (LLMs) like ChatGPT to comprehend user intent and provide reasonable responses has made them extremely popular lately. In this paper, we focus on assessing the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks. Specifically, we present a systematic analysis that measures ChatGPT's performance, explainability, calibration, and faithfulness, resulting in 15 keys from either ChatGPT or domain experts. Our findings reveal that ChatGPT's performance in the Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting, as evidenced by human evaluation. In addition, our research indicates that ChatGPT provides high-quality and trustworthy explanations for its decisions. However, ChatGPT tends to be overconfident in its predictions, which results in low calibration. Furthermore, ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. We manually annotate and release the test sets of the 7 fine-grained IE tasks, comprising 14 datasets, to further promote research. The datasets and code are available at https://github.com/pkuserc/ChatGPT_for_IE.
Computation and Language
What problem does this paper attempt to address?
The paper primarily explores the performance and related characteristics of ChatGPT in Information Extraction (IE) tasks. Specifically, the study focuses on the following aspects:

1. **Evaluation Dimensions**: To comprehensively assess ChatGPT's capabilities, the paper considers four dimensions:
   - **Performance**: Evaluating ChatGPT's overall performance across various IE tasks.
   - **Explainability**: Investigating whether ChatGPT can provide reasonable justifications for its predictions, aiding in understanding its decision-making process.
   - **Calibration**: Measuring the degree of uncertainty in ChatGPT's predictions, i.e., whether it is overly confident.
   - **Faithfulness**: Determining whether the explanations provided by ChatGPT faithfully reflect the input text.
2. **Experimental Setup**: The study employs two different setups to test ChatGPT's performance:
   - **Standard-IE Setting**: Providing a predefined set of labels and requiring ChatGPT to select the most appropriate answer from them.
   - **OpenIE Setting**: Not providing a predefined set of labels, instead relying entirely on ChatGPT's understanding to generate predictions.
3. **Research Findings**:
   - Under the Standard-IE Setting, ChatGPT's performance is generally inferior to other benchmark models or domain experts.
   - However, in the OpenIE Setting, ChatGPT exhibits surprisingly good results, especially in tasks such as entity type recognition, named entity recognition, and relation classification.
   - ChatGPT is capable of providing high-quality and credible explanations for its predictions, but in some cases, it shows overconfidence, leading to lower calibration.
   - In most cases, ChatGPT demonstrates a high degree of faithfulness to the original text.

In summary, this paper aims to gain a deep understanding of ChatGPT's actual capabilities and limitations in the field of information extraction through detailed analysis and evaluation.
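
To make the calibration dimension concrete, here is a minimal sketch of how overconfidence can be quantified. It uses expected calibration error (ECE), a common calibration metric. The summary above does not say which metric the paper uses, so ECE is an assumption here. The function name, the equal-width binning, and the sample numbers are likewise illustrative and are not taken from the paper or its repository.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins.

    confidences: model-reported confidence per prediction, each in [0, 1]
    correct:     booleans, True where the prediction matched the gold label
    (Hypothetical helper for illustration; not the paper's code.)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # summed confidence per bin
    hits = [0] * n_bins        # number of correct predictions per bin
    count = [0] * n_bins       # number of predictions per bin

    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hits[idx] += 1 if ok else 0
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        avg_conf = conf_sum[b] / count[b]
        accuracy = hits[b] / count[b]
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (count[b] / total) * abs(accuracy - avg_conf)
    return ece


# Made-up example: four predictions at 95% confidence, only two correct.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, False, False, True]
)
print(round(overconfident, 2))  # about 0.45: confidence far exceeds accuracy
```

A value near 0 means reported confidence tracks accuracy closely. A model that is confident but often wrong, which is the pattern described in the findings above, produces a large gap.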