ClimateX: Do LLMs Accurately Assess Human Expert Confidence in Climate Statements?

Romain Lacombe, Kerrie Wu, Eddie Dilworth
DOI: https://doi.org/10.48550/arXiv.2311.17107
2023-11-28
Abstract: Evaluating the accuracy of outputs generated by Large Language Models (LLMs) is especially important in the climate science and policy domain. We introduce the Expert Confidence in Climate Statements (ClimateX) dataset, a novel, curated, expert-labeled dataset consisting of 8094 climate statements collected from the latest Intergovernmental Panel on Climate Change (IPCC) reports, labeled with their associated confidence levels. Using this dataset, we show that recent LLMs can classify human expert confidence in climate-related statements, especially in a few-shot learning setting, but with limited (up to 47%) accuracy. Overall, models exhibit consistent and significant over-confidence on low and medium confidence statements. We highlight implications of our results for climate communication, LLM evaluation strategies, and the use of LLMs in information retrieval systems.
Machine Learning, Artificial Intelligence, Computation and Language, Computers and Society, Information Retrieval