Assessing Large Language Models on Climate Information

Jannis Bulian, Mike S. Schäfer, Afra Amini, Heidi Lam, Massimiliano Ciaramita, Ben Gaiarin, Michelle Chen Hübscher, Christian Buck, Niels G. Mede, Markus Leippold, Nadine Strauß
2024-05-28
Abstract: As Large Language Models (LLMs) rise in popularity, it is necessary to assess their capability in critically relevant domains. We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. Our framework emphasizes both presentational and epistemological adequacy, offering a fine-grained analysis of LLM generations spanning 8 dimensions and 30 issues. Our evaluation task is a real-world example of a growing number of challenging problems where AI can complement and lift human performance. We introduce a novel protocol for scalable oversight that relies on AI Assistance and raters with relevant education. We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication.
Computation and Language, Artificial Intelligence, Computers and Society, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper evaluates how well large language models (LLMs) communicate climate change information. Specifically, it proposes a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to climate change questions. The framework emphasizes both presentation quality and epistemic quality, and analyzes LLM-generated content along 8 dimensions and 30 issues.

### Main Issues and Challenges

1. **Challenges in Conveying Climate Information**:
   - Climate change is a complex and evolving scientific field that involves a large body of specialized knowledge and considerable uncertainty.
   - The abundance of AI-generated content in the digital media environment, limited attention spans, and adversarial dynamics make these challenges harder.
   - To improve public climate literacy, climate information must be both accurate and easy to understand.
2. **Evaluation of LLMs' Performance**:
   - Current evaluations of LLMs focus mainly on the quality of surface form, while their epistemic quality is assessed far less thoroughly.
   - A systematic approach is needed to evaluate how LLMs communicate climate change information, including accuracy, specificity, completeness, and the handling of uncertainty.
3. **Design of the Evaluation Framework**:
   - The evaluation framework needs to cover both presentation quality and epistemic quality so that it is comprehensive and fine-grained.
   - Evaluation tasks should reflect real-world problems where AI can complement and enhance human performance.
4. **Innovation in Evaluation Methods**:
   - The paper introduces a scalable oversight protocol that relies on AI assistance and raters with relevant educational backgrounds.
   - It evaluates several recent LLMs on a diverse set of climate change questions, revealing a significant gap between surface form quality and epistemic quality.

### Main Findings

- **Epistemic Quality Lower than Presentation Quality**: Current LLMs perform poorly on the epistemic quality of climate change information, especially in specificity, completeness, and uncertainty.
- **Role of AI Assistance**: Introducing AI assistance can significantly improve rating quality, but its broader impact on raters requires further study.
- **Impact of Prompts**: Including the evaluation criteria in the prompt can improve LLMs' performance on epistemic and tonal aspects, but may also reduce surface form quality.
- **Impact of Question Sources**: The source of the questions (e.g., Skeptical Science, Google Trends, and Wikipedia) has little impact on LLMs' performance, though scores for Wikipedia questions are slightly lower.

### Conclusion

This paper proposes a comprehensive framework for evaluating how LLMs communicate climate change information, emphasizing the importance of epistemic quality. The findings indicate that while LLMs excel in surface form quality, there is significant room for improvement in epistemic quality. Introducing AI assistance and including evaluation criteria in prompts can improve results, but the potential negative effects also need to be addressed.
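
### Illustrative Sketch: Measuring the Presentation/Epistemic Gap

The gap between presentation quality and epistemic quality described in this summary can be made concrete with a small aggregation. The sketch below is an illustration only and is not the paper's code. The split of the eight dimensions into two groups of four, the dimension names other than those mentioned in this summary, the 1 to 5 rating scale, and the toy ratings are all assumptions made for this example.

```python
from statistics import mean

# Assumed grouping of the eight dimensions (hypothetical names for illustration).
PRESENTATIONAL = ("style", "clarity", "correctness", "tone")
EPISTEMIC = ("accuracy", "specificity", "completeness", "uncertainty")


def summarize(ratings):
    """Average ratings per dimension group and report the gap.

    `ratings` is a list of dicts, one per rated answer, mapping each
    dimension name to a score on an assumed 1-5 scale.
    """
    def group_mean(dimensions):
        return mean(r[d] for r in ratings for d in dimensions)

    presentational = group_mean(PRESENTATIONAL)
    epistemic = group_mean(EPISTEMIC)
    return {
        "presentational": presentational,
        "epistemic": epistemic,
        "gap": presentational - epistemic,
    }


# Made-up ratings for two answers; these are NOT results from the paper.
toy_ratings = [
    {"style": 5, "clarity": 4, "correctness": 5, "tone": 4,
     "accuracy": 4, "specificity": 2, "completeness": 3, "uncertainty": 3},
    {"style": 4, "clarity": 5, "correctness": 4, "tone": 5,
     "accuracy": 3, "specificity": 3, "completeness": 2, "uncertainty": 2},
]

summary = summarize(toy_ratings)
print(summary)
```

A positive `gap` means answers were rated higher on presentation than on epistemic quality, which is the pattern the paper reports. A real analysis would also need to handle multiple raters per answer and rater disagreement, which this sketch leaves out.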