Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators

Prasoon Bajpai, Niladri Chatterjee, Subhabrata Dutta, Tanmoy Chakraborty

2024-09-21

Abstract: Large Language Models (LLMs) and AI assistants driven by these models are experiencing exponential growth in usage among both expert and amateur users. In this work, we focus on evaluating the reliability of current LLMs as science communicators. Unlike existing benchmarks, our approach emphasizes assessing these models on scientific question-answering tasks that require a nuanced understanding and awareness of answerability. We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts, along with a benchmarking suite that evaluates LLMs for correctness and consistency across various criteria. We benchmark three proprietary LLMs from the OpenAI GPT family and 13 open-access LLMs from the Meta Llama-2, Llama-3, and Mistral families. While most open-access models significantly underperform compared to GPT-4 Turbo, our experiments identify Llama-3-70B as a strong competitor, often surpassing GPT-4 Turbo in various evaluation aspects. We also find that even the GPT models exhibit a general incompetence in reliably verifying LLM responses. Moreover, we observe an alarming trend where human evaluators are deceived by incorrect responses from GPT-4 Turbo.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper evaluates how reliable current large language models (LLMs) are as science communicators. It focuses on four research questions:

1. **Can existing LLMs successfully and faithfully answer scientific reasoning questions that require understanding the nuances of scientific knowledge?**
2. **Can LLMs effectively avoid assertive responses when faced with open scientific questions?**
3. **Can LLMs successfully verify responses generated by other LLMs?**
4. **Are human evaluators misled by incorrect but confident answers from LLMs on complex scientific questions?**

To answer these questions, the authors introduce a new dataset, SCiPS-QA (Scientifically Challenging Problems - Question Answering), which contains 742 complex Boolean (Yes/No) scientific questions covering multiple disciplines such as physics, chemistry, mathematics, theoretical computer science, astronomy, economics, and biology. The questions include both closed questions with definite answers and open questions without clear answers.

Using this dataset, the authors benchmark proprietary LLMs from the OpenAI GPT series and open-access LLMs from the Meta Llama-2, Llama-3, and Mistral series, evaluating their performance in terms of correctness, faithfulness, and hallucination. The paper also examines how well LLMs can verify generated responses, and has human evaluators assess responses generated by GPT-4 Turbo. It finds that human evaluators can be misled by answers that are incorrect but confidently stated.

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios

Large Language Models as Evaluators for Scientific Synthesis

Exploring the psychology of LLMs' Moral and Legal Reasoning

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

"Which LLM should I use?": Evaluating LLMs for tasks performed by Undergraduate Computer Science Students

Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics

Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models

Understanding and Mitigating Language Confusion in LLMs

Are Large Language Models Good Statisticians?

Exploring the Responses of Large Language Models to Beginner Programmers' Help Requests

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Assessing Large Language Models on Climate Information

See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses

Easy Problems That LLMs Get Wrong

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

Evaluating Language Models for Generating and Judging Programming Feedback

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve