Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators

Prasoon Bajpai, Niladri Chatterjee, Subhabrata Dutta, Tanmoy Chakraborty
2024-09-21
Abstract: Large Language Models (LLMs) and AI assistants driven by these models are experiencing exponential growth in usage among both expert and amateur users. In this work, we focus on evaluating the reliability of current LLMs as science communicators. Unlike existing benchmarks, our approach emphasizes assessing these models on scientific question-answering tasks that require a nuanced understanding and awareness of answerability. We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts, along with a benchmarking suite that evaluates LLMs for correctness and consistency across various criteria. We benchmark three proprietary LLMs from the OpenAI GPT family and 13 open-access LLMs from the Meta Llama-2, Llama-3, and Mistral families. While most open-access models significantly underperform compared to GPT-4 Turbo, our experiments identify Llama-3-70B as a strong competitor, often surpassing GPT-4 Turbo in various evaluation aspects. We also find that even the GPT models exhibit a general incompetence in reliably verifying LLM responses. Moreover, we observe an alarming trend where human evaluators are deceived by incorrect responses from GPT-4 Turbo.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper evaluates how reliable current large language models (LLMs) are as science communicators. Specifically, it focuses on the following research questions:

1. **Can existing LLMs successfully and faithfully answer scientific reasoning questions that require understanding the nuances of scientific knowledge?**
2. **Can LLMs effectively avoid assertive responses when faced with open scientific questions?**
3. **Can LLMs successfully verify responses generated by other LLMs?**
4. **Are human evaluators misled by incorrect but confident answers from LLMs on complex scientific questions?**

To answer these questions, the authors introduce a new dataset, SCiPS-QA (Scientifically Challenging Problems - Question Answering), which contains 742 complex Boolean scientific questions covering multiple disciplines such as physics, chemistry, mathematics, theoretical computer science, astronomy, economics, and biology. The dataset includes both closed questions, which have definite answers, and open questions, which have no clear answer.

Using this dataset, the authors benchmark proprietary LLMs from the OpenAI GPT series and open-access LLMs from the Meta Llama-2, Llama-3, and Mistral series, evaluating their performance in terms of correctness, faithfulness, and hallucination. The paper also examines whether LLMs can verify generated responses, and has human evaluators assess responses generated by GPT-4 Turbo. It finds that human evaluators can be misled by answers that are incorrect but confidently stated.
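To make the idea of scoring a model for correctness and consistency on Boolean questions concrete, the following is a minimal sketch and not the authors' benchmarking suite. It assumes each question is asked several times, that answers are normalized to the hypothetical labels `"yes"`, `"no"`, and `"unknown"` (the last standing in for open questions), and that the function names `majority_answer`, `accuracy`, and `consistency` are illustrative only. The metric definitions here are simple stand-ins, not the criteria defined in the paper.

```python
from collections import Counter


def majority_answer(samples):
    """Return the most frequent answer among repeated samples for one question."""
    return Counter(samples).most_common(1)[0][0]


def accuracy(gold, responses):
    """Fraction of questions whose majority answer matches the gold label."""
    correct = sum(majority_answer(r) == g for g, r in zip(gold, responses))
    return correct / len(gold)


def consistency(responses):
    """Mean share of samples that agree with the per-question majority answer."""
    shares = [Counter(r).most_common(1)[0][1] / len(r) for r in responses]
    return sum(shares) / len(shares)


# Toy data (made up for illustration): three questions, four samples each.
# "unknown" is an assumed label for open questions without a clear answer.
gold = ["yes", "no", "unknown"]
responses = [
    ["yes", "yes", "yes", "no"],
    ["no", "no", "no", "no"],
    ["yes", "unknown", "yes", "yes"],
]

print(f"accuracy:    {accuracy(gold, responses):.3f}")
print(f"consistency: {consistency(responses):.3f}")
```

In this toy data the third question is an open one that the model answers assertively. That lowers accuracy while consistency stays fairly high, which is why the two scores are worth reporting separately.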