Evaluating Consistency and Reasoning Capabilities of Large Language Models

Yash Saxena, Sarthak Chopra, Arunendra Mani Tripathi
2024-04-25
Abstract: Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance, for tasks such as text generation, summarization, and translation. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This behavior can be attributed to several factors, with consistency and reasoning capabilities being significant contributors. LLMs frequently lack the ability to generate explanations and engage in coherent reasoning, leading to inaccurate responses. Moreover, they exhibit inconsistencies in their outputs. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs. The experiments utilize the BoolQ dataset as the ground truth, comprising questions, answers, and corresponding explanations. Queries from the dataset are presented as prompts to the LLMs, and the generated responses are evaluated against the ground truth answers. Additionally, explanations are generated to assess the models' reasoning abilities. Consistency is evaluated by repeatedly presenting the same query to the models and observing variations in their responses. For measuring reasoning capabilities, the generated explanations are compared to the ground truth explanations using metrics such as BERT, BLEU, and F-1 scores. The findings reveal that proprietary models generally outperform public models in terms of both consistency and reasoning capabilities. However, even when presented with basic general knowledge questions, none of the models achieved a score of 90% in both consistency and reasoning. This study underscores the direct correlation between consistency and reasoning abilities in LLMs and highlights the inherent reasoning challenges present in current language models.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper evaluates and compares large language models (LLMs) on two capabilities: consistency and reasoning. Specifically, it focuses on the following points:

1. **Consistency problem**: LLMs are inconsistent when generating responses; for the same query, a model may give different answers.
2. **Lack of reasoning ability**: LLMs perform poorly at providing explanations and reasoning to support their answers, and often generate wrong or misleading information, a phenomenon called "hallucination".

To study these problems, the paper uses the BoolQ dataset as a benchmark. The dataset contains yes/no questions together with their correct answers and explanations. By presenting these questions as prompts to different LLMs and evaluating the answers and explanations they generate, the researchers aim to answer the following questions:

- How consistent are LLMs? Do they return the same result when the same query is repeated?
- How strong is the reasoning ability of LLMs? Are the explanations they generate reasonable and accurate?
- Do public and proprietary models differ in consistency and reasoning ability?

### Method overview

To evaluate consistency, the researchers submitted the same query to each model three times and checked whether the answers agreed across all runs. To evaluate reasoning ability, they used metrics such as BERT, BLEU, and F-1 scores to measure the similarity between the generated explanations and the reference explanations provided in the dataset.

### Main findings

- **Consistency**: All models showed high inconsistency when providing wrong answers, and those wrong answers were accompanied by wrong explanations.
- **Reasoning ability**: The models hallucinate significantly when generating explanations, which points to underlying weaknesses in their reasoning.
- **Model comparison**: Proprietary models (such as the GPT series) generally outperform public models (such as LLaMA and Mistral) in both consistency and reasoning ability.
- **Hallucination problem**: Even on basic general knowledge questions, no model reaches a score of 90% in both consistency and reasoning, which shows the limitations of current LLMs.

In general, the paper reveals the challenges LLMs face in consistency and reasoning ability and emphasizes the importance of improving these models to make them more reliable.
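
The two measurements described in the method overview can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the answer normalization, and the sample data are hypothetical, and the token-overlap F1 shown here is only one common way to define an F-1 score between two texts, since the summary does not state the exact definition used.

```python
from collections import Counter


def is_consistent(answers):
    """Return True when every repeated answer to one question is identical.

    Assumption: answers are compared after trimming whitespace and lowercasing.
    """
    normalized = [a.strip().lower() for a in answers]
    return len(set(normalized)) == 1


def consistency_rate(runs):
    """Fraction of questions whose repeated answers all agree.

    `runs` maps a question id to the answers from repeated queries
    (three per question in the setup described above).
    """
    if not runs:
        return 0.0
    consistent = sum(is_consistent(answers) for answers in runs.values())
    return consistent / len(runs)


def token_f1(generated, reference):
    """Token-overlap F1 between a generated and a reference explanation.

    Assumption: whitespace tokenization on lowercased text.
    """
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers from three repeated queries per question.
runs = {
    "q1": ["yes", "Yes", "yes"],
    "q2": ["no", "yes", "no"],
}
print(consistency_rate(runs))
print(token_f1("the sky is blue", "the sky appears blue"))
```

In this toy data, one of the two questions receives the same answer in all three runs, so the consistency rate is 0.5. A surface-overlap score such as this F1 rewards shared wording only; embedding-based metrics are typically used alongside it to capture paraphrases.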