Abstract: Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance, for tasks such as text generation, summarization, and translation. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This behavior can be attributed to several factors, with consistency and reasoning capabilities being significant contributors. LLMs frequently lack the ability to generate explanations and engage in coherent reasoning, leading to inaccurate responses. Moreover, they exhibit inconsistencies in their outputs. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs. The experiments utilize the BoolQ dataset as the ground truth, comprising questions, answers, and corresponding explanations. Queries from the dataset are presented as prompts to the LLMs, and the generated responses are evaluated against the ground truth answers. Additionally, explanations are generated to assess the models' reasoning abilities. Consistency is evaluated by repeatedly presenting the same query to the models and observing for variations in their responses. For measuring reasoning capabilities, the generated explanations are compared to the ground truth explanations using metrics such as BERT, BLEU, and F-1 scores. The findings reveal that proprietary models generally outperform public models in terms of both consistency and reasoning capabilities. However, even when presented with basic general knowledge questions, none of the models achieved a score of 90% in both consistency and reasoning. This study underscores the direct correlation between consistency and reasoning abilities in LLMs and highlights the inherent reasoning challenges present in current language models.

What problem does this paper attempt to address?

This paper evaluates and compares the consistency and reasoning ability of large language models (LLMs). Specifically, it focuses on two issues:

1. **Consistency problem**: LLMs are inconsistent when generating responses; for the same query, a model may give different answers.
2. **Lack of reasoning ability**: LLMs perform poorly at providing explanations and reasoning to support their answers, and often generate wrong or misleading information, a phenomenon called "hallucination".

To study these problems, the paper uses the BoolQ dataset as a benchmark. The dataset contains yes/no questions together with their correct answers and explanations. By presenting these questions as prompts to different LLMs and evaluating the answers and explanations they generate, the researchers aim to answer the following questions:

- How consistent are LLMs? Do they give the same result when the same query is repeated?
- How strong is the reasoning ability of LLMs? Are the explanations they generate reasonable and accurate?
- Do public and proprietary models differ in consistency and reasoning ability?

### Method overview

To evaluate consistency, the researchers submitted the same query three times to each model and checked whether the answers agreed across runs. To evaluate reasoning ability, they used metrics such as BERT, BLEU, and F-1 to measure the similarity between the generated explanations and the reference explanations provided in the dataset.

### Main findings

- **Consistency**: All models were highly inconsistent on questions they answered incorrectly, and the wrong answers were accompanied by wrong explanations.
- **Reasoning ability**: The models hallucinate noticeably when generating explanations, which points to inherent weaknesses in their reasoning.
- **Model comparison**: Proprietary models (such as the GPT series) generally outperform public models (such as LLaMA and Mistral) in both consistency and reasoning ability.
- **Hallucination problem**: Even on basic general-knowledge questions, no model reached a score of 90% in both consistency and reasoning, which shows the limits of current LLMs.

In general, this paper reveals the challenges LLMs face in consistency and reasoning ability and emphasizes the importance of improving these models to enhance their reliability and performance.
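The two measurements described in the method overview can be illustrated with a minimal sketch. It assumes that consistency means all repeated answers to a query match after trimming and lowercasing, and that F-1 is a token-overlap score between a generated and a reference explanation; the summary does not specify either definition, so both are assumptions. The function names `is_consistent`, `consistency_rate`, and `token_f1` are hypothetical, and the sample answers are invented for illustration. BERT- and BLEU-based scores are omitted because they require external libraries.

```python
from collections import Counter


def is_consistent(answers):
    """Return True when every repeated answer to one query is identical.

    Assumption: answers are compared after trimming and lowercasing.
    """
    normalized = {a.strip().lower() for a in answers}
    return len(normalized) == 1


def consistency_rate(runs_per_question):
    """Fraction of questions whose repeated answers all agree."""
    if not runs_per_question:
        return 0.0
    agreeing = sum(is_consistent(runs) for runs in runs_per_question)
    return agreeing / len(runs_per_question)


def token_f1(generated, reference):
    """Token-overlap F1 between a generated and a reference explanation.

    Assumption: whitespace tokenization on lowercased text.
    """
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: three repeated runs for each of two questions.
runs = [["yes", "Yes", "yes "], ["no", "yes", "no"]]
print(consistency_rate(runs))  # 0.5
print(token_f1("the sky is blue", "the sky appears blue"))  # 0.75
```

In this sketch a question counts as consistent only when all three runs agree, which is a strict reading of the method; a majority-vote variant would give higher rates on the same data.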

Evaluating Consistency and Reasoning Capabilities of Large Language Models

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

Can Large Language Models Act as Symbolic Reasoners?

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Large Language Models Are Not Strong Abstract Reasoners

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

Aligning with Logic: Measuring, Evaluating and Improving Logical Consistency in Large Language Models

Large Language Models Cannot Self-Correct Reasoning Yet

Towards Logically Consistent Language Models via Probabilistic Reasoning

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models

Semantic Consistency for Assuring Reliability of Large Language Models

GLoRE: Evaluating Logical Reasoning of Large Language Models

Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate

Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

LLMs for Relational Reasoning: How Far are We?