Abstract: Large Language Models (LLMs) are used extensively across sectors including academia, research, business, and finance, for tasks such as text generation, summarization, and translation. Despite their widespread adoption, these models often hallucinate, producing incorrect and misleading information. This behavior can be attributed to several factors, with weak consistency and reasoning capabilities being significant contributors. LLMs frequently fail to generate explanations and reason coherently, which leads to inaccurate responses, and their outputs are inconsistent. This paper evaluates and compares the consistency and reasoning capabilities of both public and proprietary LLMs. The experiments use the BoolQ dataset as the ground truth, comprising questions, answers, and corresponding explanations. Queries from the dataset are presented as prompts to the LLMs, and the generated responses are evaluated against the ground truth answers. Explanations are also generated to assess the models' reasoning abilities. Consistency is evaluated by repeatedly presenting the same query to the models and observing variations in their responses. To measure reasoning capabilities, the generated explanations are compared to the ground truth explanations using metrics such as BERTScore, BLEU, and F1 scores. The findings reveal that proprietary models generally outperform public models in both consistency and reasoning capabilities. However, even when presented with basic general knowledge questions, none of the models achieved a score of 90% in both consistency and reasoning. This study underscores the direct correlation between consistency and reasoning abilities in LLMs and highlights the inherent reasoning challenges present in current language models.
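
The abstract does not give exact formulas for its scores, so the following is only a minimal sketch of one plausible reading of the evaluation. It assumes consistency is the fraction of repeated responses that agree with the most common answer, and it uses a simple token-overlap F1 as a stand-in for comparing a generated explanation with a ground truth explanation. The function names `consistency_score` and `token_f1` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def consistency_score(responses):
    """Fraction of repeated responses that match the most common answer.

    Assumption: responses are short answers (e.g. "yes"/"no") that can be
    compared after trimming whitespace and lowercasing.
    """
    if not responses:
        raise ValueError("need at least one response")
    normalized = [r.strip().lower() for r in responses]
    _, top_count = Counter(normalized).most_common(1)[0]
    return top_count / len(normalized)


def token_f1(generated, reference):
    """Token-overlap F1 between a generated and a reference explanation.

    Assumption: whitespace tokenization on lowercased text; the paper may
    tokenize or score differently.
    """
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, four repeated answers of which three agree would give a consistency of 0.75 under this definition. BERTScore and BLEU would normally come from existing libraries rather than being reimplemented.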

Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate

Argumentative Large Language Models for Explainable and Contestable Decision-Making

Aligning with Logic: Measuring, Evaluating and Improving Logical Consistency in Large Language Models

Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency

Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs

Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models

An Empirical Analysis on Large Language Models in Debate Evaluation

Evaluating Consistency and Reasoning Capabilities of Large Language Models

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Enhancing Multi-Agent Consensus through Third-Party LLM Integration: Analyzing Uncertainty and Mitigating Hallucinations in Large Language Models

Aligning Large Language Models for Faithful Integrity Against Opposing Argument

MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate

Evaluating the Performance of Large Language Models via Debates

Logical Consistency of Large Language Models in Fact-checking

LLM2: Let Large Language Models Harness System 2 Reasoning

Limits of Large Language Models in Debating Humans

Argumentation Computation with Large Language Models: A Benchmark Study

Multi-Model Consistency for LLMs’ Evaluation

MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration

Exploring the Potential of Large Language Models in Computational Argumentation