Abstract: Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance, for tasks such as text generation, summarization, and translation. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This behavior can be attributed to several factors, with weak consistency and limited reasoning capabilities being significant contributors. LLMs frequently fail to generate sound explanations or reason coherently, which leads to inaccurate responses, and their outputs vary from one run to the next. This paper evaluates and compares the consistency and reasoning capabilities of both public and proprietary LLMs. The experiments use the BoolQ dataset as the ground truth, comprising questions, answers, and corresponding explanations. Queries from the dataset are presented as prompts to the LLMs, and the generated responses are evaluated against the ground truth answers. Additionally, explanations are generated to assess the models' reasoning abilities. Consistency is evaluated by repeatedly presenting the same query to the models and observing variations in their responses. To measure reasoning capabilities, the generated explanations are compared to the ground truth explanations using metrics such as BERTScore, BLEU, and F1 scores. The findings reveal that proprietary models generally outperform public models in terms of both consistency and reasoning capabilities. However, even when presented with basic general knowledge questions, none of the models achieved a score of 90\% in both consistency and reasoning. This study underscores the direct correlation between consistency and reasoning abilities in LLMs and highlights the inherent reasoning challenges present in current language models.
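The evaluation procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give exact formulas, so everything here is an assumption made for illustration: consistency is taken as the share of repeated responses that agree with the most frequent answer, and the explanation comparison is a simple token-overlap F1. The function names `consistency_score` and `token_f1` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def consistency_score(responses):
    """Share of repeated responses that match the most common answer.

    Assumption: answers are compared after trimming whitespace and
    lowercasing; the paper may normalize or score differently.
    """
    normalized = [r.strip().lower() for r in responses]
    if not normalized:
        raise ValueError("need at least one response")
    _, majority_count = Counter(normalized).most_common(1)[0]
    return majority_count / len(normalized)


def token_f1(generated, reference):
    """Token-overlap F1 between a generated and a reference explanation.

    Assumption: whitespace tokenization on lowercased text; this is a
    stand-in for the F1 metric named in the abstract, not its definition.
    """
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    if not gen_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

In use, the same BoolQ question would be sent to a model several times, the yes/no answers passed to `consistency_score`, and each generated explanation compared with the reference explanation via `token_f1`. Semantic metrics such as BERTScore would require an embedding model and are left out of this sketch.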

Multi-Model Consistency for LLMs’ Evaluation

MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)

Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency

Evaluating the Consistency of LLM Evaluators

Evaluating Knowledge-based Cross-lingual Inconsistency in Large Language Models

Multi-Perspective Consistency Enhances Confidence Estimation in Large Language Models

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Aligning with Logic: Measuring, Evaluating and Improving Logical Consistency in Large Language Models

S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models

Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs

Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate

Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Assessing the Reliability of Large Language Model Knowledge

Evaluating Factual Consistency of Summaries with Large Language Models

Evaluating Consistency and Reasoning Capabilities of Large Language Models

Semantic Consistency for Assuring Reliability of Large Language Models

FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition