Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues (e.g. LLMs can react differently to disturbances like rephrasing or inconsequential order change). In addition to these inconsistencies, we also observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of consistency score to quantitatively measure this inconsistency and analyze the potential for improvement in consistency by relative consistency score. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2% but is still inconsistent to specific questions due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions also exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.

What problem does this paper attempt to address?

The paper examines an inconsistency in large language models (LLMs): a model that can solve a harder problem may still fail on an easier one. Its key points are:

1. **Research Background**: Despite their strong performance on natural language processing tasks, large language models behave inconsistently on some simple problems. For example, a model can correctly solve complex problems but still make mistakes on relatively simple ones.
2. **Problem Definition**: The paper defines the problem of "consistency from hard to easy" and demonstrates it through examples. For instance, a model can solve complex math problems but may err on simple addition or multiplication problems.
3. **Benchmarking**: To evaluate this consistency issue systematically, the authors developed a benchmark called ConsisEval, which includes data from three domains: code, mathematics, and instruction following. Each data entry contains a pair of problems with a strict difficulty order (one easier, one harder).
4. **Evaluation Metrics**: The paper proposes a new evaluation metric, the Consistency Score (CS), to quantify how consistently a model solves the easier problems. It also introduces a Relative Consistency Score (RCS) to analyze the potential for improving consistency at a fixed level of capability.
5. **Experimental Results**: Extensive experiments on a variety of large language models show that GPT-4 achieves the highest consistency score, 92.2%, but still behaves inconsistently in specific cases. Overall, more capable models generally show higher consistency, though exceptions exist.
6. **Further Analysis**: The paper also analyzes how the difficulty of training data affects consistency, finding that training sets containing harder data help improve model consistency. Using harder examples in in-context learning also improves consistency.

In summary, the paper systematically studies the consistency of large language models on simple problems, proposes corresponding evaluation methods, and provides experimental analysis that can serve as a reference for further research.
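To make the Consistency Score more concrete, here is a minimal sketch of one way such a metric could be computed. It assumes the score is estimated as the fraction of question pairs whose easier question is answered correctly, among the pairs whose harder question is answered correctly. That reading is an assumption based on the summary above, not the paper's exact formula, and the sketch does not cover the Relative Consistency Score. The `PairResult` record, the `consistency_score` function, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PairResult:
    """Outcome for one benchmark entry (hypothetical record type)."""
    easy_correct: bool  # did the model solve the easier question?
    hard_correct: bool  # did the model solve the harder question?


def consistency_score(results: Iterable[PairResult]) -> Optional[float]:
    """Estimate P(easy correct | hard correct) from per-pair outcomes.

    Assumption: this conditional frequency is only an illustrative
    stand-in for the paper's Consistency Score.
    Returns None when no hard question was solved, because the
    conditional frequency is undefined in that case.
    """
    hard_solved = [r for r in results if r.hard_correct]
    if not hard_solved:
        return None
    easy_also_solved = sum(r.easy_correct for r in hard_solved)
    return easy_also_solved / len(hard_solved)


# Made-up outcomes for four question pairs (not real benchmark data).
results = [
    PairResult(easy_correct=True, hard_correct=True),
    PairResult(easy_correct=False, hard_correct=True),   # hard solved, easy failed
    PairResult(easy_correct=True, hard_correct=False),   # ignored: hard not solved
    PairResult(easy_correct=True, hard_correct=True),
]

score = consistency_score(results)
print(f"consistency score: {score:.3f}")
```

In this sketch, pairs where the harder question was not solved do not affect the score at all, which is what separates a consistency measure from plain accuracy on the easier questions.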

Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Self-Consistency of Large Language Models under Ambiguity

Are Large Language Models Consistent over Value-laden Questions?

Examining Inter-Consistency of Large Language Models Collaboration: An In-depth Analysis via Debate

Evaluating Consistency and Reasoning Capabilities of Large Language Models

Enhancing Large Language Models in Coding Through Multi-Perspective Self-Consistency

Semantic Consistency for Assuring Reliability of Large Language Models

Improving the Robustness of Large Language Models via Consistency Alignment

Large Language Models are Inconsistent and Biased Evaluators

Evaluating the Consistency of LLM Evaluators

Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Larger and more instructable language models become less reliable

Knowledge-based Consistency Testing of Large Language Models

MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)

Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models

Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Multi-Model Consistency for LLMs’ Evaluation

Aligning with Logic: Measuring, Evaluating and Improving Logical Consistency in Large Language Models

Easy Problems That LLMs Get Wrong

Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers

Consistency Matters: Explore LLMs Consistency From a Black-Box Perspective