Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?

Zhe Yang, Yichang Zhang, Tianyu Liu, Jian Yang, Junyang Lin, Chang Zhou, Zhifang Sui
2024-06-19
Abstract: Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues (e.g. LLMs can react differently to disturbances like rephrasing or inconsequential order change). In addition to these inconsistencies, we also observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of consistency score to quantitatively measure this inconsistency and analyze the potential for improvement in consistency by relative consistency score. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2% but is still inconsistent to specific questions due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions also exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.
Computation and Language
What problem does this paper attempt to address?
The paper examines a specific inconsistency in large language models (LLMs): a model that can solve a hard problem may still fail on an easier, related one. Its key points are:

1. **Research Background**: Despite their strong performance on natural language processing tasks, LLMs behave inconsistently. For example, a model can solve a complex problem correctly yet make mistakes on a relatively simple one.
2. **Problem Definition**: The paper defines this "hard-to-easy inconsistency" and illustrates it with examples. For instance, a model can solve complex math problems but may err on simple addition or multiplication problems.
3. **Benchmarking**: To evaluate the issue systematically, the authors developed the ConsisEval benchmark, which covers three domains: code, mathematics, and instruction following. Each entry contains a pair of problems with a strict order of difficulty (one easy, one harder).
4. **Evaluation Metrics**: The paper proposes a new metric, the Consistency Score (CS), to quantify how consistently a model solves the easy problems. A Relative Consistency Score (RCS) is also introduced to analyze how much consistency could improve at a fixed level of capability.
5. **Experimental Results**: Experiments across a variety of LLMs show that GPT-4 achieves the highest consistency score, 92.2%, but is still inconsistent on specific questions. More capable models generally show higher consistency, though exceptions exist.
6. **Further Analysis**: The paper also analyzes how the difficulty of training data affects consistency, finding that fine-tuning on harder data improves it. Using harder examples in in-context learning improves consistency as well.

In summary, the paper systematically studies hard-to-easy inconsistency in LLMs, proposes evaluation methods for it, and provides experimental analysis that can serve as a reference for further research.
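
To make the Consistency Score (CS) described above more concrete, here is a minimal Python sketch. It rests on assumptions not stated in this summary: that each question has a single pass/fail outcome, and that CS can be estimated as the fraction of pairs whose easy question is solved among the pairs whose hard question is solved. The names `PairResult` and `consistency_score` are hypothetical, and the paper's exact definition may differ (for example, by working with sampled answer probabilities).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PairResult:
    """Outcome for one benchmark entry: an easy question paired with a harder one.

    Hypothetical structure for illustration; not the benchmark's actual schema.
    """
    easy_correct: bool
    hard_correct: bool


def consistency_score(results: Sequence[PairResult]) -> Optional[float]:
    """Estimate how often the easy question is solved, given the hard one is solved.

    Assumed simplification: one deterministic pass/fail outcome per question.
    Returns None when no hard question was solved, since the ratio is undefined.
    """
    hard_solved = [r for r in results if r.hard_correct]
    if not hard_solved:
        return None
    easy_also_solved = sum(r.easy_correct for r in hard_solved)
    return easy_also_solved / len(hard_solved)


# Toy data: the hard question is solved in three pairs,
# and the easy question is also solved in two of those three.
results = [
    PairResult(easy_correct=True, hard_correct=True),
    PairResult(easy_correct=False, hard_correct=True),   # hard-to-easy inconsistency
    PairResult(easy_correct=True, hard_correct=False),
    PairResult(easy_correct=True, hard_correct=True),
    PairResult(easy_correct=False, hard_correct=False),
]

score = consistency_score(results)
print(f"Estimated consistency score: {score:.3f}")
```

Under this reading, pairs where the hard question is failed do not enter the score at all, which is why a model's overall accuracy and its consistency can diverge.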