Over-Reasoning and Redundant Calculation of Large Language Models

Cheng-Han Chiang, Hung-yi Lee
2024-03-20
Abstract: Large language models (LLMs) can solve problems step-by-step. While this chain-of-thought (CoT) reasoning boosts LLMs' performance, it is unclear if LLMs know when to use CoT and whether CoT is always necessary to answer the question. This paper shows that LLMs tend to generate redundant calculations and reasoning on a manually constructed math QA dataset, GSM8K-Zero. GSM8K-Zero is constructed such that the questions can be answered without any calculations, but LLMs, including Llama-2 models and Claude-2, tend to generate lengthy and unnecessary calculations to answer the questions. We also conduct experiments to explain why LLMs generate redundant calculations and reasoning. GSM8K-Zero is publicly available at
Computation and Language
What problem does this paper attempt to address?
This paper examines over-reasoning and redundant calculation in large language models (LLMs) when they answer questions. Although chain-of-thought (CoT) reasoning improves LLM performance, it is unclear whether LLMs know when to use CoT and whether CoT is always necessary. The authors find that even on GSM8K-Zero, a math QA dataset whose questions need no calculation, LLMs still generate lengthy and unnecessary calculations. Their experiments show that LLMs tend to give verbose answers, which can introduce errors and mislead users into thinking the problem is more complex than it is. The paper also shows that GPT-4 and ChatGPT, which are used to train reward models, prefer long answers that include redundant calculations, even when those answers are incorrect. The paper concludes that LLMs may not accurately determine when step-by-step reasoning is needed and suggests that future research focus on reducing redundancy in LLM outputs.
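To make the idea of a "redundant calculation" concrete, here is a minimal sketch of how one might flag answers that contain arithmetic even though the question needs none. It is an illustration only, not the paper's detection method: the regular expression, the function names `contains_calculation` and `redundancy_rate`, and the sample answers are all assumptions made for this example.

```python
import re

# Hypothetical heuristic (not the paper's method): a number, an arithmetic
# operator, then another number, e.g. "12 + 7", "3 x 4", "10 / 2".
# It will also match things like dates ("2024-03"), so treat hits as rough signals.
_ARITHMETIC = re.compile(r"\d+(?:\.\d+)?\s*[+\-*/×x]\s*\d+(?:\.\d+)?")


def contains_calculation(answer: str) -> bool:
    """Return True if the answer text contains an arithmetic expression."""
    return _ARITHMETIC.search(answer) is not None


def redundancy_rate(answers: list[str]) -> float:
    """Fraction of answers that contain at least one arithmetic expression."""
    if not answers:
        return 0.0
    return sum(contains_calculation(a) for a in answers) / len(answers)


# Invented sample answers to a question whose answer is stated in the question itself.
sample_answers = [
    "Tom has 5 apples.",
    "He bought 3 + 2 = 5 apples, so Tom has 5 apples.",
]
rate = redundancy_rate(sample_answers)
```

On a dataset where no question requires calculation, any answer flagged this way would count as containing a redundant calculation under this rough heuristic; a careful evaluation would need a more robust check than a single regular expression.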