Abstract: Interestingly, LLMs still struggle with some basic tasks that humans find trivial, e.g., counting the number of r's in the word "strawberry". There are several popular conjectures (e.g., tokenization, architecture, and training data) about why LLMs fail at simple word-based counting problems, all sharing the belief that such failure stems from model pretraining and is hence probably inevitable during deployment. In this paper, we carefully design multiple evaluation settings to investigate the validity of prevalent conjectures. Meanwhile, we measure the transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks. Although specialized LLMs suffer from counting problems as well, we find conjectures about the inherent deficiency of LLMs invalid and further seek opportunities to elicit knowledge and capabilities from LLMs that are beneficial to counting tasks. Compared with strategies such as finetuning and in-context learning that are commonly adopted to enhance performance on new or challenging tasks, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks and respond more accurately. We hope our conjecture validation design can provide insights into the study of future critical failure modes of LLMs. Based on the challenges in transferring advanced capabilities to much simpler tasks, we call for more attention to model capability acquisition and evaluation. We also highlight the importance of cultivating consciousness of "reasoning before responding" during model pretraining.

What problem does this paper attempt to address?

The problem this paper attempts to address is the poor performance of large language models (LLMs) in handling some basic tasks that are very simple for humans, particularly the character counting problem. For example, even the most advanced LLMs like GPT-4o can make mistakes when counting the number of characters "r" in the word "strawberry." The research community has proposed several hypotheses for this phenomenon, including tokenization algorithms, model architecture, and training data. The paper verifies the validity of these hypotheses by designing various evaluation settings and explores how to leverage the advanced mathematical and programming reasoning capabilities of LLMs to improve their performance on simple character counting tasks.

Specifically, the main contributions of the paper include:

1. **Hypothesis Verification**: By designing various evaluation settings, the paper verifies popular hypotheses about the poor performance of LLMs on character counting tasks, such as tokenization algorithms, lack of character-level training, and model size.
2. **Evaluation of Advanced Models**: The paper evaluates the performance of specially trained mathematical and programming models on simple character counting tasks and finds that these models do not significantly outperform the base models.
3. **Reasoning Strategies**: The paper explores the effectiveness of different reasoning strategies (such as chain of thought, self-consistency, self-refinement, and tree of thought) in improving LLMs' performance on character counting tasks and finds that reasoning strategies are among the most effective methods.

The paper hopes that these research findings can provide guidance for future research on LLMs, particularly in the areas of model capability acquisition and comprehensive capability evaluation.
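To make the counting task and two of the reasoning strategies named above more concrete, here is a minimal Python sketch. It is not the paper's code or evaluation harness. The function names (`count_char`, `spell_out_and_count`, `majority_vote`) and the sampled answers are hypothetical, chosen only for illustration. The first function computes the ground-truth answer, the second imitates the letter-by-letter trace a chain-of-thought response might produce, and the third shows the majority-vote aggregation commonly used for self-consistency.

```python
from collections import Counter
from typing import List, Tuple


def count_char(word: str, char: str) -> int:
    """Ground-truth count of `char` in `word` (case-insensitive)."""
    return word.lower().count(char.lower())


def spell_out_and_count(word: str, char: str) -> Tuple[List[str], int]:
    """Imitate a chain-of-thought trace: visit each letter in order,
    record whether it matches, and keep a running tally."""
    steps: List[str] = []
    total = 0
    for position, letter in enumerate(word, start=1):
        hit = letter.lower() == char.lower()
        total += int(hit)
        steps.append(f"{position}. {letter} -> {'match' if hit else 'no match'}")
    return steps, total


def majority_vote(answers: List[int]) -> int:
    """Self-consistency style aggregation: return the most frequent answer
    among several independently sampled responses."""
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    print(count_char("strawberry", "r"))

    steps, total = spell_out_and_count("strawberry", "r")
    print("\n".join(steps))
    print(total)

    # Hypothetical sampled model answers, NOT results from the paper.
    sampled_answers = [2, 3, 3, 2, 3]
    print(majority_vote(sampled_answers))
```

The explicit trace in `spell_out_and_count` mirrors the idea of "reasoning before responding": the answer is derived from character-level steps rather than produced in one shot. In an actual evaluation, the list passed to `majority_vote` would come from repeated model samples, which this sketch does not attempt to reproduce.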

LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems

Novice Learner and Expert Tutor: Evaluating Math Reasoning Abilities of Large Language Models with Misconceptions

Easy Problems That LLMs Get Wrong

Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From A Psychological Perspective

LLM Reasoning Engine: Specialized Training for Enhanced Mathematical Reasoning

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Not All LLM Reasoners Are Created Equal

Are Your LLMs Capable of Stable Reasoning?

From Blind Solvers to Logical Thinkers: Benchmarking LLMs' Logical Integrity on Faulty Mathematical Problems

Large Language Models Are Unconscious of Unreasonability in Math Problems

Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving

Interpreting and Improving Large Language Models in Arithmetic Calculation

Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology

Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems

Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification

LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

LLMs for Relational Reasoning: How Far are We?

Are LLMs Rigorous Logical Reasoner? Empowering Natural Language Proof Generation with Contrastive Stepwise Decoding

Can LLMs Compute with Reasons?

Arithmetic Reasoning with LLM: Prolog Generation & Permutation