Abstract: Fuzzy reasoning is vital due to the frequent use of imprecise information in daily contexts. However, the ability of current large language models (LLMs) to handle such reasoning remains largely uncharted. In this paper, we introduce a new benchmark, FRoG, for fuzzy reasoning, featuring real-world mathematical word problems that incorporate generalized quantifiers. Our experimental findings reveal that fuzzy reasoning continues to pose significant challenges for LLMs. Moreover, we find that existing methods designed to enhance reasoning do not consistently improve performance in tasks involving fuzzy logic. Additionally, our results show an inverse scaling effect in the performance of LLMs on FRoG. Interestingly, we also demonstrate that strong mathematical reasoning skills are not necessarily indicative of success on our benchmark.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper explores how well large language models (LLMs) handle fuzzy reasoning. Specifically, the authors introduce a new benchmark, FRoG (Fuzzy Reasoning of Generalized Quantifiers), which consists of real-world mathematical word problems that incorporate generalized quantifiers (GQs) such as "few" and "most". These terms introduce a degree of fuzziness, so solving the problems requires fuzzy reasoning.

### Main Research Questions

1. **Effectiveness of Existing Reasoning Enhancement Methods on FRoG**: Evaluate whether existing reasoning enhancement methods (such as math-specific fine-tuning, code-specific fine-tuning, and general alignment fine-tuning) can effectively improve LLMs' performance on FRoG.
2. **Applicability of Scaling Laws on FRoG**: Investigate whether the performance of LLMs on FRoG follows scaling laws as the number of model parameters increases.
3. **Transferability of Strong Mathematical Reasoning to Fuzzy Reasoning**: Study whether LLMs with strong mathematical reasoning capabilities also perform well on fuzzy reasoning tasks.

### Background and Motivation

Currently, most evaluations of LLMs' reasoning capabilities focus on precise mathematical reasoning tasks, such as the GSM8K and MATH datasets. However, many decision-making processes and knowledge expressions in the real world are fuzzy, involving uncertainty and perceptual data. Studying how LLMs perform on fuzzy reasoning tasks is therefore important.

### Methods

1. **Constructing the FRoG Benchmark**: Collect mathematical word problems involving percentages from the GSM8K and MathQA datasets, and replace specific percentage values with generalized quantifiers to generate multiple-choice questions.
2. **Experimental Setup**: Evaluate the performance of several open-source LLMs on FRoG, including Llama-2, CodeLlama, Qwen-1.5, Tulu-2, WizardLM, WizardMath, and Yi-Chat.
3. **Evaluation Metrics**: Primarily assess the Mask accuracy of models at different task difficulties and analyze the performance of different models on FRoG-Easy and FRoG-Hard.

### Experimental Results

1. **Overall Results**: All models generally showed low accuracy on FRoG, ranging from 0.05 to 0.45, indicating that fuzzy reasoning is a challenge for current LLMs.
2. **Effectiveness of Reasoning Enhancement Methods**: Math-specific fine-tuning and code-specific fine-tuning did not significantly improve the models' performance on FRoG, and the effects of general alignment fine-tuning were inconsistent.
3. **Applicability of Scaling Laws**: More than half of the model families exhibited an inverse scaling effect on FRoG, where increasing the number of model parameters led to a decrease in performance.
4. **Transferability of Mathematical Reasoning**: Models with strong mathematical reasoning capabilities did not necessarily perform well on fuzzy reasoning tasks, indicating that precise reasoning and fuzzy reasoning are two different abilities.

### Conclusion

By introducing the FRoG benchmark, this paper reveals the current shortcomings of LLMs in handling fuzzy reasoning tasks. The study finds that existing reasoning enhancement methods and scaling laws do not fully apply to fuzzy reasoning tasks, and that strong mathematical reasoning capabilities do not necessarily transfer to them. These results provide a reference point for future work on improving LLMs' fuzzy reasoning capabilities.
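### Illustrative Sketch

The benchmark construction and the accuracy metric described under Methods can be illustrated with a small Python sketch. It is not the authors' pipeline: the quantifier bands in `QUANTIFIER_BANDS`, the `[MASK]` placeholder, and the helper names `build_question` and `mask_accuracy` are all assumptions made for illustration, and the percentage cut-offs are invented rather than taken from the paper.

```python
import re
from dataclasses import dataclass

# Hypothetical quantifier bands in percent: [low, high).
# These cut-offs are illustrative assumptions, not values from the paper.
QUANTIFIER_BANDS = {
    "few": (0, 25),
    "some": (25, 50),
    "most": (50, 90),
    "almost all": (90, 101),
}

# Matches the first percentage mention, e.g. "60%" or "12.5 %".
PERCENT_RE = re.compile(r"\d+(?:\.\d+)?\s?%")


@dataclass
class MaskedQuestion:
    text: str           # problem text with the percentage replaced by [MASK]
    choices: list[str]  # candidate generalized quantifiers
    answer: str         # quantifier whose band contains the original value


def quantifier_for(percent: float) -> str:
    """Return the quantifier whose (assumed) band contains `percent`."""
    for name, (low, high) in QUANTIFIER_BANDS.items():
        if low <= percent < high:
            return name
    raise ValueError(f"percentage out of range: {percent}")


def build_question(problem: str) -> MaskedQuestion:
    """Mask the first percentage in a word problem and attach quantifier choices."""
    match = PERCENT_RE.search(problem)
    if match is None:
        raise ValueError("problem contains no percentage")
    percent = float(match.group().rstrip("%").strip())
    masked = problem[: match.start()] + "[MASK]" + problem[match.end():]
    return MaskedQuestion(masked, list(QUANTIFIER_BANDS), quantifier_for(percent))


def mask_accuracy(predictions: list[str], questions: list[MaskedQuestion]) -> float:
    """Fraction of questions where the predicted quantifier equals the answer."""
    if not questions:
        return 0.0
    correct = sum(p == q.answer for p, q in zip(predictions, questions))
    return correct / len(questions)


if __name__ == "__main__":
    q = build_question("A class has 40 students and 60% of them passed.")
    print(q.text)     # A class has 40 students and [MASK] of them passed.
    print(q.choices)  # ['few', 'some', 'most', 'almost all']
    print(mask_accuracy(["most", "few"], [q, q]))  # 0.5
```

In a real evaluation, the predictions would come from prompting a model with the masked problem and the choices; here they are hard-coded to keep the sketch self-contained.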

FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models

CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning

LLMs for Relational Reasoning: How Far are We?

Towards Generalizable and Faithful Logic Reasoning over Natural Language via Resolution Refutation

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding

Towards Reasoning in Large Language Models: A Survey

Analysis of the Reasoning with Redundant Information Provided Ability of Large Language Models

Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs

Enhancing Quantitative Reasoning Skills of Large Language Models Through Dimension Perception

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

GLoRE: Evaluating Logical Reasoning of Large Language Models

Can Large Language Models Reason? A Characterization via 3-SAT

Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning

Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

When LLMs Meet Cunning Questions: A Fallacy Understanding Benchmark for Large Language Models

Case Study: Testing Model Capabilities in Some Reasoning Tasks

Evaluating Mathematical Reasoning Beyond Accuracy