FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in Large Language Models

Yiyuan Li, Shichao Sun, Pengfei Liu
2024-07-03
Abstract: Fuzzy reasoning is vital due to the frequent use of imprecise information in daily contexts. However, the ability of current large language models (LLMs) to handle such reasoning remains largely uncharted. In this paper, we introduce a new benchmark, FRoG, for fuzzy reasoning, featuring real-world mathematical word problems that incorporate generalized quantifiers. Our experimental findings reveal that fuzzy reasoning continues to pose significant challenges for LLMs. Moreover, we find that existing methods designed to enhance reasoning do not consistently improve performance in tasks involving fuzzy logic. Additionally, our results show an inverse scaling effect in the performance of LLMs on FRoG. Interestingly, we also demonstrate that strong mathematical reasoning skills are not necessarily indicative of success on our benchmark.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper explores how well large language models (LLMs) handle fuzzy reasoning. The authors introduce a new benchmark, FRoG (Fuzzy Reasoning of Generalized Quantifiers), built from real-world mathematical word problems that incorporate generalized quantifiers (GQs) such as "few" and "most." Because these terms are imprecise, solving the problems requires fuzzy rather than exact reasoning.

### Main Research Questions

1. **Effectiveness of Existing Reasoning Enhancement Methods on FRoG**: Do existing reasoning enhancement methods (math-specific fine-tuning, code-specific fine-tuning, and general alignment fine-tuning) improve LLMs' performance on FRoG?
2. **Applicability of Scaling Laws on FRoG**: Does performance on FRoG improve as the number of model parameters increases, as scaling laws would predict?
3. **Transferability of Strong Mathematical Reasoning to Fuzzy Reasoning**: Do LLMs with strong mathematical reasoning capabilities also perform well on fuzzy reasoning tasks?

### Background and Motivation

Most current evaluations of LLMs' reasoning capabilities focus on precise mathematical reasoning tasks, such as the GSM8K and MATH datasets. However, many real-world decision-making processes and expressions of knowledge are fuzzy, involving uncertainty and perceptual data. Studying how LLMs perform on fuzzy reasoning tasks is therefore important.

### Methods

1. **Constructing the FRoG Benchmark**: Collect mathematical word problems involving percentages from the GSM8K and MathQA datasets, and replace the specific percentage values with generalized quantifiers to generate multiple-choice questions.
2. **Experimental Setup**: Evaluate several open-source LLMs on FRoG, including Llama-2, CodeLlama, Qwen-1.5, Tulu-2, WizardLM, WizardMath, and Yi-Chat.
3. **Evaluation Metrics**: Primarily assess the Mask accuracy of models at different task difficulties, and compare the performance of different models on FRoG-Easy and FRoG-Hard.

### Experimental Results

1. **Overall Results**: All models showed low accuracy on FRoG, ranging from 0.05 to 0.45, indicating that fuzzy reasoning remains a challenge for current LLMs.
2. **Effectiveness of Reasoning Enhancement Methods**: Math-specific and code-specific fine-tuning did not significantly improve performance on FRoG, and the effects of general alignment fine-tuning were inconsistent.
3. **Applicability of Scaling Laws**: More than half of the model families exhibited an inverse scaling effect on FRoG, where increasing the number of model parameters led to lower performance.
4. **Transferability of Mathematical Reasoning**: Models with strong mathematical reasoning capabilities did not necessarily perform well on fuzzy reasoning tasks, suggesting that precise reasoning and fuzzy reasoning are distinct abilities.

### Conclusion

By introducing the FRoG benchmark, this paper reveals current shortcomings of LLMs in fuzzy reasoning. Existing reasoning enhancement methods and scaling laws do not consistently carry over to fuzzy reasoning tasks, and strong mathematical reasoning does not necessarily transfer to them. These results offer a reference point for future work on improving LLMs' fuzzy reasoning capabilities.
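
### Illustrative Sketch

To make the benchmark construction and the accuracy metric described under Methods more concrete, the following is a minimal sketch, not the authors' implementation. It masks the first percentage mention in a word problem, offers generalized quantifiers as multiple-choice options, and scores predictions by exact match. The quantifier-to-percentage mapping in `QUANTIFIER_CENTERS`, the `[MASK]` token, and the function names `build_item` and `mask_accuracy` are all assumptions introduced for illustration; the summary does not specify how the paper maps percentages to quantifiers.

```python
import re

# Hypothetical mapping from quantifier to a representative percentage.
# These values are illustrative assumptions, not taken from the paper.
QUANTIFIER_CENTERS = {
    "few": 10.0,
    "some": 30.0,
    "moderate amount": 50.0,
    "most": 80.0,
}

# Matches percentage mentions such as "80%", "12.5%" or "80 %".
PERCENT_PATTERN = re.compile(r"\d+(?:\.\d+)?\s?%")


def nearest_quantifier(percent):
    """Return the quantifier whose assumed center is closest to `percent`."""
    return min(
        QUANTIFIER_CENTERS,
        key=lambda q: abs(QUANTIFIER_CENTERS[q] - percent),
    )


def build_item(problem):
    """Mask the first percentage in `problem` and build a multiple-choice item.

    Returns None when the problem contains no percentage mention.
    """
    match = PERCENT_PATTERN.search(problem)
    if match is None:
        return None
    percent = float(match.group().rstrip("%").strip())
    masked = problem[: match.start()] + "[MASK]" + problem[match.end():]
    return {
        "question": masked,
        "options": sorted(QUANTIFIER_CENTERS),
        "answer": nearest_quantifier(percent),
    }


def mask_accuracy(predictions, answers):
    """Fraction of items where the predicted quantifier equals the answer."""
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    item = build_item("A store sold 80% of its 50 apples.")
    print(item["question"])
    print(item["options"], "->", item["answer"])
    print(mask_accuracy(["most", "few"], ["most", "some"]))
```

Picking the nearest center is only one possible way to assign a gold quantifier; a real benchmark would more plausibly derive the mapping from human judgments of quantifier strength, so the numbers above should be treated as placeholders.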