Abstract: Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and improved ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to 6%.
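
To make "adding irrelevant variables" concrete, here is a minimal Python sketch of what an adversarial variant looks like: the problem text gains a numerical distractor while the correct answer stays the same. This is an illustration only and not the paper's prompting framework, which uses an LLM to generate the variants. The `WordProblem` class, the `add_irrelevant_sentence` helper, and the sample problem are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordProblem:
    """Hypothetical container for a math word problem and its gold answer."""
    text: str
    answer: float


def add_irrelevant_sentence(problem: WordProblem, distractor: str) -> WordProblem:
    """Insert a distractor sentence just before the final question.

    The distractor carries a number that plays no role in the solution,
    so the gold answer is copied over unchanged.
    """
    # Split off the last sentence, assumed here to be the question.
    body, sep, question = problem.text.rpartition(". ")
    if sep:
        new_text = f"{body}. {distractor} {question}"
    else:
        # Single-sentence problem: prepend the distractor instead.
        new_text = f"{distractor} {problem.text}"
    return WordProblem(text=new_text, answer=problem.answer)


# Made-up example problem (not taken from PROBLEMATHIC).
original = WordProblem(
    text="Sam has 5 apples. He buys 3 more. How many apples does Sam have?",
    answer=8,
)
adversarial = add_irrelevant_sentence(original, "His sister is 12 years old.")

print(adversarial.text)
print(adversarial.answer)
```

A solver that is robust to numerical noise should return the same answer for `original` and `adversarial`, since the sister's age never enters the computation.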


### What problem does this paper attempt to solve?

This paper addresses a weakness of large language models (LLMs) on math word problems (MWPs): their performance drops significantly when a problem contains irrelevant information. Specifically:

1. **Problem background**:
   - LLMs perform well on math word problems but poorly on real-world problems that contain irrelevant information.
   - Existing math datasets usually contain simplified problems in which every variable and number is directly relevant to the solution. Real-world math word problems often include irrelevant details that distract the model and impair its reasoning.

2. **Research objectives**:
   - Propose a prompting framework that generates adversarial variants of math word problems (adversarial MWPs) by adding irrelevant variables, in order to test and improve the model's robustness to noise.
   - Introduce a new dataset, PROBLEMATHIC, containing both adversarial and non-adversarial math word problems, for evaluating model performance.
   - Improve the model's ability to identify relevant data and reason correctly by fine-tuning LLMs on adversarial samples.

3. **Main findings**:
   - On adversarial math word problems, LLM performance drops by about 26% on average (relative drop).
   - Fine-tuning on adversarial samples improves performance on adversarial problems by about 8%, indicating increased robustness to noise.

4. **Contributions**:
   - Introduced the PROBLEMATHIC dataset and demonstrated the sensitivity of LLMs to irrelevant numerical information.
   - Proposed a prompting framework for generating adversarial variants of existing math word problems, and showed that fine-tuning on these samples improves model performance.
   - Created GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark, to assess the generalizability of the prompting framework.

### Summary

This paper systematically studies how LLMs handle noisy math word problems by introducing new datasets and a prompting framework, and it proposes fine-tuning on adversarial samples as a way to improve robustness and reasoning.
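
The abstract reports the roughly 26% figure as an average *relative* performance drop. The Python sketch below shows how a relative drop differs from an absolute one. The accuracy values are made up for illustration and are not results from the paper; the function name `relative_drop` is likewise an assumption.

```python
def relative_drop(clean_acc: float, adv_acc: float) -> float:
    """Relative performance drop: the lost accuracy as a fraction of clean accuracy."""
    if clean_acc <= 0:
        raise ValueError("clean accuracy must be positive")
    return (clean_acc - adv_acc) / clean_acc


# Hypothetical accuracies, NOT taken from the paper.
clean_accuracy = 0.80        # accuracy on non-adversarial problems
adversarial_accuracy = 0.60  # accuracy on adversarial problems

absolute = clean_accuracy - adversarial_accuracy
relative = relative_drop(clean_accuracy, adversarial_accuracy)

print(f"absolute drop: {absolute:.1%}")
print(f"relative drop: {relative:.1%}")
```

With these made-up numbers, a drop of 20 percentage points corresponds to a 25% relative drop, so the two measures should not be compared directly.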

Cutting Through the Noise: Boosting LLM Performance on Math Word Problems
