Cutting Through the Noise: Boosting LLM Performance on Math Word Problems

Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra
2024-10-24
Abstract: Large Language Models (LLMs) excel at various tasks, including solving math word problems (MWPs), but struggle with real-world problems containing irrelevant information. To address this, we propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables. We introduce a dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to distraction by numerical noise, resulting in an average relative performance drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2, Mistral) on the adversarial samples from our dataset. Fine-tuning on adversarial training instances improves performance on adversarial MWPs by ~8%, indicating increased robustness to noise and improved ability to identify relevant data for reasoning. Finally, to assess the generalizability of our prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark. LLMs continue to struggle when faced with adversarial information, reducing performance by up to 6%.
Computation and Language
What problem does this paper attempt to address?
This paper addresses a weakness of large language models (LLMs) on math word problems (MWPs): performance drops significantly when a problem contains irrelevant information. Specifically:

1. **Problem background**:
   - LLMs perform well on math word problems but poorly on real-world problems that contain irrelevant information.
   - Existing math datasets mostly contain simplified problems in which every variable and number bears directly on the answer, whereas real-world problems often include irrelevant details that distract the model and degrade its reasoning.
2. **Research objectives**:
   - Propose a prompting framework that generates adversarial variants of MWPs by adding irrelevant variables, in order to test and improve robustness to noise.
   - Introduce a new dataset, PROBLEMATHIC, containing both adversarial and non-adversarial MWPs, for evaluating model performance.
   - Improve the models' ability to identify relevant data and reason correctly by fine-tuning LLMs (Llama-2, Mistral) on adversarial samples.
3. **Main findings**:
   - LLMs show an average relative performance drop of about 26% on adversarial MWPs.
   - Fine-tuning on adversarial samples improves performance on adversarial MWPs by about 8%, indicating increased robustness to noise.
4. **Contributions**:
   - Introduced the PROBLEMATHIC dataset, demonstrating the sensitivity of LLMs to irrelevant numerical information.
   - Proposed a prompting framework for generating adversarial variants of existing MWPs, and showed that fine-tuning on these samples improves model performance.
   - Created GSM-8K-Adv, an adversarial variant of the GSM-8K benchmark, to assess how well the prompting framework generalizes; LLMs continue to struggle on it, with performance reduced by up to 6%.

### Summary

This paper systematically studies how LLMs handle math word problems that contain numerical noise. It introduces new datasets and a prompting framework for generating adversarial problems, and shows that fine-tuning on adversarial samples improves robustness to irrelevant information.
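
To make the two central ideas concrete, here is a minimal sketch of what an adversarial variant and a relative performance drop look like. It is not the authors' prompting framework, which uses an LLM to generate the variants; the `WordProblem` class, the `add_irrelevant_variable` and `relative_drop` helpers, the example problem, and the accuracy figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WordProblem:
    text: str      # problem statement, ending with the question sentence
    answer: float  # gold answer


def add_irrelevant_variable(problem: WordProblem, distractor: str) -> WordProblem:
    """Return a variant with one extra sentence carrying an irrelevant number.

    The distractor is placed just before the final sentence (assumed to be
    the question), and the gold answer is left unchanged.
    """
    head, sep, question = problem.text.rpartition(". ")
    if not sep:
        # Single-sentence problem: put the distractor in front of it.
        new_text = f"{distractor} {problem.text}"
    else:
        new_text = f"{head}. {distractor} {question}"
    return WordProblem(text=new_text, answer=problem.answer)


def relative_drop(clean_accuracy: float, adversarial_accuracy: float) -> float:
    """Relative performance drop: the fraction of clean accuracy that is lost."""
    return (clean_accuracy - adversarial_accuracy) / clean_accuracy


if __name__ == "__main__":
    clean = WordProblem(
        text="Sam has 5 apples. He buys 3 more apples. "
             "How many apples does Sam have now?",
        answer=8,
    )
    noisy = add_irrelevant_variable(clean, "His sister owns 7 pencils.")
    print(noisy.text)

    # Made-up accuracies (in percent), not results from the paper.
    print(relative_drop(80.0, 60.0))  # 0.25, i.e. a 25% relative drop
```

The distinction between relative and absolute drop matters when reading the reported figures: in the made-up example, accuracy falls by 20 percentage points, which is a 25% relative drop.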