Abstract: Large language models (LLMs) have significantly transformed the educational landscape. As current plagiarism detection tools struggle to keep pace with LLMs' rapid advancements, the educational community faces the challenge of assessing students' true problem-solving abilities in the presence of LLMs. In this work, we explore a new paradigm for ensuring fair evaluation -- generating adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by LLMs. Focusing on the domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause LLMs to produce incorrect answers by simply editing the numeric values in the problems. We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. We identify shared vulnerabilities among LLMs and propose a cost-effective approach to attack high-cost models. Additionally, we conduct automatic analysis to investigate the cause of failure, providing further insights into the limitations of LLMs.

What problem does this paper attempt to address?

The paper addresses the problem of fair educational assessment in the presence of large language models (LLMs), particularly for Math Word Problems (MWPs). As LLMs have advanced in natural language generation and problem solving, students can use them to complete assignments, which makes it hard for educators to assess students' actual problem-solving abilities. To tackle this, the paper proposes a new paradigm for fair assessment based on adversarial examples. Specifically, it generates adversarial math word problems by modifying the numerical values in the original problems. These examples retain the original problem's structure and difficulty but are intended to be unsolvable by LLMs. The main contributions of the paper include:

1. **Proposed Method**: Transforming MWPs into Python code and then using abstract syntax trees (ASTs) to make structured modifications, so that adversarial examples are generated in a controlled manner.
2. **Educational Constraints**: Defining a set of constraints to ensure that the generated problems remain logical and educationally valuable, such as preserving whether numbers are positive or negative, preserving integer properties, and keeping fractions within appropriate ranges.
3. **Generation Methods**: Proposing three generation methods (M1, M2, M3) to control the difficulty level of the generated problems. M3 is the strictest of the three and keeps the adversarial examples closest to the original problems in difficulty and coherence.
4. **Experimental Results**: Experiments on various open-source and closed-source LLMs show that even problems generated by the strictest method significantly reduce the models' problem-solving accuracy.

Additionally, the paper compares the effects of the different generation methods and analyzes the generality and transferability of the adversarial examples. In summary, the goal of this paper is to test and reveal the limitations of large language models in solving specific types of adversarial math word problems, thereby providing a new tool for educational assessment.
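
The pipeline summarized above (word problem to Python code, then AST-guided edits of the numeric values under constraints) can be illustrated with a minimal sketch. This is not the paper's implementation: the example problem, the names `NumberSwapper`, `same_kind`, and `generate_variant`, the value range, and the particular constraint check are all assumptions made for illustration, and the M1/M2/M3 variants are not reproduced here.

```python
import ast
import random

# Hypothetical Python rendering of a word problem such as:
# "Tom has 12 apples and buys 3 bags with 5 apples each. How many in total?"
PROBLEM_CODE = """
apples = 12
bags = 3
per_bag = 5
answer = apples + bags * per_bag
"""


class NumberSwapper(ast.NodeTransformer):
    """Replace each numeric literal with a new value of the same kind."""

    def __init__(self, rng, max_value=100):
        self.rng = rng
        self.max_value = max_value  # assumed upper bound, not from the paper

    def visit_Constant(self, node):
        value = node.value
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return node
        if isinstance(value, int):
            new_value = self.rng.randint(1, self.max_value)   # integers stay integers
        elif 0 < value < 1:
            new_value = round(self.rng.uniform(0.01, 0.99), 2)  # fractions stay in (0, 1)
        else:
            new_value = round(self.rng.uniform(1, self.max_value), 2)
        # A literal like -5 parses as UnaryOp(USub, Constant(5)), so replacing
        # the constant with a positive value keeps the sign of the literal.
        return ast.copy_location(ast.Constant(value=new_value), node)


def run(tree):
    """Execute the problem code and return the value bound to `answer`."""
    namespace = {}
    exec(compile(tree, "<mwp>", "exec"), namespace)
    return namespace["answer"]


def same_kind(a, b):
    """Assumed constraint: the answer keeps its sign and its integer-ness."""
    same_sign = (a > 0) == (b > 0)
    same_integrality = float(a).is_integer() == float(b).is_integer()
    return same_sign and same_integrality


def generate_variant(code, seed=0, max_tries=100):
    """Return (new_code, new_answer) for one constraint-satisfying variant."""
    rng = random.Random(seed)
    original_answer = run(ast.parse(code))
    for _ in range(max_tries):
        tree = NumberSwapper(rng).visit(ast.parse(code))
        ast.fix_missing_locations(tree)
        try:
            new_answer = run(tree)
        except ArithmeticError:  # e.g. a division by zero in other problems
            continue
        if same_kind(original_answer, new_answer):
            return ast.unparse(tree), new_answer  # ast.unparse needs Python 3.9+
    return None


if __name__ == "__main__":
    variant = generate_variant(PROBLEM_CODE, seed=1)
    if variant is not None:
        new_code, new_answer = variant
        print(new_code)
        print("ground-truth answer:", new_answer)
```

Because the variant is produced by editing the program rather than the text, the ground-truth answer comes for free from executing the modified code. The new numbers would then still need to be substituted back into the natural-language problem before it is given to students or to an LLM.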

Adversarial Math Word Problem Generation

LLM-Resistant Math Word Problem Generation via Adversarial Attacks

MathAttack: Attacking Large Language Models Towards Math Solving Ability

Investigating the Robustness of LLMs on Math Word Problems

MATHWELL: Generating Educational Math Word Problems Using Teacher Annotations

Cutting Through the Noise: Boosting LLM Performance on Math Word Problems

Solving Math Word Problems by Combining Language Models With Symbolic Solvers

What Makes Math Word Problems Challenging for LLMs?

AI-Assisted Generation of Difficult Math Questions

Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments

Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models

MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

Large Language Models Are Unconscious of Unreasonability in Math Problems

Universal and Transferable Adversarial Attacks on Aligned Language Models

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Why are NLP Models Fumbling at Elementary Math? A Survey of Deep Learning based Word Problem Solvers

Exploring the Adversarial Capabilities of Large Language Models

Novice Learner and Expert Tutor: Evaluating Math Reasoning Abilities of Large Language Models with Misconceptions

MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs