Abstract: With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly impacting their effectiveness in practical applications. To systematically understand the robustness of LLMs, we present RUPBench, a comprehensive benchmark designed to evaluate LLM robustness across diverse reasoning tasks. Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning, and introduces nine types of textual perturbations at lexical, syntactic, and semantic levels. By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns. Our findings highlight that larger models tend to exhibit greater robustness to perturbations. Additionally, common error types are identified through manual inspection, revealing specific challenges faced by LLMs in different reasoning contexts. This work provides insights into areas where LLMs need further improvement to handle diverse and noisy inputs effectively.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the robustness of large language models (LLMs) against various perturbations in practical applications. Although LLMs perform well on many tasks, they often perform poorly when faced with adversarial inputs, which severely affects their effectiveness and reliability in real-world scenarios. Specifically, the paper approaches this problem through the following points:

1. **Systematic Robustness Evaluation**: Existing evaluation methods usually focus on specific tasks or specific types of perturbations and lack a comprehensive evaluation framework. The paper proposes a benchmark named RUPBench, which covers multiple reasoning tasks and different types of text perturbations to systematically evaluate the robustness of LLMs.
2. **Diversified Perturbation Types**: RUPBench introduces nine types of text perturbations at the lexical, syntactic, and semantic levels to simulate real-world input variations. These perturbations test the model's performance when faced with different forms of noise and erroneous inputs.
3. **In-depth Error Pattern Analysis**: By analyzing in detail the performance of existing LLMs (such as GPT-4o, Llama3, Phi-3, and Gemma) on both original and perturbed datasets, the paper reveals how robust the models are across reasoning tasks and which error patterns are common. This helps identify specific challenges the models face in particular reasoning contexts.
4. **Guiding Future Research**: Through manual inspection of erroneous predictions, the paper identifies common error types, such as context misunderstanding and knowledge gaps. These findings provide directions for future research, emphasizing the need for targeted strategies to address these weaknesses and improve the robustness and reliability of the models.

### Main Contributions

1. **Introduction of RUPBench**: A comprehensive benchmark framework is proposed, comprising 15 reasoning datasets and nine types of text perturbations, for a total of 365,580 perturbed samples.
2. **Evaluation of Multiple LLMs**: Extensive experiments are conducted on several state-of-the-art LLMs, evaluating their performance on both original and perturbed datasets and providing a detailed robustness analysis.
3. **Identification of Common Error Types**: Through manual inspection of erroneous predictions, common error types are identified, providing guidance for future research and highlighting specific areas that need improvement.

### Summary

By introducing the RUPBench benchmark, the paper systematically evaluates the robustness of LLMs against various types of perturbations and reveals common error patterns through detailed analysis. These findings not only help in understanding the limitations of current LLMs but also provide clear directions for future improvements.
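To make the original-versus-perturbed comparison concrete, here is a minimal Python sketch of the general idea. It is not RUPBench's implementation: this summary does not describe the benchmark's actual perturbation code or robustness metric, so the function names (`swap_adjacent_chars`, `accuracy`, `relative_drop`), the character-swap typo, and the relative-drop formula are all illustrative assumptions.

```python
import random


def swap_adjacent_chars(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Hypothetical lexical perturbation: swap two adjacent inner letters.

    Only purely alphabetic words longer than three characters are touched,
    and the first and last letters are always kept in place.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    perturbed_words = []
    for word in text.split(" "):
        if len(word) > 3 and word.isalpha() and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)  # index of an inner letter
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        perturbed_words.append(word)
    return " ".join(perturbed_words)


def accuracy(model, examples) -> float:
    """Fraction of (question, answer) pairs the model answers exactly."""
    correct = sum(model(question) == answer for question, answer in examples)
    return correct / len(examples)


def relative_drop(acc_original: float, acc_perturbed: float) -> float:
    """Assumed robustness measure: accuracy lost relative to the original."""
    if acc_original == 0:
        raise ValueError("original accuracy must be non-zero")
    return (acc_original - acc_perturbed) / acc_original


# Toy stand-in for an LLM: it only recognizes the exact original wording.
KNOWN_ANSWERS = {"What is two plus three?": "5"}


def toy_model(question: str) -> str:
    return KNOWN_ANSWERS.get(question, "unknown")


original = [("What is two plus three?", "5")]
perturbed = [(swap_adjacent_chars(q, rate=1.0), a) for q, a in original]

drop = relative_drop(accuracy(toy_model, original),
                     accuracy(toy_model, perturbed))
print(perturbed[0][0], drop)
```

The toy model is deliberately brittle, so any typo breaks it. A real evaluation would replace `toy_model` with calls to an LLM and apply the same comparison per perturbation type and per dataset.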

RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

Noisy Exemplars Make Large Language Models More Robust: A Domain-Agnostic Behavioral Analysis

Assessing and Enhancing the Robustness of Large Language Models with Task Structure Variations for Logical Reasoning

SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts

Benchmarking Large Language Models for Math Reasoning Tasks

PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions

PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion

RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning

Robustness of LLMs to Perturbations in Text

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Reasoning Robustness of LLMs to Adversarial Typographical Errors

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

Investigating the Robustness of LLMs on Math Word Problems

Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets

When LLMs Meet Cunning Questions: A Fallacy Understanding Benchmark for Large Language Models

NLPBench: Evaluating Large Language Models on Solving NLP Problems

$\text{R}^2$-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations