RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

Yuqing Wang, Yun Zhao
2024-06-17
Abstract: With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly impacting their effectiveness in practical applications. To systematically understand the robustness of LLMs, we present RUPBench, a comprehensive benchmark designed to evaluate LLM robustness across diverse reasoning tasks. Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning, and introduces nine types of textual perturbations at lexical, syntactic, and semantic levels. By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns. Our findings highlight that larger models tend to exhibit greater robustness to perturbations. Additionally, common error types are identified through manual inspection, revealing specific challenges faced by LLMs in different reasoning contexts. This work provides insights into areas where LLMs need further improvement to handle diverse and noisy inputs effectively.
Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the robustness of large language models (LLMs) to input perturbations in practical applications. Although LLMs perform well on many tasks, they often degrade when given adversarial inputs, which undermines their effectiveness and reliability in real-world scenarios. The paper approaches the problem in four ways:

1. **Systematic Robustness Evaluation**: Existing evaluation methods usually focus on specific tasks or specific types of perturbations and lack a comprehensive framework. The paper proposes a benchmark named RUPBench, which covers multiple reasoning tasks and multiple types of text perturbations to evaluate LLM robustness systematically.
2. **Diversified Perturbation Types**: RUPBench introduces nine types of text perturbations at the lexical, syntactic, and semantic levels to simulate real-world input variations. These perturbations test how a model performs when its input contains different forms of noise and errors.
3. **In-depth Error Pattern Analysis**: By analyzing the performance of existing LLMs (such as GPT-4o, Llama3, Phi-3, and Gemma) on both original and perturbed datasets, the paper characterizes model robustness across reasoning tasks and identifies common error patterns. This exposes the specific challenges the models face in particular reasoning contexts.
4. **Guiding Future Research**: Through manual inspection of erroneous predictions, the paper identifies common error types such as context misunderstanding and knowledge gaps. These findings point to directions for future research and to the need for targeted strategies that address these weaknesses and improve model robustness and reliability.

### Main Contributions

1. **Introduction of RUPBench**: A comprehensive benchmark is proposed that includes 15 reasoning datasets and nine types of text perturbations, yielding a total of 365,580 perturbed samples.
2. **Evaluation of Multiple LLMs**: Extensive experiments are conducted on several state-of-the-art LLMs, evaluating their performance on both original and perturbed datasets and providing a detailed robustness analysis.
3. **Identification of Common Error Types**: Manual inspection of erroneous predictions identifies common error types, which guide future research and highlight specific areas that need improvement.

### Summary

By introducing the RUPBench benchmark, the paper systematically evaluates the robustness of LLMs against various types of perturbations and reveals common error patterns through detailed analysis. These findings help explain the limitations of current LLMs and provide clear directions for future improvements.
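The evaluation setup summarized above, scoring a model on original inputs and again on perturbed copies, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `swap_adjacent_chars` is a hypothetical typo-style lexical perturbation (the summary does not name the nine perturbation types), `relative_drop` is one common way to express lost accuracy rather than a metric confirmed by the summary, and `toy_model` is a deliberately brittle stand-in for an LLM.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)


def swap_adjacent_chars(text: str, rate: float, rng: random.Random) -> str:
    """Hypothetical lexical perturbation: swap two inner letters in some words."""
    perturbed = []
    for word in text.split(" "):
        # Only touch purely alphabetic words with at least two inner letters.
        if len(word) > 3 and word.isalpha() and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)  # keeps first and last letter fixed
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        perturbed.append(word)
    return " ".join(perturbed)


def accuracy(model: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples the model answers exactly right."""
    return sum(model(q) == a for q, a in examples) / len(examples)


def relative_drop(acc_original: float, acc_perturbed: float) -> float:
    """Relative accuracy drop in percent (an assumed metric, see lead-in)."""
    if acc_original == 0:
        return 0.0
    return 100.0 * (acc_original - acc_perturbed) / acc_original


def toy_model(question: str) -> str:
    """Brittle stand-in for an LLM: relies on exact keyword matches."""
    if "capital" in question:
        return "Paris"
    if "seven" in question:
        return "10"
    return "unknown"


examples = [
    ("What is the capital of France", "Paris"),
    ("What is seven plus three", "10"),
]
rng = random.Random(0)  # fixed seed so the perturbed set is reproducible
perturbed = [(swap_adjacent_chars(q, 0.5, rng), a) for q, a in examples]

acc_orig = accuracy(toy_model, examples)
acc_pert = accuracy(toy_model, perturbed)
print(f"original={acc_orig:.2f} perturbed={acc_pert:.2f} "
      f"drop={relative_drop(acc_orig, acc_pert):.1f}%")
```

Reporting the drop relative to the original score, rather than the raw difference, makes models with different baseline accuracies easier to compare. A real harness would replace `toy_model` with calls to the LLM under test and run each perturbation type separately.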