Abstract: The rapid growth in the scale and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. Beyond the pursuit of better performance and the avoidance of violent responses to certain prompts, much attention has been drawn to the robustness of LLMs as part of ensuring that they behave responsibly. However, existing evaluation methods mostly rely on traditional question-answering datasets with predefined supervised labels, which do not align with the superior generation capabilities of contemporary LLMs. To address this issue, we propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools to evaluate the longer conversations that LLMs generate in response to more challenging open questions. We refer to this approach as the Reward Model for Reasonable Robustness Evaluation (TREvaL). Longer conversations reveal how comprehensively a language model understands a question, a capability that is not fully captured by answers consisting of individual words or letters, which may be oversimplified and inherently biased. Our extensive empirical experiments demonstrate that TREvaL provides an innovative method for evaluating the robustness of an LLM. Furthermore, our results show that LLMs are frequently vulnerable to word-level perturbations that are commonplace in daily language usage. Notably, we were surprised to find that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted. The code of TREvaL is available at https://github.com/Harry-mic/TREvaL.
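
The evaluation idea in the abstract (perturb a prompt at the word level, generate a response, and score it with a reward model) can be illustrated with a minimal sketch. This is not the TREvaL implementation: the function names `swap_adjacent_words` and `mean_reward_drop`, the choice of adjacent-word swapping as the perturbation, and the decision to score the perturbed response against the original prompt are all assumptions made here for illustration. The `generate` and `reward` arguments are placeholders for an LLM and a pre-trained reward model supplied by the caller.

```python
import random
from typing import Callable, List


def swap_adjacent_words(prompt: str, rng: random.Random) -> str:
    """Hypothetical word-level perturbation: swap one random pair of adjacent words."""
    words = prompt.split()
    if len(words) < 2:
        return prompt  # nothing to swap
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def mean_reward_drop(
    prompts: List[str],
    generate: Callable[[str], str],       # placeholder for the LLM under test
    reward: Callable[[str, str], float],  # placeholder for a reward model: (prompt, response) -> score
    perturb: Callable[[str, random.Random], str] = swap_adjacent_words,
    seed: int = 0,
) -> float:
    """Average drop in reward when each prompt is perturbed before generation.

    Assumption of this sketch: both responses are scored against the
    original (clean) prompt, so the score reflects how well the model
    still answers the intended question.
    """
    if not prompts:
        raise ValueError("prompts must be non-empty")
    rng = random.Random(seed)
    drops = []
    for prompt in prompts:
        clean_score = reward(prompt, generate(prompt))
        noisy_score = reward(prompt, generate(perturb(prompt, rng)))
        drops.append(clean_score - noisy_score)
    return sum(drops) / len(drops)
```

Under these assumptions, a larger positive value means the model's responses degrade more under perturbation, and a value near zero means the reward model sees little difference between clean and perturbed prompts.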

Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets

Methods for Estimating and Improving Robustness of Language Models

Are Large Language Models Really Robust to Word-Level Perturbations?

Robustness of LLMs to Perturbations in Text

Robustness Analysis of Video-Language Models Against Visual and Language Perturbations

On Adversarial Robustness and Out-of-Distribution Robustness of Large Language Models

Robustness of Large Language Models Against Adversarial Attacks

Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task

Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models

Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness

Improving the Robustness of Large Language Models via Consistency Alignment

RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts

A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios

Robustifying Language Models with Test-Time Adaptation

Large Language Models with Controllable Working Memory

Noisy Exemplars Make Large Language Models More Robust: A Domain-Agnostic Behavioral Analysis

PromptRobust: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts

Robustness Testing of Language Understanding in Task-Oriented Dialog

On the Adversarial Robustness of Instruction-Tuned Large Language Models for Code

VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation