Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness

Ashim Gupta,Rishanth Rajendhran,Nathan Stringham,Vivek Srikumar,Ana Marasović

2024-04-03

Abstract: Do larger and more performant models resolve NLP's longstanding robustness issues? We investigate this question using over 20 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) out-of-domain and challenge test sets, (b) behavioral testing with CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all out-of-domain tests provide insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them adequately robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.

Computation and Language

What problem does this paper attempt to address?

This paper asks whether larger and more performant language models have resolved the longstanding robustness problems in natural language processing (NLP). The authors evaluate more than 20 models of different sizes, spanning different architectures and pretraining objectives, using four kinds of tests:

1. **Out-of-Domain and Challenge Test Sets**: Evaluate the model on data drawn from a different distribution than its training data.
2. **Behavioral Testing**: Use CheckLists to check whether the model has the basic capabilities a task requires.
3. **Contrast Sets**: Evaluate the model on minimally edited versions of test examples.
4. **Adversarial Inputs**: Test the model under adversarial attacks.

The authors find that not all out-of-domain tests provide insight into robustness, and that although scaling models improves performance in some respects, CheckLists and contrast sets still reveal significant gaps. In addition, current adversarial evaluation methods are themselves problematic: they can be easily thwarted and do not probe model robustness deeply enough. The paper therefore concludes that the robustness problem in NLP remains unresolved, and that some of the methods used to measure robustness need to be reassessed.
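To make the contrast-set idea concrete, here is a minimal sketch of how such an evaluation might be scored. It is not taken from the paper: the function `contrast_scores`, the toy classifier `toy_predict`, and the example data are all hypothetical. It assumes each bundle holds an original example followed by its edited variants, and it reports the share of bundles on which the model gets every example right, a measure commonly called contrast consistency.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, gold label)


def contrast_scores(
    predict: Callable[[str], str],
    bundles: List[List[Example]],
) -> Dict[str, float]:
    """Score a classifier on contrast-set bundles.

    Assumption: each bundle is non-empty and lists the original
    example first, followed by its minimally edited variants.
    """
    if not bundles:
        raise ValueError("need at least one bundle")
    original_hits = 0
    consistent = 0
    for bundle in bundles:
        results = [predict(text) == gold for text, gold in bundle]
        original_hits += results[0]   # correct on the unedited example
        consistent += all(results)    # correct on every example in the bundle
    n = len(bundles)
    return {
        "original_accuracy": original_hits / n,
        "consistency": consistent / n,
    }


def toy_predict(text: str) -> str:
    # Hypothetical keyword classifier standing in for a real model.
    return "positive" if "good" in text else "negative"


# Hypothetical data: the first bundle's edit flips the label via negation.
toy_bundles = [
    [("the food was good", "positive"), ("the food was not good", "negative")],
    [("a bad film", "negative"), ("a good film", "positive")],
]

scores = contrast_scores(toy_predict, toy_bundles)
print(scores)
```

On this toy data the keyword classifier gets both original examples right but misses the negated variant, so its consistency falls below its accuracy on the originals. That is the kind of gap a contrast set is meant to expose.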

Methods for Estimating and Improving Robustness of Language Models

Robust Natural Language Processing: Recent Advances, Challenges, and Future Directions

Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets

Certified Robustness to Adversarial Word Substitutions

ROBY: Evaluating the adversarial robustness of a deep model by its decision boundaries

From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework

Robustness of LLMs to Perturbations in Text

How many perturbations break this model? Evaluating robustness beyond adversarial accuracy

Measuring Neural Net Robustness with Constraints

Model-tuning Via Prompts Makes NLP Models Adversarially Robust

SelfPrompt: Autonomously Evaluating LLM Robustness via Domain-Constrained Knowledge Guidelines and Refined Adversarial Prompts

Measure and Improve Robustness in NLP Models: A Survey

Extreme Miscalibration and the Illusion of Adversarial Robustness

Improving Robustness of Task Oriented Dialog Systems

Enhancing Model Robustness Via Lexical Distilling

There is more than one kind of robustness: Fooling Whisper with adversarial examples

A Multilingual Evaluation of NER Robustness to Adversarial Inputs

Exploring Scaling Trends in LLM Robustness

A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios

Assessing Adversarial Robustness of Large Language Models: An Empirical Study