Abstract: The increasing adoption of deep learning for automating downstream natural language processing (NLP) tasks has created a need to improve how well these models assess linguistic acceptability. The CoLA corpus was created to support the development of models that accurately judge grammatical acceptability and evaluate linguistic proficiency. Transformer models, which are widely used across NLP tasks including linguistic acceptability, have limitations that undermine their perceived robustness. In particular, they are susceptible to adversarial text attacks, which make inconspicuous modifications to the original input text. The modifications are chosen so that the resulting adversarial examples are still classified correctly by human observers yet mislead the targeted model, reducing its reliability. This paper presents a novel framework called 'Homograph' for generating adversarial text in a black-box setting. The attack is effective against linguistic acceptability models because it generates visually similar adversarial examples that do not compromise the grammatical acceptability of the original input samples. These examples deceive the model into changing its predicted label. For the linguistic acceptability task, we applied the attack to five transformer models fine-tuned on the CoLA dataset: ALBERT, BERT, DistilBERT, RoBERTa, and XLNet. Our work differs from existing text-based attacks in three ways. First, we surpass previous baselines in attack success rate and average perturbation rate for models trained on the CoLA dataset. Second, we generate more potent adversarial examples whose modifications are imperceptible and therefore preserve the original label. Third, we use a straightforward character-level transformation to produce adversarial examples that closely resemble the original text.
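As a rough illustration of what a character-level, visually similar transformation can look like, the sketch below swaps a few Latin letters for Cyrillic look-alikes and measures how much of the text changed. It is not the paper's method: the `HOMOGLYPHS` table, the `perturb` and `perturbation_rate` helpers, the random choice of positions, and the character-level definition of perturbation rate are all assumptions made for this example. A real black-box attack would also query the target model to decide which substitutions to keep.

```python
import random

# Hypothetical Latin -> Cyrillic look-alike table (an assumption for this
# sketch; the abstract does not list the substitutions the attack uses).
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "c": "\u0441",  # Cyrillic small es
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "x": "\u0445",  # Cyrillic small ha
    "y": "\u0443",  # Cyrillic small u
}


def perturb(text, max_swaps=1, seed=0):
    """Replace up to max_swaps characters with look-alike code points.

    Positions are chosen at random here; this stands in for whatever
    selection strategy an actual attack would use.
    """
    rng = random.Random(seed)
    candidates = [i for i, ch in enumerate(text) if ch in HOMOGLYPHS]
    chosen = rng.sample(candidates, min(max_swaps, len(candidates)))
    chars = list(text)
    for i in chosen:
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)


def perturbation_rate(original, adversarial):
    """Fraction of character positions that differ (assumed definition)."""
    if len(original) != len(adversarial):
        raise ValueError("texts must have the same length")
    if not original:
        return 0.0
    changed = sum(a != b for a, b in zip(original, adversarial))
    return changed / len(original)
```

For example, `perturb("The cat sat on the mat.", max_swaps=2)` returns a string of the same length that renders almost identically but differs in two code points, so a tokenizer sees different input while a human reader sees the same sentence.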

HOMOGRAPH: a novel textual adversarial attack architecture to unmask the susceptibility of linguistic acceptability classifiers
