Abstract: Numerous adversarial attack methods have been developed to generate imperceptible image perturbations that cause erroneous predictions in state-of-the-art machine learning (ML) models, in particular deep neural networks (DNNs). Despite intense research on adversarial attacks, little effort has been made to uncover the 'arcana' carried in adversarial attacks. In this work, we ask whether it is possible to infer data-agnostic victim model (VM) information (i.e., characteristics of the ML model or DNN used to generate adversarial attacks) from data-specific adversarial instances. We call this 'model parsing of adversarial attacks': the task of uncovering 'arcana' in the form of the VM information concealed in attacks. We approach model parsing via supervised learning, which assigns the classes of the VM's model attributes (architecture type, kernel size, activation function, and weight sparsity) to an attack instance generated from that VM. We collect a dataset of adversarial attacks across 7 attack types generated from 135 victim models (configured by 5 architecture types, 3 kernel size setups, 3 activation function types, and 3 weight sparsity ratios). We show that a simple, supervised model parsing network (MPN) is able to infer VM attributes from unseen adversarial attacks if their attack settings are consistent with the training setting (i.e., in-distribution generalization assessment). We also provide extensive experiments that justify the feasibility of VM parsing from adversarial attacks and examine the influence of training and evaluation factors on parsing performance (e.g., the generalization challenge raised by out-of-distribution evaluation). We further demonstrate how the proposed MPN can be used to uncover the source VM attributes from transfer attacks, and shed light on a potential connection between model parsing and attack transferability.
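
The count of 135 victim models follows from the attribute grid stated in the abstract: 5 × 3 × 3 × 3 = 135. The sketch below enumerates that grid as tuples of integer class indices, one index per attribute, which is also the form a four-way label for supervised model parsing could take. It is only an illustration under stated assumptions: the names `VictimConfig`, `NUM_CLASSES`, and `enumerate_victim_configs` are hypothetical, and the integer indices stand in for the actual architecture, kernel size, activation, and sparsity values, which the abstract does not list.

```python
from itertools import product
from typing import List, NamedTuple


class VictimConfig(NamedTuple):
    """Hypothetical label tuple: one class index per VM attribute."""
    architecture: int
    kernel_size: int
    activation: int
    sparsity: int


# Number of classes per attribute, as stated in the abstract.
# The concrete values behind each index are not given there.
NUM_CLASSES = {
    "architecture": 5,
    "kernel_size": 3,
    "activation": 3,
    "sparsity": 3,
}


def enumerate_victim_configs() -> List[VictimConfig]:
    """Return every combination of attribute class indices."""
    ranges = [range(NUM_CLASSES[name]) for name in VictimConfig._fields]
    return [VictimConfig(*combo) for combo in product(*ranges)]


configs = enumerate_victim_configs()
print(len(configs))  # 135
```

Under this framing, each adversarial example generated from a given victim model would carry that model's `VictimConfig` as its target, so the parsing task decomposes into four classification problems with 5, 3, 3, and 3 classes respectively.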

Can Adversarial Examples Be Parsed to Reveal Victim Model Information?
