Abstract: Neural networks, such as image classifiers, are frequently trained on proprietary and confidential datasets. It is generally assumed that once a model is deployed, the training data remains secure, as adversaries are limited to query-response interactions with the model, where at best, fragments of arbitrary data can be inferred without any guarantees on their authenticity. In this paper, we propose the memory backdoor attack, where a model is covertly trained to memorize specific training samples and later selectively output them when triggered with an index pattern. What makes this attack unique is that it (1) works even when the tasks conflict (making a classifier output images), (2) enables the systematic extraction of training samples from deployed models, and (3) offers guarantees on the authenticity of the extracted data. We demonstrate the attack on image classifiers, segmentation models, and a large language model (LLM). With this attack, it is possible to hide thousands of images and texts in modern vision architectures and LLMs respectively, all while maintaining model performance. The memory backdoor attack poses a significant threat not only to conventional model deployments but also to federated learning paradigms and other modern frameworks. Therefore, we suggest an efficient and effective countermeasure that can be immediately applied and advocate for further work on the topic.
Cryptography and Security, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses whether training data can be covertly extracted from a neural network after the model is deployed, especially when that data contains sensitive or proprietary information. Specifically, it proposes an attack named the "memory backdoor", in which a model is covertly trained to memorize specific training samples and to output them selectively when triggered. The attack is distinctive in three ways:
1. **Task conflict**: Even if there are conflicts between tasks (for example, making a classifier output an image), the attack is still effective.
2. **Systematic extraction**: It can systematically extract training samples from the deployed model and ensure the authenticity of the extracted data.
3. **Wide applicability**: It is applicable not only to traditional model deployment but also to modern frameworks such as federated learning.
### Main contributions of the paper
1. **Identifying memory backdoor attacks**: The paper is the first to identify this new type of attack, the memory backdoor, which can target both predictive and generative models. An attacker embeds the backdoor through the training data or the training code and can then extract training data with only black-box access to the model, which raises significant concerns about data privacy.
2. **Implementing a memory backdoor for predictive vision models**: The paper proposes a specific implementation named "Pixel Pirate", which can deterministically extract memorized images and works across architectures, including fully-connected models, convolutional neural networks, and vision transformers.
3. **Innovative indexing method**: The paper proposes a new method for indexing memorized samples in vision models. The index triggers the extraction task, allows memorized image blocks to be located and extracted systematically, and determines the position of each block within its image.
4. **Detection method**: The paper proposes a simple and effective detection method based on image entropy to detect the trigger pattern of Pixel Pirate. At the same time, the paper points out that the trigger pattern may become more covert and calls on the community to further study better solutions.
5. **Memory backdoor in large language models**: The paper verifies the threat of memory backdoors in large language models and shows how to systematically extract complete training samples with a single query, which poses a significant threat to the confidentiality of text-based training datasets.
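
The entropy-based detection in contribution 4 can be illustrated with a minimal sketch. It assumes that synthetic trigger patterns have far lower intensity entropy than natural images. The function names and the 1.0-bit threshold are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def image_entropy(pixels):
    """Shannon entropy, in bits, of the intensity histogram of a flat pixel list."""
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def looks_like_trigger(pixels, threshold=1.0):
    """Flag an input whose entropy falls below a hypothetical threshold (assumption)."""
    return image_entropy(pixels) < threshold

flat_pattern = [0] * 64            # a single intensity: zero entropy
varied_image = list(range(256))    # 256 equally frequent intensities: 8 bits

print(looks_like_trigger(flat_pattern))   # flagged as a possible trigger
print(looks_like_trigger(varied_image))   # passed through as a normal input
```

A deployment would apply such a check to each query before it reaches the model. As the paper notes, more covert trigger patterns could evade a filter of this kind.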
### Motivation and threat model
The paper assumes that the attacker's goal is to extract specific samples from an organization's private data set, and the motivation may be to violate data confidentiality or misappropriate intellectual property. An attacker can exert influence in the following ways:
- **Data manipulation**: Modify the data set, inject specific samples or change existing samples.
- **Training code tampering**: Tamper with the training code or loss function.
- **Internal threat**: An insider or a legitimate user tampers with the training code.
To maintain covertness, an attacker cannot directly export data, nor can they significantly reduce the performance of the model. An attacker needs to ensure the authenticity of the extracted data, that is, the extracted samples are indeed from the original data set, not false data generated by the model.
### Background and related work
The paper reviews two related fields: backdoor attacks and data extraction attacks. The goal of backdoor attacks is to embed hidden functions during the training process so that the model produces the output specified by the attacker when a specific trigger pattern is input. Data extraction attacks obtain information about the data set through interaction with the model, including methods such as attribute inference, membership inference, and model inversion.
### Definition and implementation of memory backdoor
The paper formally defines the concept of a memory backdoor and introduces the specific implementation "Pixel Pirate". A trigger function maps each index to a trigger pattern, and when the backdoored model receives that pattern it outputs the corresponding data fragment. By systematically traversing the index space, the attacker can recombine these fragments to reconstruct the target dataset completely.
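
The traversal described above can be sketched as follows. This is a toy illustration only: the "model" is a lookup table standing in for a backdoored network, and `make_trigger`, `toy_model`, and `extract_all` are hypothetical names, not the paper's implementation.

```python
# Toy stand-in for a backdoored model: a lookup table, not a neural network.
SECRET_FRAGMENTS = ["fragment-0", "fragment-1", "fragment-2"]

def make_trigger(index):
    """Hypothetical trigger function: wrap the index in a recognizable pattern."""
    return ("TRIGGER", index)

def toy_model(x):
    """Return a memorized fragment for a trigger, otherwise a normal prediction."""
    if isinstance(x, tuple) and len(x) == 2 and x[0] == "TRIGGER":
        return SECRET_FRAGMENTS[x[1]]
    return "benign prediction"

def extract_all(model, trigger_fn, index_space):
    """Query the model once per index and collect the fragments in index order."""
    return [model(trigger_fn(index)) for index in index_space]

recovered = extract_all(toy_model, make_trigger, range(len(SECRET_FRAGMENTS)))
print(recovered)
```

Because each index maps to exactly one memorized fragment, the attacker knows which sample every output belongs to. This is what separates the extraction from guessing at model outputs.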
### Specific methods for attacking visual models
The paper details how to implement the memory backdoor in predictive vision models, especially image classifiers. The main challenges are that the output dimension is much smaller than the input dimension and that index trigger patterns must be identified effectively. The paper proposes a block-reconstruction method and designs a four-dimensional index space and two trigger functions (pattern-based and code-based).
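
To illustrate why block-wise reconstruction helps when each query can return only a few values, the sketch below splits an image into small blocks keyed by a four-part index and then reassembles it. The index layout used here (sample, channel, block row, block column) is an assumption made for illustration. The summary does not specify the paper's four dimensions, and the function names are hypothetical.

```python
from itertools import product

def split_into_blocks(sample_id, image, size):
    """Split a channels x height x width image (nested lists) into flat blocks.

    Keys are the hypothetical index (sample_id, channel, block_row, block_col).
    Height and width are assumed to be divisible by `size`.
    """
    blocks = {}
    for channel, plane in enumerate(image):
        rows, cols = len(plane) // size, len(plane[0]) // size
        for block_row, block_col in product(range(rows), range(cols)):
            blocks[(sample_id, channel, block_row, block_col)] = [
                plane[block_row * size + dy][block_col * size + dx]
                for dy in range(size)
                for dx in range(size)
            ]
    return blocks

def reassemble(blocks, sample_id, channels, height, width, size):
    """Place every block of one sample back at the position given by its index."""
    image = [[[None] * width for _ in range(height)] for _ in range(channels)]
    for (sample, channel, block_row, block_col), values in blocks.items():
        if sample != sample_id:
            continue
        for offset, value in enumerate(values):
            dy, dx = divmod(offset, size)
            image[channel][block_row * size + dy][block_col * size + dx] = value
    return image

# A 2-channel 4x4 image with distinct values, split into 2x2 blocks.
image = [[[c * 16 + y * 4 + x for x in range(4)] for y in range(4)] for c in range(2)]
blocks = split_into_blocks(0, image, size=2)
restored = reassemble(blocks, 0, channels=2, height=4, width=4, size=2)
print(len(blocks), restored == image)
```

Each block holds only `size * size` values, so a single response can stay within a small output vector, while the index tells the attacker where the block belongs.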
In conclusion, this paper reveals the potential threats to data privacy in neural network models by proposing memory backdoor attacks and provides specific implementation methods and detection means.