Membership Inference Attacks against Large Vision-Language Models

Zhan Li, Yongtao Wu, Yihang Chen, Francesco Tonin, Elias Abad Rocamora, Volkan Cevher
2024-11-05
Abstract: Large vision-language models (VLLMs) exhibit promising capabilities for processing multi-modal tasks across various application scenarios. However, their emergence also raises significant data security concerns, given the potential inclusion of sensitive information, such as private photos and medical records, in their training datasets. Detecting inappropriately used data in VLLMs remains a critical and unresolved issue, mainly due to the lack of standardized datasets and suitable methodologies. In this study, we introduce the first membership inference attack (MIA) benchmark tailored for various VLLMs to facilitate training data detection. Then, we propose a novel MIA pipeline specifically designed for token-level image detection. Lastly, we present a new metric called MaxRényi-K%, which is based on the confidence of the model output and applies to both text and image data. We believe that our work can deepen the understanding and methodology of MIAs in the context of VLLMs. Our code and datasets are available at https://github.com/LIONS-EPFL/VL-MIA.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Cryptography and Security, Machine Learning
What problem does this paper attempt to address?
This paper addresses membership inference attacks (MIAs) against large vision-language models (VLLMs). Specifically, the research aims to:

1. **Construct an MIA benchmark for VLLMs**: Because standardized datasets and suitable methodologies are lacking, detecting inappropriately used training data in VLLMs remains a critical and unsolved problem. To this end, the authors introduce the first MIA benchmark tailored to various VLLMs.
2. **Develop new MIA methods**: Practical applications more often call for detection on a single modality, so the authors propose a new MIA pipeline designed for token-level image detection. They also introduce a new metric, MaxRényi-K%, which is based on the confidence of the model output and applies to both text and image data.
3. **Improve the understanding and methodology of MIAs**: Through this work, the authors hope to deepen the understanding and methodology of MIAs in the context of VLLMs, thereby better protecting user privacy and preventing knowledge leakage.

### Specific Problem Description

VLLMs perform well on multimodal tasks, but they also raise data security concerns, especially when their training datasets contain sensitive information such as private photos and medical records. Traditional MIA methods mainly target a single modality (text-only or image-only), and the multimodal nature of VLLMs makes it difficult to apply them directly. The paper therefore proposes the following innovations:

- **VL-MIA benchmark**: Using resources such as Flickr and GPT-4, the authors construct a dataset containing image and text MIA tasks that is applicable to multiple VLLMs.
- **Cross-modal MIA pipeline**: A new cross-modal MIA pipeline detects whether a single image or description belongs to the training set. The pipeline can compute its statistics from image slices as well as from instruction and description slices.
- **MaxRényi-K% metric**: A new metric based on Rényi entropy that applies to both image and text MIAs and can be further modified into a target-based variant.

### Summary

The paper's core objective is to advance the understanding of MIAs against VLLMs, and with it the ability to protect user privacy and data security, by constructing a new benchmark, developing new MIA methods, and introducing a new metric.
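
To make the MaxRényi-K% idea more concrete, the following is a minimal Python sketch of a Rényi-entropy-based score. The summary above does not give the exact definition, so this reads the name literally and assumes the score is the mean of the largest K% per-token Rényi entropies of the model's next-token distributions. The function names (`renyi_entropy`, `max_renyi_k`), the parameter names, and the defaults `alpha=0.5` and `k_percent=10.0` are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np


def renyi_entropy(probs, alpha):
    """Rényi entropy of order alpha (> 0) for each row of `probs`.

    `probs` has shape (positions, vocabulary) and each row sums to 1.
    """
    p = np.asarray(probs, dtype=np.float64)
    if np.isinf(alpha):
        # Limit alpha -> infinity: min-entropy, -log(max_i p_i).
        return -np.log(p.max(axis=-1))
    if np.isclose(alpha, 1.0):
        # Limit alpha -> 1: Shannon entropy, treating 0 * log 0 as 0.
        safe = np.where(p > 0, p, 1.0)
        return -(p * np.log(safe)).sum(axis=-1)
    return np.log((p ** alpha).sum(axis=-1)) / (1.0 - alpha)


def max_renyi_k(probs, alpha=0.5, k_percent=10.0):
    """Mean of the largest k_percent% per-position Rényi entropies.

    Assumed reading of "MaxRényi-K%"; not the authors' implementation.
    """
    entropies = renyi_entropy(probs, alpha)
    n_keep = max(1, int(np.ceil(len(entropies) * k_percent / 100.0)))
    top = np.sort(entropies)[-n_keep:]  # the n_keep largest values
    return float(top.mean())


# Hypothetical usage: 8 token positions over a toy 5-word vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
score = max_renyi_k(probs, alpha=0.5, k_percent=25.0)
```

Under the common assumption that a model is more confident on data it was trained on, a lower score would point toward membership. In practice the probabilities would come from the slice of the VLLM output that corresponds to the image, instruction, or description tokens, and the decision threshold would be calibrated on known member and non-member samples.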