Abstract: Recently, the success of Text-to-Image (T2I) models has led to the rise of numerous third-party platforms, which claim to provide cheaper API services and more flexibility in model options. However, this also raises a new security concern: Are these third-party services truly offering the models they claim? To address this problem, we propose the first T2I model verification method named Text-to-Image Model Verification via Non-Transferable Adversarial Attacks (TVN). The non-transferability of adversarial examples means that these examples are only effective on a target model and ineffective on other models, thereby allowing for the verification of the target model. TVN utilizes the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to optimize the cosine similarity of a prompt's text encoding, generating non-transferable adversarial prompts. By calculating the CLIP-text scores between the non-transferable adversarial prompts without perturbations and the images, we can verify if the model matches the claimed target model, based on a 3-sigma threshold. The experiments showed that TVN performed well in both closed-set and open-set scenarios, achieving a verification accuracy of over 90%. Moreover, the adversarial prompts generated by TVN significantly reduced the CLIP-text scores of the target model, while having little effect on other models.
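The 3-sigma decision mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the threshold is calibrated, so this assumes one plausible reading: a set of reference CLIP-text scores is collected from the genuine target model, and a queried model is accepted if its score falls within three standard deviations of the reference mean. The function names and all score values are hypothetical.

```python
from statistics import mean, stdev


def three_sigma_band(reference_scores):
    """Return (low, high) = mean -/+ 3 * sample standard deviation."""
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)
    return mu - 3 * sigma, mu + 3 * sigma


def is_claimed_model(observed_score, reference_scores):
    """Accept the queried model if its score lies inside the 3-sigma band.

    Assumption: reference_scores were measured on the genuine target model
    with the same adversarial prompts; the source does not specify this.
    """
    low, high = three_sigma_band(reference_scores)
    return low <= observed_score <= high


# Hypothetical CLIP-text scores of the target model under adversarial prompts.
reference = [0.12, 0.14, 0.13, 0.15, 0.11]

accepted = is_claimed_model(0.14, reference)  # score close to the reference
rejected = is_claimed_model(0.30, reference)  # score far above the reference
```

With these made-up numbers, a score of 0.14 falls inside the band and a score of 0.30 falls outside it, which matches the intuition that a non-target model is barely affected by the adversarial prompt and therefore keeps a high CLIP-text score.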

What problem does this paper attempt to address?

This paper addresses the problem of verifying text-to-image (T2I) models in a black-box setting. As T2I models have become widely used, many third-party platforms claim to provide cheaper API services and a wider choice of models. This raises a new security question: do these third-party services actually provide the models they claim?

#### Background problems

1. **Security of third-party platforms**: A third-party platform may claim to provide an expensive model (such as DALL-E 3) while actually serving a lower-cost model (such as Stable Diffusion v1.4). This behavior may lead to illegal profits and harm users' rights and interests.
2. **Limitations of existing methods**: Existing model verification methods mainly target large language models (LLMs). They identify the model version by sending carefully designed queries and analyzing the responses. These methods do not carry over to T2I models, because T2I models output images rather than text and cannot directly convey information about themselves.

#### Solutions

To solve these problems, the authors propose the first method for verifying T2I models, called Text-to-Image Model Verification via Non-Transferable Adversarial Attacks (TVN). The main ideas of TVN are:

- **Generate non-transferable adversarial samples**: Optimize perturbations so that the adversarial samples are effective only on the target model and ineffective on other models.
- **Calculate CLIP-text scores**: Calculate the CLIP-text score between the generated image and the original (unperturbed) prompt, and decide whether the model is the target model according to a 3-sigma threshold.

#### Specific implementation

- **NSGA-II optimization algorithm**: Use the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to optimize the adversarial samples so that they are non-transferable.
- **Evaluation metrics**: Evaluate the effectiveness of TVN with CLIP-text scores, accuracy, precision, recall, and F1-score.

#### Experimental results

- **Closed-set scenario**: TVN performs well in the closed-set scenario. The CLIP-text score of the target model drops significantly, while other models are only slightly affected.
- **Open-set scenario**: TVN also performs well in the open-set scenario and can effectively distinguish the target model from other models.

In conclusion, this paper proposes a method for verifying T2I models in a black-box setting, addresses the risk of model substitution by third-party platforms, and offers a solution suited to practical use.
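The NSGA-II step listed under the implementation details rests on non-dominated sorting, which can be sketched in a few lines. This is only one building block: full NSGA-II also ranks successive fronts, computes crowding distance, and applies selection, crossover, and mutation. The two objectives used here are assumptions for illustration, since the summary does not state them: the first is the cosine similarity between original and perturbed prompt encodings under the target model's text encoder (lower means the attack works on the target), and the second is one minus the same similarity under a non-target encoder (lower means the attack does not transfer). Both are minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better
    on at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def first_front(points):
    """Indices of the points that no other point dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical objective values for four candidate adversarial prompts:
# (similarity under target encoder, 1 - similarity under non-target encoder)
candidates = [(0.2, 0.1), (0.3, 0.05), (0.4, 0.3), (0.2, 0.4)]

pareto_indices = first_front(candidates)  # -> [0, 1]
```

Candidates 0 and 1 trade off the two objectives against each other, so neither dominates the other; candidates 2 and 3 are both dominated by candidate 0 and would be ranked in a later front.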

One Prompt to Verify Your Models: Black-Box Text-to-Image Models Verification via Non-Transferable Adversarial Attacks
