Abstract: Text-to-image (T2I) synthesis has advanced significantly in recent years, particularly with the emergence of Large Language Models (LLMs) and their integration into Large Vision Models (LVMs), which has greatly improved the instruction-following capabilities of traditional T2I models. Nevertheless, previous methods focus on improving generation quality but introduce unsafe factors into prompts. We find that appending specific camera descriptions to prompts can enhance safety performance. Consequently, we propose a simple and safe prompt engineering method (SSP) that improves image generation quality by providing optimal camera descriptions. Specifically, we build a dataset of original prompts drawn from multiple datasets. To select the optimal camera, we design an optimal camera matching approach and implement a classifier that automatically matches each original prompt to a camera. Appending the matched camera description to the original prompt yields an optimized prompt for LVM image generation. Experiments demonstrate that SSP improves semantic consistency by an average of 16% compared with other methods and safety metrics by 48.9%.
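The core idea in the abstract, matching a prompt to a camera description and appending it without touching the original wording, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the theme names, the camera descriptions, and the keyword lookup in `classify_theme` are hypothetical placeholders, and the keyword lookup only stands in for the classifier the paper describes.

```python
import re

# Hypothetical theme -> camera description table. These entries are
# illustrative placeholders, not the pairs reported in the paper.
CAMERA_BY_THEME = {
    "portrait": "shot on a full-frame camera with an 85mm lens",
    "landscape": "shot on a full-frame camera with a 24mm wide-angle lens",
    "wildlife": "shot on a camera with a 400mm telephoto lens",
}
DEFAULT_THEME = "landscape"

# Toy keyword lists that stand in for a learned text classifier.
THEME_KEYWORDS = {
    "portrait": {"person", "woman", "man", "child", "face"},
    "wildlife": {"bird", "deer", "lion", "fox", "eagle"},
    "landscape": {"mountain", "lake", "forest", "beach", "valley"},
}


def classify_theme(prompt: str) -> str:
    """Return the theme whose keywords overlap most with the prompt."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    best_theme, best_hits = DEFAULT_THEME, 0
    for theme, keywords in THEME_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_theme, best_hits = theme, hits
    return best_theme


def optimize_prompt(prompt: str) -> str:
    """Append the matched camera description; the original text is kept as-is."""
    theme = classify_theme(prompt)
    return f"{prompt.rstrip(' .,')}, {CAMERA_BY_THEME[theme]}"


print(optimize_prompt("A lion resting in the grass."))
```

Because the function only appends a suffix, the original prompt survives unchanged as a prefix of the optimized prompt, which is the property the abstract relies on to avoid altering semantics.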

What problem does this paper attempt to address?

This paper addresses how to improve the quality and safety of images generated by text-to-image (T2I) synthesis without introducing unsafe factors. Although existing methods have improved image generation quality, they often optimize prompts by adding words at random, which can change the original semantics and introduce unsafe factors. To solve this problem, the paper proposes a simple and safe automatic prompt engineering method (SSP), which improves image generation quality by appending optimal camera descriptions to prompts while keeping the generation process safe.

### Main contributions

1. **Dataset release**: The authors release a new dataset for optimizing image-generation prompts, suitable for visual prompt optimization tasks.
2. **Proposing the SSP method**: SSP aims to improve image-generation quality by providing optimal camera descriptions, without changing the original content or introducing unsafe factors.
3. **Experimental verification**: Extensive experiments show that, compared with two strong baseline methods, SSP improves prompt consistency by an average of 16%, text-image alignment by 5%, and safety metrics by 48.9%.
4. **Text-feature analysis**: A text-feature analysis demonstrates the effectiveness of prompt engineering for optimizing large models, which may inspire other prompt-driven optimization strategies.

### Method overview

1. **Dataset construction**: Original prompts are collected from multiple public datasets, including MSCOCO, ImageNet, and DiffusionDB. GPT-4 is used to label and summarize these data to produce the final set of original prompts.
2. **Optimal camera selection**: Original prompts are manually classified by shooting theme, and modified prompts are generated by appending different camera descriptions. FID and CLIP Score are used to evaluate the generated images and select the optimal camera.
3. **Prompt optimization**: A BERT-based optimal camera matching method automatically matches each original prompt to its optimal camera description and generates the optimized prompt.
4. **Image generation**: Optimized prompts are input into GPT-4 to generate images, and GPTQuery is used to ensure that the generated images are aligned with the prompts.

### Experimental results

- **Qualitative analysis**: Images generated with SSP are more realistic and visually appealing, and are highly consistent with the input prompts.
- **Quantitative analysis**: SSP outperforms the baseline methods on multiple metrics, including FID, CLIP Score, and user studies.
- **Safety detection**: Prompts generated by SSP have the lowest score in the Detoxify text-toxicity evaluation and the lowest rejection rate under GPT-4's built-in safety check, indicating that the generated content is safer.

### Conclusion

The paper proposes SSP, a simple and safe prompt engineering method that improves the quality and safety of text-to-image synthesis by appending optimal camera descriptions. It has some limitations, such as relying on the FID metric to evaluate authenticity and lacking comparisons with other LVMs, but it performs well in multiple respects and has high application potential. Future research will improve the authenticity evaluation metrics and explore more general prompt engineering methods.
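The optimal camera selection step in the method overview uses FID and CLIP Score together, but this summary does not say how the two are combined. The sketch below assumes a simple rank-sum rule (lower FID is better, higher CLIP Score is better, ties broken by lower FID); that rule, the function names, and the example numbers are all assumptions for illustration rather than the paper's procedure or results.

```python
from typing import Dict, Tuple


def rank(values: Dict[str, float], higher_is_better: bool) -> Dict[str, int]:
    """Map each name to its rank position, where 0 is the best."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    return {name: position for position, name in enumerate(ordered)}


def select_camera(scores: Dict[str, Tuple[float, float]]) -> str:
    """Pick a camera from {camera description: (fid, clip_score)}.

    Assumed rule: minimise the sum of the FID rank and the CLIP Score rank,
    breaking ties in favour of the lower FID.
    """
    fid_rank = rank({c: s[0] for c, s in scores.items()}, higher_is_better=False)
    clip_rank = rank({c: s[1] for c, s in scores.items()}, higher_is_better=True)
    return min(scores, key=lambda c: (fid_rank[c] + clip_rank[c], scores[c][0]))


# Placeholder scores for one shooting theme; not values from the paper.
example_scores = {
    "35mm lens": (42.0, 0.31),
    "85mm lens": (38.5, 0.33),
    "fisheye lens": (55.2, 0.29),
}
print(select_camera(example_scores))
```

A rank-based rule sidesteps the fact that FID and CLIP Score live on different scales; a weighted sum of normalised scores would be an equally plausible choice under the same description.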

SSP: A Simple and Safe automatic Prompt engineering method towards realistic image synthesis on LVM

SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition

Improving Text-to-Image Consistency via Automatic Prompt Optimization

Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation

What Do You Want? User-centric Prompt Generation for Text-to-image Synthesis via Multi-turn Guidance

Batch-Instructed Gradient for Prompt Evolution: Systematic Prompt Optimization for Enhanced Text-to-Image Synthesis

SurrogatePrompt: Bypassing the Safety Filter of Text-to-Image Models via Substitution

Adaptive Multi-Modality Prompt Learning

Enhance Image-to-Image Generation with LLaVA-generated Prompts

Mutual Prompt Leaning for Vision Language Models

Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding

PromptFix: You Prompt and We Fix the Photo

BSPA: Exploring Black-box Stealthy Prompt Attacks Against Image Generators

SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt

PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation

LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts

Harnessing LLM to Attack LLM-Guarded Text-to-Image Models

Optimizing Prompts for Text-to-Image Generation

Dynamic Prompt Optimizing for Text-to-Image Generation