Abstract: Prompt engineering is a technique that augments a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be created manually as natural language instructions or generated automatically as either natural language instructions or vector representations. Prompt engineering makes it possible to perform predictions based solely on prompts, without updating model parameters, and makes large pre-trained models easier to apply to real-world tasks. In past years, prompt engineering has been well studied in natural language processing. Recently, it has also been intensively studied in vision-language modeling. However, a systematic overview of prompt engineering on pre-trained vision-language models is still lacking. This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion). For each type of model, a brief model summary, prompting methods, prompting-based applications, and the corresponding responsibility and integrity issues are summarized and discussed. Furthermore, the commonalities and differences between prompting on vision-language models, language models, and vision models are also discussed. The challenges, future directions, and research opportunities are summarized to foster future research on this topic.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to systematically outline the research progress in prompt engineering on Vision-Language Foundation Models (VLMs). Specifically, the authors focus on three types of vision-language models:

1. **Multimodal-to-Text Generation Models** (e.g., Flamingo)
2. **Image-Text Matching Models** (e.g., CLIP)
3. **Text-to-Image Generation Models** (e.g., Stable Diffusion)

For each model type, the paper gives a brief model summary and then covers prompting methods, prompt-based applications, and the corresponding responsibility and integrity issues. The paper also discusses the commonalities and differences between prompt engineering on vision-language models, language models, and vision models, and summarizes challenges, future directions, and research opportunities to promote further research in this field.

### Overview of Main Content

- **Introduction to Prompt Engineering**: Prompt engineering is a technique that adapts pre-trained models to new tasks by supplying task-specific prompts as input. Prompts can be manually written natural language instructions, or they can be generated automatically as natural language instructions or vector representations.
- **Comparison of Traditional Paradigms and Prompt Engineering**: The traditional machine learning paradigm requires a large amount of labeled data, either to train a model from scratch or to fine-tune a pre-trained one. In contrast, prompt engineering requires only a small amount of labeled data to adapt to new tasks and does not require updating model parameters.
- **Classification of Prompt Methods**: Prompt methods are classified into hard prompts (readable natural language prompts) and soft prompts (continuous vector representations). Hard prompts are further divided into task instructions, in-context learning, retrieval-based prompts, and chain-of-thought prompts.
- **Model Fusion Modules**: The paper introduces two common fusion modules in vision-language models: the encoder-decoder structure and the decoder-only structure.
- **Applications of Prompt Methods**: The paper discusses in detail the prompt methods and their applications in different models, including the specific practices of models such as VL-T5, SimVLM, and OFA.
- **Challenges and Future Directions**: The paper summarizes the current challenges faced by prompt engineering and proposes future research directions.
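
The distinction between hard and soft prompts described above can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: every function name, the prompt wording, and the toy embedding size are assumptions, and the soft prompt is just a list of random vectors standing in for parameters that a real system would learn while the model itself stays frozen.

```python
import random


def task_instruction(question):
    """Hard prompt: a plain natural-language instruction (wording is hypothetical)."""
    return f"Look at the image and answer the question.\nQuestion: {question}\nAnswer:"


def in_context(examples, question):
    """Hard prompt: a few solved (question, answer) pairs placed before the query."""
    shots = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in examples)
    return f"{shots}Question: {question}\nAnswer:"


def chain_of_thought(question):
    """Hard prompt: a cue asking the model to reason step by step before answering."""
    return f"Question: {question}\nAnswer: Let's think step by step."


def init_soft_prompt(num_tokens, dim, seed=0):
    """Soft prompt: num_tokens continuous vectors of size dim.

    Values are drawn from a small Gaussian here; in practice they would be
    trained by gradient descent (assumption: no training is shown in this sketch).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_tokens)]


def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Place the soft prompt vectors in front of the input token embeddings."""
    return soft_prompt + token_embeddings


if __name__ == "__main__":
    print(in_context([("How many dogs are there?", "2")], "How many cats are there?"))
    soft = init_soft_prompt(num_tokens=4, dim=8)
    inputs = [[0.0] * 8 for _ in range(3)]  # stand-in for three embedded input tokens
    print(len(prepend_soft_prompt(soft, inputs)))  # sequence length after prepending
```

Hard prompts stay human-readable and can be edited by hand. Soft prompts trade that readability for vectors that can be optimized directly.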

A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models
