Abstract: We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation.

What problem does this paper attempt to address?

The paper addresses the problem of generating "Illustrated Instructions": visual guides customized to a user's needs. The authors identify the requirements unique to this task and measure the validity, consistency, and efficacy of the generated content through a suite of automatic and human evaluation metrics. They propose a method called StackedDiffusion, which combines large language models (LLMs) with text-to-image diffusion models to create these illustrated instructions.

### Main Issues

1. **Generating Illustrated Instructions**: Existing LLMs can generate textual instructions but cannot produce visual content. This is a significant limitation for tasks that depend on visual inspection, such as cooking and repair work. The goal of the paper is a method that generates not only textual instructions but also matching images.
2. **Improving the Quality of Generated Content**: The generated content needs to meet three main requirements:
   - **Goal Faithfulness**: The generated images should be relevant to the user's goal.
   - **Step Faithfulness**: The generated images should accurately reflect the content of each step.
   - **Cross-Image Consistency**: The generated images should be consistent with one another, with no discrepancies between steps.

### Solution

The authors propose the StackedDiffusion model, which combines large language models and text-to-image diffusion models to meet the requirements above through the following techniques:

- **Spatial Tiling**: Tiles the latent representations of multiple images spatially so that all images are generated simultaneously, which keeps them consistent with one another.
- **Text Embedding Concatenation**: Concatenates the embeddings of the goal text and the step texts, which reduces the information lost when encoding long text.
- **Step-Positional Encoding**: Adds a positional encoding to each step so that different steps can be told apart.

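
The three techniques listed under Solution can be illustrated with a small, framework-free sketch. This is not the authors' implementation: the vertical tiling layout, the sinusoidal form of the step encoding, and the helper names `tile_latents`, `untile_latents`, `step_encoding`, and `build_condition` are all assumptions made for illustration, and plain nested lists stand in for tensors.

```python
import math

def tile_latents(latents):
    """Stack N latents, each shaped [C][H][W], into one [C][N*H][W] canvas.

    Vertical stacking is an assumed layout; the summary only says the
    latents are tiled spatially.
    """
    num_channels = len(latents[0])
    return [
        [row for latent in latents for row in latent[c]]
        for c in range(num_channels)
    ]

def untile_latents(canvas, num_steps):
    """Split a [C][N*H][W] canvas back into N latents shaped [C][H][W]."""
    height = len(canvas[0]) // num_steps
    return [
        [channel[i * height:(i + 1) * height] for channel in canvas]
        for i in range(num_steps)
    ]

def step_encoding(step_index, dim):
    """Sinusoidal code for one step index (an assumed choice of encoding)."""
    code = []
    for k in range(dim):
        # Even and odd dimensions share a frequency; even uses sin, odd uses cos.
        freq = 1.0 / (10000 ** ((k - k % 2) / dim))
        angle = step_index * freq
        code.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return code

def build_condition(goal_tokens, steps_tokens):
    """Concatenate goal tokens with step tokens tagged by their step index.

    goal_tokens: [Tg][D]; steps_tokens: [N][Ts][D]. Returns [Tg + N*Ts][D].
    """
    dim = len(goal_tokens[0])
    condition = [list(tok) for tok in goal_tokens]
    for i, tokens in enumerate(steps_tokens):
        code = step_encoding(i, dim)
        for tok in tokens:
            # Add the step code to every token embedding of step i.
            condition.append([t + p for t, p in zip(tok, code)])
    return condition
```

In this sketch, a diffusion model would denoise the single tiled canvas while attending to the concatenated condition sequence, and `untile_latents` would then recover one latent per step for decoding. Because every step shares one canvas, the denoiser sees all images at once, which is the intuition behind the cross-image consistency claim.
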
### Experimental Results

- **Performance Comparison**: StackedDiffusion strongly outperforms baseline approaches and state-of-the-art multimodal LLMs on the evaluation metrics, including goal faithfulness, step faithfulness, and cross-image consistency.
- **Human Evaluation**: In human evaluations, StackedDiffusion is preferred over the other methods, and in 30% of cases users even prefer its output to human-generated articles.

### Contributions

1. **Introducing a New Task**: Defines the new task of "Illustrated Instructions" and proposes corresponding evaluation metrics.
2. **Proposing a New Method**: Introduces the StackedDiffusion model, which generates high-quality illustrated instructions without adding extra parameters.
3. **Experimental Validation**: Validates the effectiveness of StackedDiffusion through extensive experiments, demonstrating superior performance on multiple metrics.
4. **New Applications**: Showcases new capabilities of StackedDiffusion, including personalized guidance, goal suggestions, and error correction, which go far beyond what static articles can provide.
