Generating Illustrated Instructions

Sachit Menon, Ishan Misra, Rohit Girdhar
2024-04-13
Abstract: We introduce the new task of generating Illustrated Instructions, i.e., visual instructions customized to a user's needs. We identify desiderata unique to this task, and formalize it through a suite of automatic and human evaluation metrics, designed to measure the validity, consistency, and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion, which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases, users even prefer it to human-generated articles. Most notably, it enables various new and exciting applications far beyond what static articles on the web can provide, such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia
What problem does this paper attempt to address?
The paper addresses the task of generating "Illustrated Instructions": visual guides customized to a user's needs. The authors identify the desiderata unique to this task and formalize it through a suite of automatic and human evaluation metrics that measure the validity, consistency, and efficacy of the generated content. They propose a method called StackedDiffusion, which combines large language models (LLMs) with text-to-image diffusion models to produce these illustrated instructions from text input.

### Main Issues

1. **Generating Illustrated Instructions**: Existing LLMs can generate textual instructions but cannot produce visual content. This is a significant limitation for tasks that benefit from visual guidance, such as cooking and repairs. The goal of the paper is a method that generates both the textual instructions and images that match them.
2. **Improving the Quality of Generated Content**: The generated content needs to meet three main requirements:
   - **Goal Faithfulness**: The generated images should be relevant to the user's goal.
   - **Step Faithfulness**: Each generated image should accurately reflect the content of its step.
   - **Cross-Image Consistency**: The images generated for one set of instructions should be consistent with each other.

### Solution

The authors propose the StackedDiffusion model, which pairs an LLM with a text-to-image diffusion model and targets the requirements above through three techniques:

- **Spatial Tiling**: Tiles the latent representations of the step images spatially so that all images are generated simultaneously, which encourages cross-image consistency.
- **Text Embedding Concatenation**: Concatenates the embeddings of the goal text and the step texts to reduce the information lost when encoding long text.
- **Step-Positional Encoding**: Adds a positional encoding to each step so that the model can better distinguish one step from another.

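
The three techniques above can be illustrated with a minimal sketch that uses NumPy arrays in place of real diffusion latents and text-encoder outputs. The function names, tensor shapes, side-by-side tiling layout, and sinusoidal form of the step encoding are all assumptions made for illustration; the summary does not specify them, and this is not the authors' implementation.

```python
import numpy as np


def tile_latents(latents):
    """Spatial tiling (assumed layout): place N per-step latents of shape
    (N, C, H, W) side by side on one canvas of shape (C, H, N * W)."""
    n, c, h, w = latents.shape
    return latents.transpose(1, 2, 0, 3).reshape(c, h, n * w)


def untile_latents(canvas, num_steps):
    """Inverse of tile_latents: split the canvas back into per-step latents."""
    c, h, total_w = canvas.shape
    w = total_w // num_steps
    return canvas.reshape(c, h, num_steps, w).transpose(2, 0, 1, 3)


def step_positional_encoding(num_steps, dim):
    """Hypothetical sinusoidal encoding over the step index (dim must be even)."""
    positions = np.arange(num_steps)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    enc = np.zeros((num_steps, dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc


def build_condition(goal_emb, step_embs):
    """Text embedding concatenation with a step-positional encoding.

    goal_emb:  (T_goal, D) token embeddings of the goal text.
    step_embs: (N, T_step, D) token embeddings of each step text.
    Returns a single conditioning sequence of shape (T_goal + N * T_step, D).
    """
    n, t_step, d = step_embs.shape
    # Every token of step i receives the same offset, marking which step it belongs to.
    tagged = step_embs + step_positional_encoding(n, d)[:, None, :]
    return np.concatenate([goal_emb, tagged.reshape(n * t_step, d)], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(4, 4, 8, 8))   # 4 steps, toy latent size
    canvas = tile_latents(latents)            # one canvas denoised jointly
    cond = build_condition(rng.normal(size=(5, 16)), rng.normal(size=(4, 7, 16)))
    print(canvas.shape, cond.shape)
```

In this sketch the tiled canvas is what a single diffusion pass would operate on, so every step image shares one denoising process, and `untile_latents` recovers the individual images afterwards.
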
### Experimental Results

- **Performance Comparison**: StackedDiffusion strongly outperforms baseline approaches and state-of-the-art multimodal LLMs on the evaluation metrics, including goal faithfulness, step faithfulness, and cross-image consistency.
- **Human Evaluation**: Human evaluators prefer StackedDiffusion over the other methods, and in 30% of cases they even prefer its output to human-generated articles.

### Contributions

1. **Introducing a New Task**: Defines the new task of "Illustrated Instructions" and proposes corresponding evaluation metrics.
2. **Proposing a New Method**: Introduces the StackedDiffusion model, which generates high-quality illustrated instructions without adding extra parameters.
3. **Experimental Validation**: Validates StackedDiffusion through extensive experiments, showing that it outperforms the compared methods on multiple metrics.
4. **New Applications**: Showcases new capabilities of StackedDiffusion, including personalized guidance, goal suggestions, and error correction, which go far beyond what static articles can offer.