DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design

Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Lijuan Wang
DOI: https://doi.org/10.48550/arXiv.2310.15144
2023-10-24
Abstract: We introduce DEsignBench, a text-to-image (T2I) generation benchmark tailored for visual design scenarios. Recent T2I models, such as DALL-E 3, have demonstrated remarkable capabilities in generating photorealistic images that align closely with textual inputs. While the allure of creating visually captivating images is undeniable, our emphasis extends beyond mere aesthetic pleasure. We aim to investigate the potential of using these powerful models in authentic design contexts. In pursuit of this goal, we develop DEsignBench, which incorporates test samples designed to assess T2I models on both "design technical capability" and "design application scenario." Each of these two dimensions is supported by a diverse set of specific design categories. We explore DALL-E 3 together with other leading T2I models on DEsignBench, resulting in a comprehensive visual gallery for side-by-side comparisons. For DEsignBench benchmarking, we perform human evaluations on generated images in the DEsignBench gallery, against the criteria of image-text alignment, visual aesthetic, and design creativity. Our evaluation also considers other specialized design capabilities, including text rendering, layout composition, color harmony, 3D design, and medium style. In addition to human evaluations, we introduce the first automatic image generation evaluator powered by GPT-4V. This evaluator provides ratings that align well with human judgments, while being easily replicable and cost-efficient. A high-resolution version is available at https://github.com/design-bench/design-bench.github.io/raw/main/designbench.pdf?download=
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how advanced text-to-image (T2I) generation models, such as DALL-E 3, can assist in visual design. The authors focus on the application potential of these models in actual design scenarios, not just on generating visually pleasing images. To this end, they developed a new evaluation benchmark, DEsignBench, which evaluates T2I models on both their technical capabilities and their application scenarios in visual design.

### Main research questions:

1. **Design technical capabilities**: Evaluate the performance of T2I models on core design technical capabilities, including but not limited to:
   - **Text rendering and typesetting**: Can the model accurately render text, especially in complex scenarios?
   - **Layout and composition**: Can the model effectively handle multi-panel layouts, charts, calendars, and other design elements?
   - **Color harmony**: Can the model generate designs that conform to color theory?
   - **Medium and style**: Can the model generate works in different media and artistic styles, such as sketches and 3D sculptures?
   - **3D and cinematography**: Can the model generate images with realistic 3D effects and cinematography techniques?
2. **Design application scenarios**: Evaluate the application capabilities of T2I models in actual design scenarios, including but not limited to:
   - **Infographic design**: Such as storybooks, posters, and menus.
   - **Animation / game design**: Such as movie scenes, comic strips, and game scenes.
   - **Product design**: Such as stickers, jewelry, and clothing.
   - **Visual art design**: Such as 3D sculptures, historical art, and time-travel themes.

### Research methods:

- **DEsignBench benchmark**: The authors constructed the DEsignBench benchmark, which contains 215 evaluation prompts, each marked with the corresponding design category label.
- **Generated image library**: Use the most advanced T2I models (such as SDXL, Midjourney, Ideogram, Firefly2, DALL-E 3) to generate images and organize them into a side-by-side comparison gallery.
- **Human evaluation**: Conduct human evaluation on the generated images, mainly based on three criteria: visual aesthetics, image-text alignment, and design creativity. Five other design-specific capabilities are also considered: text rendering, layout and composition, color harmony, 3D and cinematography, and medium and style.
- **Automatic evaluation**: Introduce an automatic evaluation system based on GPT-4V, which provides results that are highly consistent with human evaluation and is more cost-effective and reproducible.

### Contributions:

1. **Explore the application of DALL-E 3 in visual design**: Through the DEsignBench benchmark, evaluate the performance of DALL-E 3 in multiple design scenarios.
2. **Propose an automatic evaluation method based on GPT-4V**: Provide a cost-effective and reproducible automatic evaluation method whose results are highly consistent with human evaluation.
3. **Collect the DEsignBench gallery**: Display the images generated by different T2I models for side-by-side comparison.

Through these methods and contributions, the paper aims to promote the application of T2I models in actual design scenarios, enabling them not only to generate high-quality images but also to provide valuable assistance in the design process.
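
To make the evaluation setup above more concrete, here is a minimal Python sketch of how side-by-side judgments could be aggregated into per-criterion win rates, and how agreement between human and automatic judges could be measured. This is not the authors' code. It assumes judgments are stored as pairwise preferences, which the summary does not specify. The `Judgment` record, the `win_rates` and `agreement` helpers, the judge labels, and the model names `model_x` and `model_y` are all hypothetical, and the toy records are made up for illustration; they are not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three main criteria named in the summary.
CRITERIA = ("image-text alignment", "visual aesthetic", "design creativity")


@dataclass(frozen=True)
class Judgment:
    """One hypothetical side-by-side preference for a single prompt."""
    prompt_id: int
    criterion: str
    model_a: str
    model_b: str
    winner: str  # expected to equal model_a or model_b
    judge: str   # "human" or "gpt4v" (labels assumed for this sketch)


def win_rates(judgments, judge):
    """Return {(model, criterion): fraction of comparisons won} for one judge."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for j in judgments:
        if j.judge != judge:
            continue
        for model in (j.model_a, j.model_b):
            totals[(model, j.criterion)] += 1
        wins[(j.winner, j.criterion)] += 1
    return {key: wins[key] / totals[key] for key in totals}


def agreement(judgments):
    """Fraction of items judged by both judges where they pick the same winner."""
    by_item = defaultdict(dict)
    for j in judgments:
        item = (j.prompt_id, j.criterion, j.model_a, j.model_b)
        by_item[item][j.judge] = j.winner
    shared = [v for v in by_item.values() if "human" in v and "gpt4v" in v]
    if not shared:
        return None  # no overlapping items, so agreement is undefined
    return sum(v["human"] == v["gpt4v"] for v in shared) / len(shared)


# Toy, made-up records purely to exercise the helpers.
toy = [
    Judgment(1, "visual aesthetic", "model_x", "model_y", "model_x", "human"),
    Judgment(1, "visual aesthetic", "model_x", "model_y", "model_x", "gpt4v"),
    Judgment(2, "visual aesthetic", "model_x", "model_y", "model_y", "human"),
    Judgment(2, "visual aesthetic", "model_x", "model_y", "model_x", "gpt4v"),
]
human_rates = win_rates(toy, "human")
judge_agreement = agreement(toy)
```

Keying win rates by `(model, criterion)` keeps the three criteria separate, in line with the summary's description of evaluating each criterion on its own. A real study would likely also need tie handling and a chance-corrected agreement statistic, both of which this sketch leaves out.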