Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

Sukmin Yun,Haokun Lin,Rusiru Thushara,Mohammad Qazim Bhat,Yongxin Wang,Zutao Jiang,Mingkai Deng,Jinhong Wang,Tianhua Tao,Junbo Li,Haonan Li,Preslav Nakov,Timothy Baldwin,Zhengzhong Liu,Eric P. Xing,Xiaodan Liang,Zhiqiang Shen

2024-06-29

Abstract:Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at <a class="link-external link-https" href="https://github.com/MBZUAI-LLM/web2code" rel="external noopener nofollow">this https URL</a>.

Computer Vision and Pattern Recognition,Artificial Intelligence,Computation and Language

What problem does this paper attempt to address?

This paper introduces a large-scale dataset and evaluation framework called Web2Code, aiming to address the insufficient ability of current multimodal large language models (MLLMs) in understanding and generating web screenshots and their corresponding HTML codes. Existing MLLMs perform poorly when dealing with web screenshots and fail to accurately generate HTML codes representing the webpage states, which limits applications such as UI prototype design, automated agents, and accessibility. To fill this gap, Web2Code provides a new large-scale dataset and evaluation benchmarks, including the Web Understanding Benchmark (WUB) and the Web Code Generation Benchmark (WCGB). The dataset is augmented and generated using GPT-3.5 and GPT-4, and it consists of pairs of web images, instructions, HTML codes, and question-answer pairs about the webpage content to facilitate models' comprehensive understanding of web information. In addition, Web2Code proposes a novel evaluation method that compares the generated webpage images with the original screenshots to assess the models' performance in web understanding and code generation tasks. Experiments show that fine-tuning using Web2Code not only improves the models' translation ability from images to HTML codes but also enhances their perception and reasoning abilities in general tasks. The paper also compares different datasets and models, demonstrating the beneficial effects of the Web2Code dataset in enhancing models' web-related capabilities and task automation without compromising their performance in other domains. In summary, the paper aims to enhance the ability of multimodal large language models in understanding and generating web content, particularly HTML codes, to facilitate broader task automation and web-based content generation.

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems

Harnessing Webpage UIs for Text-Rich Visual Understanding

Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

McEval: Massively Multilingual Code Evaluation

VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Understanding HTML with Large Language Models

CodeT5+: Open Code Large Language Models for Code Understanding and Generation

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

A Survey on Benchmarks of Multimodal Large Language Models