Image2Struct: Benchmarking Structure Extraction for Vision-Language Models

Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, Percy Liang
2024-10-30
Abstract: We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at https://crfm.stanford.edu/helm/image2struct/v1.0.1/.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper "Image2Struct: Benchmarking Structure Extraction for Vision-Language Models" addresses the problem of evaluating how well Vision-Language Models (VLMs) extract structure from images. Specifically, the authors propose a benchmark named **Image2Struct** that assesses the ability of VLMs to generate the underlying structure (such as LaTeX or HTML code) from an image.

### Main Issues

1. **Diverse Real-World Applications**: Existing benchmarks often fail to capture the variety of real-world use cases, especially those that require generating complex structures.
2. **Automated Evaluation Methods**: Many existing benchmarks rely on manual evaluation, which is expensive and time-consuming. Multiple-choice evaluation cannot represent many real-world applications, such as code generation.
3. **Continuous Data Stream**: Existing benchmark datasets are usually static and are not refreshed, so they do not reflect newly available data.

### Solution

The **Image2Struct** benchmark addresses these issues in the following ways:

1. **Covering Real-World Application Scenarios**: Image2Struct covers several real-world scenarios: converting webpage screenshots to HTML, converting images of mathematical formulas to LaTeX code, and converting images of musical scores to LilyPond code.
2. **Fully Automated Evaluation**: The entire evaluation process is automated and requires no human intervention. The structures generated by VLMs are rendered into images, which are then compared with the input images to produce similarity scores.
3. **Dynamic Data Stream**: Image2Struct downloads fresh data from active online communities each time the pipeline runs, which keeps the test data current and diverse.

### Evaluation Method

- **Three-Stage Process**:
  1. **Input Image**: Input the image into the VLM to generate the structure (e.g., LaTeX code).
  2. **Rendering**: Use a domain-specific renderer (e.g., a TeX engine) to render the generated structure into an image.
  3. **Comparison**: Compare the rendered image with the input image to calculate the similarity score.
- **Similarity Metrics**: Five image similarity metrics are used: pixel similarity, cosine similarity between Inception vectors (CIS), Learned Perceptual Image Patch Similarity (LPIPS), Structural Similarity Index Measure (SSIM), and Earth Mover Similarity (EMS).

### Experimental Results

- **Model Performance**: 14 prominent VLMs were evaluated, and their scores varied widely. Overall, closed API models performed better than open-weight models.
- **Task Difficulty**: Difficulty varied across tasks. The best score was 0.402 on sheet music versus 0.830 on LaTeX equations. For example, GPT-4 Omni performed best on webpage and LaTeX tasks but poorly on musical score tasks.
- **Room for Improvement**: Although some models performed well on certain tasks, all models leave significant room for improvement on these tasks.

### Conclusion

Image2Struct provides an automated benchmarking framework for evaluating how well VLMs extract structure from images. By covering diverse real-world scenarios and drawing on a renewable data stream, Image2Struct offers a useful reference for the research and development of VLMs.
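
The three-stage evaluation described above (generate, render, compare) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: `pixel_similarity`, `round_trip_score`, `toy_render`, and `toy_model` are hypothetical names, images are plain nested lists of grayscale values, the "structure" is a toy text format, and scoring a failed render as 0.0 is an assumed convention. The pixel metric shown is a simplified matching-fraction version and may differ from the definition used in the paper.

```python
from typing import Callable, List

# Assumption: a grayscale image is a list of rows of 0-255 integers.
Image = List[List[int]]


def pixel_similarity(a: Image, b: Image, tolerance: int = 0) -> float:
    """Fraction of pixel positions whose values differ by at most `tolerance`."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("images must not be empty")
    matches = sum(
        1
        for ra, rb in zip(a, b)
        for pa, pb in zip(ra, rb)
        if abs(pa - pb) <= tolerance
    )
    return matches / total


def round_trip_score(
    input_image: Image,
    generate: Callable[[Image], str],
    render: Callable[[str], Image],
    metric: Callable[[Image, Image], float],
) -> float:
    """Generate a structure, render it, and compare the result to the input."""
    structure = generate(input_image)      # stage 1: model produces the structure
    try:
        rendered = render(structure)       # stage 2: renderer turns it into an image
    except Exception:
        return 0.0                         # assumption: an unrenderable output scores 0
    return metric(input_image, rendered)   # stage 3: image-to-image similarity


def toy_render(structure: str) -> Image:
    """Hypothetical renderer: each text line is a row of space-separated pixels."""
    return [[int(v) for v in line.split()] for line in structure.splitlines()]


def toy_model(image: Image) -> str:
    """Hypothetical model: copies the image but gets the last pixel wrong."""
    rows = [row[:] for row in image]
    rows[-1][-1] = 255 - rows[-1][-1]
    return "\n".join(" ".join(str(v) for v in row) for row in rows)


if __name__ == "__main__":
    image = [[0, 255], [255, 0]]
    print(round_trip_score(image, toy_model, toy_render, pixel_similarity))
```

With the 2x2 toy image, the hypothetical model reproduces three of the four pixels, so the round-trip score is 0.75. Because the comparison happens in image space, any structure that renders to the same image receives the same score, which is what lets this style of evaluation handle tasks with multiple valid structures.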