Abstract: We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at https://crfm.stanford.edu/helm/image2struct/v1.0.1/.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper "Image2Struct: Benchmarking Structure Extraction for Vision-Language Models" addresses the problem of evaluating how well vision-language models (VLMs) extract structure from images. Specifically, the authors propose a benchmark named **Image2Struct** to assess the ability of VLMs to generate the underlying structure (such as LaTeX or HTML code) from an image.

### Main Issues

1. **Diverse Real-World Applications**: Existing benchmarks often fail to capture the variety of real-world application scenarios, especially those that require generating complex structures.
2. **Automated Evaluation Methods**: Many existing benchmarks rely on manual evaluation, which is both expensive and time-consuming. Evaluation based on multiple-choice questions cannot represent many real-world applications, such as code generation.
3. **Continuous Data Stream**: Existing benchmark datasets are usually static and are not updated, so they fail to reflect recent data.

### Solution

The **Image2Struct** benchmark addresses these issues in the following ways:

1. **Covering Real-World Application Scenarios**: Image2Struct covers multiple real-world scenarios: converting webpage screenshots to HTML, converting images of mathematical formulas to LaTeX code, and converting images of musical scores to LilyPond code.
2. **Fully Automated Evaluation**: The entire evaluation process is automated and requires no human intervention. The structures generated by VLMs are rendered into images and then compared with the input images to produce similarity scores.
3. **Dynamic Data Stream**: Image2Struct downloads fresh data from active online communities each time it runs, which keeps the test data current and diverse.

### Evaluation Method

- **Three-Stage Process**:
  1. **Input Image**: Give the image to the VLM, which generates the structure (e.g., LaTeX code).
  2. **Rendering**: Use a domain-specific renderer (e.g., a TeX engine) to render the generated structure into an image.
  3. **Comparison**: Compare the rendered image with the input image to calculate a similarity score.
- **Similarity Metrics**: Five image similarity metrics are used: pixel similarity, cosine similarity between Inception vectors (CIS), Learned Perceptual Image Patch Similarity (LPIPS), Structural Similarity Index Measure (SSIM), and Earth Mover Similarity (EMS).

### Experimental Results

- **Model Performance**: 14 prominent VLMs were evaluated, and their scores varied widely. Overall, closed API models performed better than open-weight models.
- **Task Difficulty**: Difficulty varied across tasks. For example, GPT-4 Omni performed best on webpage and LaTeX tasks but poorly on musical score tasks, and the best score ranged from 0.402 on sheet music to 0.830 on LaTeX equations.
- **Room for Improvement**: Although some models performed well on certain tasks, all models still have significant room for improvement.

### Conclusion

Image2Struct provides an automated benchmarking framework for evaluating how well VLMs extract structure from images. By covering diverse real-world scenarios and drawing on a renewable data stream, Image2Struct offers a useful reference for the research and development of VLMs.
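
The three-stage round-trip evaluation summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the names `round_trip_score`, `pixel_similarity`, `toy_render`, `perfect_model`, and `noisy_model` are hypothetical, images are represented as plain nested lists of grayscale values, pixel similarity is assumed to be the fraction of matching pixels, and a structure that fails to render is assumed to score zero.

```python
from typing import Callable, List

# Illustrative image type: rows of 0-255 grayscale intensities.
Image = List[List[int]]


def pixel_similarity(a: Image, b: Image, tolerance: int = 0) -> float:
    """Fraction of pixel positions whose intensities differ by at most `tolerance`.

    Assumed definition for illustration. Positions that exist in only one
    image (size mismatch) count as mismatches.
    """
    height = max(len(a), len(b))
    width = max(
        max((len(row) for row in a), default=0),
        max((len(row) for row in b), default=0),
    )
    if height == 0 or width == 0:
        return 0.0
    matches = 0
    for y in range(height):
        for x in range(width):
            in_a = y < len(a) and x < len(a[y])
            in_b = y < len(b) and x < len(b[y])
            if in_a and in_b and abs(a[y][x] - b[y][x]) <= tolerance:
                matches += 1
    return matches / (height * width)


def round_trip_score(
    input_image: Image,
    generate: Callable[[Image], str],
    render: Callable[[str], Image],
    metric: Callable[[Image, Image], float] = pixel_similarity,
) -> float:
    """Stage 1: generate a structure; stage 2: render it; stage 3: compare images."""
    structure = generate(input_image)
    try:
        output_image = render(structure)
    except Exception:
        # Assumption: a structure that cannot be rendered receives the lowest score.
        return 0.0
    return metric(input_image, output_image)


# --- Toy stand-ins for a VLM and a renderer (hypothetical) ---

def toy_render(structure: str) -> Image:
    """'Renders' whitespace-separated integers, one image row per text line."""
    return [[int(v) for v in line.split()] for line in structure.splitlines() if line.strip()]


def perfect_model(image: Image) -> str:
    """Serializes the image exactly, so the round trip reproduces the input."""
    return "\n".join(" ".join(str(v) for v in row) for row in image)


def noisy_model(image: Image) -> str:
    """Like perfect_model, but inverts the top-left pixel."""
    rows = [list(row) for row in image]
    rows[0][0] = 255 - rows[0][0]
    return perfect_model(rows)


target = [[0, 255], [255, 0]]
print(round_trip_score(target, perfect_model, toy_render))
print(round_trip_score(target, noisy_model, toy_render))
print(round_trip_score(target, lambda image: "not an image", toy_render))
```

Because the score depends only on the rendered image, two different structures that render identically receive the same score, which is what lets this style of evaluation handle tasks with multiple valid outputs. Swapping in another metric only requires passing a different `metric` callable.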

Image2Struct: Benchmarking Structure Extraction for Vision-Language Models
