Abstract: Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs, all core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains a notable lack of coding benchmarks that rigorously assess these models, particularly on tasks that emphasize visual reasoning. To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs' visual understanding and reasoning capabilities through code generation. HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problem, with visual elements redrawn to ensure distinction from the source and prevent potential data leakage. LMMs are required to complete the code solution based on the provided visual context and a predefined Python function signature outlining the task requirements. Every task is equipped with meticulously handcrafted test cases to ensure a thorough and reliable evaluation of model-generated solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1. Ablation studies further reveal the limitations of current LMMs in visual reasoning and coding capabilities. These results underscore key areas for future research to enhance LMMs' capabilities.
We have open-sourced our code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
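The pass@1 and pass@10 figures quoted in the abstract are pass@k scores. The abstract does not say how they are computed, so the following is only a minimal sketch, assuming the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k), where n solutions are sampled per task and c of them pass all test cases. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and are not taken from the HumanEval-V repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task (assumed estimator, not confirmed by the source).

    n: number of sampled solutions for the task
    c: number of those samples that pass every test case
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains
    # no passing sample; comb() returns 0 when k > n - c.
    prob_all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - prob_all_fail


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task pass@k over (n, c) pairs, one pair per task."""
    if not results:
        raise ValueError("results must contain at least one task")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three tasks, 20 samples each.
    demo = [(20, 0), (20, 5), (20, 20)]
    print(f"pass@1  = {benchmark_pass_at_k(demo, 1):.3f}")
    print(f"pass@10 = {benchmark_pass_at_k(demo, 10):.3f}")
```

Under this assumption, pass@1 reduces to the fraction of samples that pass (c / n) averaged over tasks, which is why pass@10 can be much higher than pass@1 for the same model.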

Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering

MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems

Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation?

IDEA-Bench: How Far are Generative Models from Professional Designing?

DesignProbe: A Graphic Design Benchmark for Multimodal Large Language Models

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

Benchmarking Language Model Creativity: A Case Study on Code Generation

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback

WebApp1K: A Practical Code-Generation Benchmark for Web App Development

VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs

DesignQA: A Multimodal Benchmark for Evaluating Large Language Models' Understanding of Engineering Documentation

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review