Abstract: Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs -- core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains a notable lack of coding benchmarks that rigorously assess these models, particularly in tasks that emphasize visual reasoning. To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs' visual understanding and reasoning capabilities through code generation. HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problems, with visual elements redrawn to ensure distinction from the source, preventing potential data leakage. LMMs are required to complete the code solution based on the provided visual context and a predefined Python function signature outlining the task requirements. Every task is equipped with meticulously handcrafted test cases to ensure a thorough and reliable evaluation of model-generated solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1. Ablation studies further reveal the limitations of current LMMs in vision reasoning and coding capabilities. These results underscore key areas for future research to enhance LMMs' capabilities.
We have open-sourced our code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
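
The task format described in the abstract (an image, a predefined Python function signature, and handcrafted test cases) can be illustrated with a minimal sketch. The `VisualCodingTask` fields and the `passes_tests` helper are hypothetical names chosen for illustration; they are not the benchmark's actual schema or harness, and the toy task is invented rather than taken from the 108 tasks.

```python
from dataclasses import dataclass


@dataclass
class VisualCodingTask:
    """Hypothetical container for one task; field names are assumptions."""
    image_path: str   # the visual context shown to the model
    signature: str    # the predefined Python function signature
    test_code: str    # handcrafted assertions that exercise the function


def passes_tests(task: VisualCodingTask, completion: str) -> bool:
    """Return True if the model's completed function passes every assertion.

    Illustrative only: there is no sandboxing and no timeout here, both of
    which a real harness for untrusted model output would need.
    """
    namespace: dict = {}
    try:
        exec(completion, namespace)      # define the generated function
        exec(task.test_code, namespace)  # run the assertions against it
    except Exception:                    # syntax, runtime, or assertion failure
        return False
    return True


# Invented toy task, not one of the benchmark's tasks.
toy_task = VisualCodingTask(
    image_path="toy.png",
    signature="def area(w: int, h: int) -> int:",
    test_code="assert area(2, 3) == 6",
)
```

A completion counts as correct only when every assertion passes, so a function that runs without error but returns the wrong value is scored the same as one that fails to parse.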

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to fill a gap in the evaluation of large multimodal models (LMMs) on coding tasks. Despite significant progress in visual perception and understanding, there is currently a lack of coding benchmarks that rigorously evaluate these models on tasks emphasizing visual reasoning. To address this issue, the authors introduce **HumanEval-V**, a novel and lightweight benchmark specifically designed to evaluate the visual understanding and reasoning capabilities of LMMs through code generation.

### Specific problem descriptions

1. **Deficiencies of existing benchmarks**:
   - Current multimodal benchmarks mainly focus on multiple-choice or open-ended questions in common-sense reasoning, ignoring more complex reasoning scenarios such as coding tasks.
   - Coding tasks are very valuable for evaluating the complex reasoning capabilities of LLMs, but existing multimodal benchmarks fail to fully cover this area.
2. **Importance of visual information**:
   - In HumanEval-V, visual information plays a crucial role in solving coding tasks. Each task contains an image input, a Python function signature, and a problem description.
   - The model must complete the code solution according to the provided visual context and function signature, and the solution is verified by predefined test cases.
3. **Evaluating the capabilities of models**:
   - The authors evaluated 19 state-of-the-art LMMs using HumanEval-V, revealing significant challenges for current models in visual reasoning and coding.
   - The experimental results show that even leading proprietary models (such as GPT-4o and Claude 3.5 Sonnet) fall well short of expectations on HumanEval-V, and open-source models perform even worse.

### Main findings

1. **Performance gap**:
   - The pass@1 score of proprietary models (such as GPT-4o) on HumanEval-V is only 13%, and even pass@10 is only 36.4%.
   - Open-source models perform worse. Among models with parameter counts ranging from 4B to 76B, none has a pass@1 exceeding 4%.
2. **Hallucination errors caused by overfitting**:
   - Many models rely on the context of the original problem when generating solutions, rather than on the adapted version of the task in the benchmark. For example, GPT-4o and Claude 3.5 Sonnet wrongly assume that the numbers in the figure are arranged clockwise and that line segments can only intersect inside the circle.
3. **Relationship between parameter scale and performance**:
   - A larger parameter scale does not guarantee better performance. For example, some smaller models (such as Phi-3-Vision and InternVL-2) perform better than larger models on some tasks.
4. **Unique weaknesses**:
   - HumanEval-V exposes weaknesses of current LMMs that are specific to visual reasoning and coding tasks. These models may perform well on other multimodal benchmarks but perform poorly on HumanEval-V.

### Conclusion

HumanEval-V provides a new benchmark for evaluating the visual understanding and reasoning capabilities of LMMs through coding tasks. The experimental results show that current LMMs still face significant challenges in this area, and future research needs to further enhance the visual reasoning and coding capabilities of these models.
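
The pass@1 and pass@10 figures quoted in this summary can be read against the following sketch. The text here does not state how pass@k is computed, so this assumes the unbiased estimator commonly used with HumanEval-style benchmarks, $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ samples are generated per task and $c$ of them pass all tests. The function names are illustrative, not taken from the benchmark's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct,
    given that c of the n generated samples passed all tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this estimator pass@1 reduces to the plain fraction of correct samples, c / n, while pass@10 credits a task whenever any of ten sampled solutions would succeed, which is why it can be much higher than pass@1 for the same model.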

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems

ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

MHPP: Exploring the Capabilities and Limitations of Language Models Beyond Basic Code Generation

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

VLM-Eval: A General Evaluation on Video Large Language Models

MMBench: Is Your Multi-modal Model an All-around Player?

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation

A Survey on Benchmarks of Multimodal Large Language Models