Abstract: Coding tasks have been valuable for evaluating Large Language Models (LLMs), as they demand the comprehension of high-level instructions, complex reasoning, and the implementation of functional programs -- core capabilities for advancing Artificial General Intelligence. Despite the progress in Large Multimodal Models (LMMs), which extend LLMs with visual perception and understanding capabilities, there remains a notable lack of coding benchmarks that rigorously assess these models, particularly in tasks that emphasize visual reasoning. To address this gap, we introduce HumanEval-V, a novel and lightweight benchmark specifically designed to evaluate LMMs' visual understanding and reasoning capabilities through code generation. HumanEval-V includes 108 carefully crafted, entry-level Python coding tasks derived from platforms like CodeForces and Stack Overflow. Each task is adapted by modifying the context and algorithmic patterns of the original problem, and its visual elements are redrawn so that it differs from the source, which guards against potential data leakage. LMMs are required to complete the code solution based on the provided visual context and a predefined Python function signature outlining the task requirements. Every task is equipped with meticulously handcrafted test cases to ensure a thorough and reliable evaluation of model-generated solutions. We evaluate 19 state-of-the-art LMMs using HumanEval-V, uncovering significant challenges. Proprietary models like GPT-4o achieve only 13% pass@1 and 36.4% pass@10, while open-weight models with 70B parameters score below 4% pass@1. Ablation studies further reveal the limitations of current LMMs in visual reasoning and coding capabilities. These results underscore key areas for future research to enhance LMMs' capabilities.
We have open-sourced our code and benchmark at https://github.com/HumanEval-V/HumanEval-V-Benchmark.
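
The pass@1 and pass@10 figures quoted above are instances of the pass@k metric. The abstract does not say how pass@k is computed for HumanEval-V, so the following is only a minimal sketch that assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n solutions are sampled per task and c of them pass every test case. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from the benchmark's repository.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task (assumed estimator, not confirmed by the source).

    n: number of solutions sampled for the task
    c: number of those solutions that pass all test cases
    k: sample budget being scored
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k failing samples, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (n, c) pairs, one pair per task."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must contain at least one task")
    return sum(scores) / len(scores)
```

Under this assumption, a task with 20 samples of which 5 pass contributes 0.25 to pass@1, and the benchmark-level score is the mean of the per-task estimates.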

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

MMBench: Is Your Multi-modal Model an All-around Player?

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Are We on the Right Way for Evaluating Large Vision-Language Models?

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

MMR: Evaluating Reading Ability of Large Multimodal Models

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

A Survey on Benchmarks of Multimodal Large Language Models

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning