Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo
DOI: https://doi.org/10.48550/arXiv.2405.07990
2024-05-14
Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figures into executable code have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We carefully collect 132 manually selected high-quality matplotlib plots across six plot types from publicly available matplotlib galleries. For each plot, we carefully offer its source code and a descriptive instruction summarized by GPT-4. This approach enables Plot2Code to extensively evaluate MLLMs' code capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images. Instead of simply judging pass or fail, we employ GPT-4V to make an overall judgement between the generated and reference images, which has been shown to be consistent with human evaluation. The evaluation results, which include analyses of 14 MLLMs such as the proprietary GPT-4V, Gemini-Pro, and the open-sourced Mini-Gemini, highlight the substantial challenges presented by Plot2Code. With Plot2Code, we reveal that most existing MLLMs struggle with visual coding for text-dense plots, heavily relying on textual instruction. We hope that the evaluation results from Plot2Code on visual coding will guide the future development of MLLMs. All data involved with Plot2Code are available at
Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper introduces Plot2Code, a comprehensive benchmark for evaluating the ability of multi-modal large language models (MLLMs) to generate code from scientific plots. These models have shown significant improvements in visual understanding, but their ability to convert plots into executable code has not been thoroughly evaluated. Plot2Code consists of 132 manually selected matplotlib plots covering six plot types, each accompanied by its source code and a descriptive instruction summarized by GPT-4. The models' output code and rendered images are evaluated in a fine-grained manner using three automatic metrics: code pass rate, text-match ratio, and a GPT-4V overall rating that compares the generated image with the reference image. Across the 14 MLLMs evaluated, the study finds that most struggle with text-dense plots and rely heavily on textual instructions. Plot2Code aims to guide the future development of MLLMs in visual coding.
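To make two of the three metrics more concrete, the following is a minimal Python sketch of how a code pass rate and a text-match ratio could be computed. It is an illustration only, not the paper's implementation: the function names (`code_passes`, `pass_rate`, `text_match_ratio`), the 30-second timeout, and the multiset-overlap definition of the text-match ratio are all assumptions made here. The GPT-4V overall rating is omitted because it depends on a model call that the abstract does not specify.

```python
import subprocess
import sys
from collections import Counter


def code_passes(source: str, timeout: float = 30.0) -> bool:
    """Return True if `source` runs to completion in a fresh interpreter.

    Hypothetical helper: running each candidate in a subprocess keeps a
    crash or infinite loop in generated code from taking down the evaluator.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def pass_rate(sources: list[str]) -> float:
    """Fraction of generated programs that execute without error."""
    if not sources:
        return 0.0
    return sum(code_passes(s) for s in sources) / len(sources)


def text_match_ratio(reference_texts: list[str], generated_texts: list[str]) -> float:
    """Fraction of reference text elements also present in the generated plot.

    Assumed definition: multiset overlap between the text strings of the
    reference figure (titles, axis labels, tick labels, ...) and those of
    the generated figure, divided by the number of reference strings.
    """
    ref = Counter(reference_texts)
    gen = Counter(generated_texts)
    if not ref:
        return 1.0
    matched = sum((ref & gen).values())
    return matched / sum(ref.values())


if __name__ == "__main__":
    candidates = ["x = 1 + 1", "raise ValueError('boom')"]
    print(pass_rate(candidates))
    print(text_match_ratio(["Time (s)", "Voltage", "Title"],
                           ["Time (s)", "Title", "extra"]))
```

In practice the text strings would be extracted from the rendered figures (for example, from the text objects of a matplotlib figure); that extraction step is left out here to keep the sketch self-contained.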