Abstract: Translating natural language to visualization (NL2VIS) has shown great promise for visual data analysis, but it remains a challenging task that requires multiple low-level implementations, such as natural language processing and visualization design. Recent advancements in pre-trained large language models (LLMs) are opening new avenues for generating visualizations from natural language. However, the lack of a comprehensive and reliable benchmark hinders our understanding of LLMs' capabilities in visualization generation. In this paper, we address this gap by proposing a new NL2VIS benchmark called VisEval. Firstly, we introduce a high-quality and large-scale dataset. This dataset includes 2,524 representative queries covering 146 databases, paired with accurately labeled ground truths. Secondly, we advocate for a comprehensive automated evaluation methodology covering multiple dimensions, including validity, legality, and readability. By systematically scanning for potential issues with a number of heterogeneous checkers, VisEval provides reliable and trustworthy evaluation outcomes. We run VisEval on a series of state-of-the-art LLMs. Our evaluation reveals prevalent challenges and delivers essential insights for future advancements.

What problem does this paper attempt to address?

The paper addresses the lack of a comprehensive and reliable benchmark for evaluating how well large language models (LLMs) generate visualizations in the natural language to visualization (NL2VIS) task. Existing approaches fall short in the quality and scale of their datasets, the comprehensiveness of their evaluation metrics, and the reliability of their evaluation methods, which impedes an in-depth understanding of LLM capabilities in visualization generation. To fill this gap, the authors propose a new NL2VIS benchmark, VisEval. It combines a high-quality, large-scale dataset with a comprehensive automated evaluation framework that systematically scans generated visualizations for validity, legality, and readability issues, providing reliable and trustworthy evaluation results. Specifically:

1. **Constructing a high-quality, large-scale dataset**: VisEval contains 2,524 representative natural-language queries covering 146 databases, paired with accurately annotated ground truths. The dataset has been strictly screened and expert-reviewed to ensure that the queries are clear, reasonable, and non-duplicated.
2. **Multi-dimensional evaluation framework**:
   - **Validity**: Check whether the generated code can successfully render the visualization.
   - **Legality**: Verify whether the generated visualization meets the query requirements, including chart type, data mapping, etc.
   - **Readability**: Evaluate whether the layout, color, font, etc. of the chart help convey information effectively.
3. **Automated, reliable evaluation**: Automated methods reduce the burden of manual evaluation and keep the evaluation objective and reliable. VisEval uses multiple checkers (such as code-execution checkers, chart-decomposition checkers, and readability evaluators) to systematically identify potential problems.

Through these measures, VisEval not only reveals common challenges and limitations of existing LLMs in NL2VIS tasks, but also provides valuable insights for future research and development.
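The three evaluation dimensions described above can be illustrated with a minimal Python sketch. This is not VisEval's actual implementation or API: the `Chart` class, the three `check_*` functions, the rule that generated code must assign a variable named `chart`, and the order in which the checks run are all assumptions made for illustration. A real harness would execute plotting code in a sandbox and inspect the rendered figure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chart:
    """Hypothetical, simplified stand-in for a rendered chart."""
    chart_type: str
    x: list
    y: list
    x_label: str = ""
    y_label: str = ""


@dataclass
class Result:
    valid: bool = False
    legal: bool = False
    readable: bool = False
    issues: list = field(default_factory=list)


def check_validity(code: str) -> Optional[Chart]:
    """Validity: does the generated code run and produce a chart?"""
    scope = {"Chart": Chart}
    try:
        exec(code, scope)  # assumption: a real harness would sandbox this
    except Exception:
        return None
    chart = scope.get("chart")
    return chart if isinstance(chart, Chart) else None


def check_legality(chart: Chart, truth: Chart) -> list:
    """Legality: do chart type and plotted data match the ground truth?"""
    issues = []
    if chart.chart_type != truth.chart_type:
        issues.append("wrong chart type")
    # Compare (x, y) pairs ignoring order; sort order is not checked here.
    if sorted(zip(chart.x, chart.y)) != sorted(zip(truth.x, truth.y)):
        issues.append("wrong data")
    return issues


def check_readability(chart: Chart) -> list:
    """Readability: a toy heuristic that only looks for axis labels."""
    if not chart.x_label or not chart.y_label:
        return ["missing axis label"]
    return []


def evaluate(code: str, truth: Chart) -> Result:
    """Run the checkers in sequence and collect every issue found."""
    result = Result()
    chart = check_validity(code)
    if chart is None:
        result.issues.append("code failed to produce a chart")
        return result
    result.valid = True
    legal_issues = check_legality(chart, truth)
    readable_issues = check_readability(chart)
    result.legal = not legal_issues
    result.readable = not readable_issues
    result.issues = legal_issues + readable_issues
    return result
```

Calling `evaluate(generated_code, truth)` returns a `Result` whose `issues` list names each problem found, which mirrors the idea of several independent checkers each reporting on one dimension.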

VisEval: A Benchmark for Data Visualization in the Era of Large Language Models

Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study

Visualization Generation with Large Language Models: An Evaluation

nvBench: A Large-Scale Synthesized Dataset for Cross-Domain Natural Language to Visualization Task

Synthesizing Natural Language to Visualization (NL2VIS) Benchmarks from NL2SQL Benchmarks

VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information

PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation

Prompt4Vis: Prompting Large Language Models with Example Mining and Schema Filtering for Tabular Data Visualization

Visualization Literacy of Multimodal Large Language Models: A Comparative Study

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

A Survey on Evaluation of Large Language Models

Generating Analytic Specifications for Data Visualization from Natural Language Queries using Large Language Models

VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

Promises and Pitfalls: Using Large Language Models to Generate Visualization Items

VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models

LogEval: A Comprehensive Benchmark Suite for Large Language Models In Log Analysis

What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases

LEVA: Using Large Language Models to Enhance Visual Analytics

LLMEval: A Preliminary Study on How to Evaluate Large Language Models