Abstract: Large language models (LLMs) have achieved remarkable success in natural language processing (NLP), demonstrating significant capabilities in processing and understanding text data. However, recent studies have identified limitations in LLMs' ability to reason about graph-structured data. To address this gap, we introduce GraphEval2000, the first comprehensive graph dataset, comprising 40 graph data structure problems along with 2000 test cases. Additionally, we introduce an evaluation framework based on GraphEval2000, designed to assess the graph reasoning abilities of LLMs through coding challenges. Our dataset categorizes test cases into four primary and four sub-categories, ensuring a comprehensive evaluation. We evaluate eight popular LLMs on GraphEval2000, revealing that LLMs exhibit a better understanding of directed graphs compared to undirected ones. While private LLMs consistently outperform open-source models, the performance gap is narrowing. Furthermore, to improve the usability of our evaluation framework, we propose Structured Symbolic Decomposition (SSD), an instruction-based method designed to enhance LLM performance on GraphEval2000. Results show that SSD improves the performance of GPT-3.5, GPT-4, and GPT-4o on complex graph problems, with an increase of 11.11%, 33.37%, and 33.37%, respectively.

What problem does this paper attempt to address?

The paper addresses the limited ability of large language models (LLMs) to reason about graph-structured data. Although LLMs perform well on natural language processing (NLP) tasks, they struggle with complex graph structures and multi-step reasoning. Specifically:

1. **Limitations of Existing Research**: Current research shows that LLMs can handle basic graph-related queries but perform poorly on more complex graph structures and multi-step reasoning tasks.
2. **Lack of Evaluation Benchmarks**: Previously, there was no comprehensive benchmark for systematically evaluating the reasoning ability of LLMs on graph-structured data.

To address these problems, the paper introduces **GraphEval2000**, a dataset containing 40 graph data structure problems and 2,000 test cases. With this dataset, researchers can evaluate the performance of LLMs on graph reasoning tasks and reveal how that performance differs across types of graphs (such as sparse graphs, planar graphs, regular graphs, and complete graphs).

In addition, the paper proposes **Structured Symbolic Decomposition (SSD)**, an instruction-based method that aims to enhance the graph reasoning ability of LLMs by decomposing complex tasks into smaller symbolic subtasks. Experimental results show that SSD improves the performance of GPT-3.5, GPT-4, and GPT-4o on complex graph problems by 11.11%, 33.37%, and 33.37%, respectively.

### Main Contributions:

1. **Constructing the GraphEval2000 Dataset**: This is the first dataset specifically designed to evaluate the graph reasoning ability of LLMs, containing 40 graph data structure problems and 2,000 test cases.
2. **Proposing an Evaluation Framework**: Based on GraphEval2000, an evaluation framework with real-time feedback is provided to help users iteratively improve model performance.
3. **Establishing Benchmarks**: Eight popular LLMs were benchmarked, revealing their performance differences on different types of graph structures.
4. **Proposing the SSD Method**: By decomposing complex tasks into cognitive steps and action steps, SSD significantly improves the reasoning ability of LLMs on complex graph problems.

In summary, this paper aims to fill the gap in graph reasoning for LLMs and to provide tools and methods that improve the performance of these models on graph-structured data.
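To make the idea of evaluating graph reasoning through coding challenges more concrete, here is a minimal sketch of scoring a candidate solution against test cases grouped by graph category. It is an illustration only, not the GraphEval2000 framework itself: the test-case layout, the function names `count_components` and `pass_rate_by_category`, the chosen problem (counting connected components), and the four tiny example graphs are all assumptions made for this example.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List, Tuple

Edge = Tuple[int, int]
# Hypothetical test-case record: (category, node count, undirected edge list, expected answer).
TestCase = Tuple[str, int, List[Edge], int]


def count_components(n: int, edges: List[Edge]) -> int:
    """Stand-in for model-generated code: count connected components with BFS."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen = set()
    components = 0
    for start in range(n):
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return components


def pass_rate_by_category(
    solution: Callable[[int, List[Edge]], int], cases: List[TestCase]
) -> Dict[str, float]:
    """Run the solution on every case and return the fraction passed per category."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, n, edges, expected in cases:
        total[category] += 1
        try:
            if solution(n, edges) == expected:
                passed[category] += 1
        except Exception:
            pass  # a crashing solution counts as a failed test case
    return {category: passed[category] / total[category] for category in total}


# Illustrative cases only; real benchmark cases would be far larger and more varied.
CASES: List[TestCase] = [
    ("sparse", 5, [(0, 1), (2, 3)], 3),
    ("complete", 4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)], 1),
    ("regular", 6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)], 2),
    ("planar", 4, [(0, 1), (1, 2), (2, 3), (3, 0)], 1),
]

if __name__ == "__main__":
    print(pass_rate_by_category(count_components, CASES))
```

Reporting the pass rate per category rather than one overall number is what lets a benchmark of this kind show where a model is weaker, for example on one graph type versus another.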

GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets
