Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis

Chang Liu, Bo Wu
2023-09-09
Abstract: Large Language Models (LLMs) have garnered considerable interest within both academia and industry. Yet, the application of LLMs to graph data remains under-explored. In this study, we evaluate the capabilities of four LLMs in addressing several analytical problems with graph data. We employ four distinct evaluation metrics: Comprehension, Correctness, Fidelity, and Rectification. Our results show that: 1) LLMs effectively comprehend graph data in natural language and reason with graph topology. 2) GPT models can generate logical and coherent results, outperforming alternatives in correctness. 3) All examined LLMs face challenges in structural reasoning, with techniques like zero-shot chain-of-thought and few-shot prompting showing diminished efficacy. 4) GPT models often produce erroneous answers in multi-answer tasks, raising concerns about fidelity. 5) GPT models exhibit elevated confidence in their outputs, potentially hindering their rectification capacities. Notably, GPT-4 has demonstrated the capacity to rectify responses from GPT-3.5-turbo and its own previous iterations. The code is available at: https://github.com/Ayame1006/LLMtoGraph
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper evaluates the capabilities of large language models (LLMs) in handling graph data. Although LLMs have shown outstanding performance in natural language understanding and generation, question answering, and information retrieval, their application to graph data is still at an exploratory stage. The paper systematically evaluates four LLMs (two open-source models and two GPT models) on graph data, revealing their strengths and weaknesses in understanding, reasoning about, and generating graph structures.

### Specific Research Objectives

1. **Evaluate LLMs' understanding of graph data**:
   - Design a series of tasks that test LLMs' understanding of graph data, including graph connectivity, node neighbor classification, node degree, pattern matching, and shortest path.
2. **Compare the performance of different LLMs**:
   - Evaluate the performance differences of four LLMs on graph data, particularly comparing the GPT models with the open-source models.
3. **Explore the impact of different prompting techniques on LLMs' performance**:
   - Use zero-shot prompting, zero-shot chain-of-thought prompting, and few-shot prompting, and evaluate how effectively each improves LLMs' performance.
4. **Analyze LLMs' performance in multi-answer tasks**:
   - Investigate the accuracy, confidence, and self-correction ability of LLMs in multi-answer tasks.

### Main Contributions

1. **Designed a set of queries to evaluate LLMs' understanding of graph data**:
   - These queries demonstrate LLMs' ability to extract, understand, and analyze the topological structures within graph data.
2. **Evaluated the performance of four LLMs in understanding graph data**:
   - The comparison shows that the two open-source models significantly underperform the two GPT models in generating correct answers.
3. **Designed specific tasks to evaluate LLMs' capabilities in graph topological reasoning**:
   - As tasks demand more graph topological reasoning, the accuracy of LLMs decreases significantly.
4. **Explored the effectiveness of advanced prompting techniques in graph topological reasoning**:
   - Zero-shot chain-of-thought prompting and few-shot prompting do not always improve LLMs' performance and can sometimes lead to incorrect outputs.
5. **Analyzed the performance of GPT models in multi-answer tasks**:
   - GPT models generate a large number of inaccurate answers in multi-answer tasks, with GPT-3.5-turbo's error rate exceeding its rate of correct answers.
6. **Explored the confidence and self-correction ability of GPT models**:
   - GPT models exhibit high confidence when generating responses, which may hinder their self-correction ability. GPT-4 shows stronger self-correction capabilities and can correct errors from GPT-3.5-turbo and from its own earlier iterations.

### Conclusion

Through systematic experiments and evaluations, this paper reveals the potential and limitations of LLMs in handling graph data. While GPT models perform well on certain tasks, they still face challenges in multi-answer tasks and in complex graph topological reasoning. Future research can further improve LLMs' performance on graph data so that they can be better applied to real-world problems.
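To make the evaluation setup summarized above more concrete, the following is a minimal sketch of how a graph could be described in natural language, wrapped in the three prompting styles, and checked against ground truth. It is not the authors' code: the toy edge list, the prompt wording, the function names, and the way multi-answer responses are scored are all assumptions made for illustration, and the paper's exact metric definitions are not reproduced in this summary.

```python
from collections import deque

# Hypothetical toy graph; this edge list is illustrative, not from the paper.
EDGES = [(0, 1), (1, 2), (2, 3), (4, 5)]


def build_adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def describe_graph(edges):
    """Render the graph as plain text (an assumed prompt format)."""
    pairs = ", ".join(f"({u}, {v})" for u, v in edges)
    return f"An undirected graph has the edges: {pairs}."


def build_prompt(edges, question, style="zero-shot", examples=()):
    """Assemble a prompt; the wording of each style is an assumption."""
    parts = []
    if style == "few-shot":
        # Prepend worked question/answer pairs before the real question.
        for q, a in examples:
            parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"{describe_graph(edges)}\nQ: {question}\nA:")
    prompt = "\n\n".join(parts)
    if style == "zero-shot-cot":
        # Commonly used zero-shot chain-of-thought trigger phrase.
        prompt += " Let's think step by step."
    return prompt


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns None when target is unreachable."""
    if source not in adj or target not in adj:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def is_connected(adj, source, target):
    """Ground truth for a connectivity question between two nodes."""
    return shortest_path_length(adj, source, target) is not None


def score_multi_answer(predicted, reference):
    """Count correct, incorrect, and missed items in a multi-answer response."""
    predicted, reference = set(predicted), set(reference)
    return {
        "correct": len(predicted & reference),
        "incorrect": len(predicted - reference),
        "missed": len(reference - predicted),
    }


if __name__ == "__main__":
    adj = build_adjacency(EDGES)
    print(build_prompt(EDGES, "What is the degree of node 1?", style="zero-shot-cot"))
    print("degree of node 1:", len(adj[1]))
    print("shortest path 0 -> 3:", shortest_path_length(adj, 0, 3))
    # A made-up model response listing the neighbors of node 1.
    print(score_multi_answer([0, 2, 3], adj[1]))
```

Separating the counts of correct and incorrect items, rather than reporting a single accuracy figure, is one way to expose the behavior noted for multi-answer tasks, where a response can contain every right answer and still include wrong ones.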