Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle?

Johannes Frey, Lars-Peter Meyer, Natanael Arndt, Felix Brei, Kirill Bulert

DOI: https://doi.org/10.48550/arXiv.2309.17122

2023-09-29

Abstract: Large Language Models (LLMs) are advancing at a rapid pace, with significant improvements at natural language processing and coding tasks. Yet, their ability to work with formal languages representing data, specifically within the realm of knowledge graph engineering, remains under-investigated. To evaluate the proficiency of various LLMs, we created a set of five tasks that probe their ability to parse, understand, analyze, and create knowledge graphs serialized in Turtle syntax. These tasks, each embodying distinct degrees of complexity and being able to scale with the size of the problem, have been integrated into our automated evaluation system, the LLM-KG-Bench. The evaluation encompassed four commercially available LLMs - GPT-3.5, GPT-4, Claude 1.3, and Claude 2.0, as well as two freely accessible offline models, GPT4All Vicuna and GPT4All Falcon 13B. This analysis offers an in-depth understanding of the strengths and shortcomings of LLMs in relation to their application within RDF knowledge graph engineering workflows utilizing Turtle representation. While our findings show that the latest commercial models outperform their forerunners in terms of proficiency with the Turtle language, they also reveal an apparent weakness. These models fall short when it comes to adhering strictly to the output formatting constraints, a crucial requirement in this context.

Artificial Intelligence, Computation and Language, Databases

What problem does this paper attempt to address?

This paper evaluates how well large language models (LLMs) support knowledge graph engineering, focusing on RDF (Resource Description Framework) knowledge graphs serialized in Turtle. The authors define five tasks that probe a model's ability to parse, understand, analyze, and create knowledge graphs in Turtle syntax. The tasks differ in complexity and scale with the size of the problem, and they are integrated into the authors' automated evaluation system, LLM-KG-Bench. Six models are evaluated: four commercially available LLMs (GPT-3.5, GPT-4, Claude 1.3, and Claude 2.0) and two freely available offline models (GPT4All Vicuna and GPT4All Falcon 13B). The results show that the latest commercial models are more proficient with Turtle than their predecessors, but also that they fall short in adhering strictly to output format constraints. That weakness matters for knowledge graph engineering workflows, which depend on precise and consistently formatted output.
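To make the output-format weakness concrete, the following is a minimal sketch of how a response might be classified before its Turtle content is parsed. It is not taken from LLM-KG-Bench: the function name `classify_response`, the three status labels, and the example answer are all hypothetical, and the sketch only detects Markdown code fences and surrounding prose rather than validating Turtle.

```python
import re

# Hypothetical helper, not part of LLM-KG-Bench. It only separates answers
# that are bare text from answers wrapped in a Markdown code fence.
FENCE = re.compile(r"```[a-zA-Z]*\n(.*?)```", re.DOTALL)


def classify_response(response: str) -> tuple[str, str]:
    """Return (status, payload) for an answer that should be bare Turtle.

    status is "bare" when no code fence is present, "fenced" when the whole
    answer is one code fence, and "mixed" when extra text surrounds a fence.
    "bare" does not imply valid Turtle; the payload would still have to be
    handed to an RDF parser to confirm that.
    """
    text = response.strip()
    match = FENCE.search(text)
    if match is None:
        return "bare", text
    payload = match.group(1).strip()
    # Everything outside the first fence counts as a format violation.
    outside = (text[:match.start()] + text[match.end():]).strip()
    return ("fenced" if not outside else "mixed"), payload


# Assumed example answer: correct Turtle, but wrapped in chat-style prose.
answer = (
    "Here is the graph:\n"
    "```turtle\n"
    "@prefix ex: <http://example.org/> .\n"
    "ex:a ex:knows ex:b .\n"
    "```\n"
    "Let me know if you need more."
)
status, turtle = classify_response(answer)
print(status)
print(turtle)
```

Separating the format check from the syntax check lets an evaluation distinguish a model that produces correct Turtle but wraps it in explanation from one that produces invalid Turtle, which are different failure modes.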
