Benchmarking the Abilities of Large Language Models for RDF Knowledge Graph Creation and Comprehension: How Well Do LLMs Speak Turtle?

Johannes Frey, Lars-Peter Meyer, Natanael Arndt, Felix Brei, Kirill Bulert
DOI: https://doi.org/10.48550/arXiv.2309.17122
2023-09-29
Abstract: Large Language Models (LLMs) are advancing at a rapid pace, with significant improvements in natural language processing and coding tasks. Yet, their ability to work with formal languages representing data, specifically within the realm of knowledge graph engineering, remains under-investigated. To evaluate the proficiency of various LLMs, we created a set of five tasks that probe their ability to parse, understand, analyze, and create knowledge graphs serialized in Turtle syntax. These tasks, each embodying distinct degrees of complexity and being able to scale with the size of the problem, have been integrated into our automated evaluation system, the LLM-KG-Bench. The evaluation encompassed four commercially available LLMs (GPT-3.5, GPT-4, Claude 1.3, and Claude 2.0) as well as two freely accessible offline models, GPT4All Vicuna and GPT4All Falcon 13B. This analysis offers an in-depth understanding of the strengths and shortcomings of LLMs in relation to their application within RDF knowledge graph engineering workflows utilizing Turtle representation. While our findings show that the latest commercial models outperform their forerunners in terms of proficiency with the Turtle language, they also reveal an apparent weakness: these models fall short when it comes to adhering strictly to the output formatting constraints, a crucial requirement in this context.
Artificial Intelligence, Computation and Language, Databases
What problem does this paper attempt to address?
This paper evaluates the capabilities of large language models (LLMs) in knowledge graph engineering, specifically how well they handle RDF (Resource Description Framework) knowledge graphs serialized in the Turtle format. The authors design five tasks that probe a model's ability to parse, understand, analyze, and create knowledge graphs written in Turtle. The tasks differ in complexity, and each can be scaled with the size of the problem. They are integrated into the automated evaluation system LLM-KG-Bench, with the goal of identifying the specific strengths and weaknesses of different LLMs when Turtle is used in RDF knowledge graph engineering workflows. Six models are evaluated: four commercially available LLMs (GPT-3.5, GPT-4, Claude 1.3, and Claude 2.0) and two freely available offline models (GPT4All Vicuna and GPT4All Falcon 13B). The results show that the latest commercial models handle Turtle better than their predecessors, but they also expose a clear weakness: the models do not reliably adhere to strict output formatting constraints. This matters for knowledge graph engineering applications, which depend on precise, consistently formatted output that downstream tools can process automatically.
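The distinction between producing usable Turtle and strictly following the requested output format can be made concrete with a small check. The sketch below is not LLM-KG-Bench code. It assumes a hypothetical prompt convention in which the model must answer with exactly one fenced block labelled `turtle` (or `ttl`) and nothing else. The names `FormatCheck` and `check_turtle_answer` are illustrative. The function reports two things separately: whether the answer is strictly compliant, and what Turtle text could still be recovered leniently. A fuller check would then pass the extracted text to an actual Turtle parser, which this sketch leaves out.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical convention: the answer must be exactly one fenced block labelled
# "turtle" or "ttl". `{3} matches three backticks without writing them literally.
FENCE_RE = re.compile(r"`{3}(?:turtle|ttl)?[ \t]*\n(.*?)\n`{3}", re.DOTALL)


@dataclass
class FormatCheck:
    strict: bool            # True only if the response is one fenced block, nothing else
    turtle: Optional[str]   # Turtle text recovered from the first block, if any


def check_turtle_answer(response: str) -> FormatCheck:
    """Lenient extraction plus a strict format-adherence flag (illustrative only)."""
    blocks = FENCE_RE.findall(response)
    if not blocks:
        # No usable block at all: neither strict nor recoverable.
        return FormatCheck(strict=False, turtle=None)
    # Any text left outside the fences (e.g. "Here is the graph:") breaks strictness.
    leftover = FENCE_RE.sub("", response).strip()
    strict = len(blocks) == 1 and leftover == ""
    return FormatCheck(strict=strict, turtle=blocks[0])
```

Keeping the two signals apart means a model that wraps correct Turtle in explanatory prose is not scored the same as one that returns no graph at all. The formatting failure is still recorded.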