Integrating Large Language Models and Knowledge Graphs for Extraction and Validation of Textual Test Data

Antonio De Santis, Marco Balduini, Federico De Santis, Andrea Proia, Arsenio Leo, Marco Brambilla, Emanuele Della Valle
2024-08-03
Abstract: Aerospace manufacturing companies, such as Thales Alenia Space, design, develop, integrate, verify, and validate products characterized by high complexity and low volume. They carefully document all phases for each product but analyses across products are challenging due to the heterogeneity and unstructured nature of the data in documents. In this paper, we propose a hybrid methodology that leverages Knowledge Graphs (KGs) in conjunction with Large Language Models (LLMs) to extract and validate data contained in these documents. We consider a case study focused on test data related to electronic boards for satellites. To do so, we extend the Semantic Sensor Network ontology. We store the metadata of the reports in a KG, while the actual test results are stored in parquet accessible via a Virtual Knowledge Graph. The validation process is managed using an LLM-based approach. We also conduct a benchmarking study to evaluate the performance of state-of-the-art LLMs in executing this task. Finally, we analyze the costs and benefits of automating preexisting processes of manual data extraction and validation for subsequent cross-report analyses.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the extraction and validation of test data in aerospace manufacturing. Specifically, it focuses on extracting and validating test data from test reports for satellite electronic boards (in particular printed circuit boards, PCBs).

### Main Problems Addressed by the Paper

1. **Challenges in Data Extraction**: Test reports (mainly in .docx and .pdf formats) are highly fragmented, heterogeneous, and unstructured, so extracting test data from them manually is time-consuming and error-prone.
2. **Difficulties in Data Validation**: Test results are currently validated largely by hand, which is costly and inefficient. Automating this step is hard because the data is so heterogeneous that traditional validation based on regular expressions cannot cope with the variety.

### Proposed Method

The paper proposes a hybrid approach that combines Large Language Models (LLMs) and Knowledge Graphs (KGs) to address these issues:

1. **Utilizing Knowledge Graphs (KGs)**: Build a knowledge graph on an extension of the Semantic Sensor Network (SSN) ontology to capture the semantics of the data and manage structural heterogeneity. The metadata of the test reports is stored in the knowledge graph, while the actual test results are kept in structured data storage (Parquet, according to the abstract).
2. **Using Large Language Models (LLMs) for Validation**: Use LLMs to check the consistency of test data automatically, which handles the syntactic and structural heterogeneity of the data. Data engineers can then focus on the data points that the LLMs flag as anomalies, significantly reducing their workload.
3. **Virtual Knowledge Graph (VKG) for Data Access**: Construct a virtual knowledge graph that maps the data storage to the ontology, so that users can access validated test data directly through SPARQL queries. This simplifies data integration and analysis.

Through this approach, the paper aims to increase the automation of test data extraction and validation, thereby reducing human error, speeding up data analysis, and ultimately improving production efficiency and product quality.
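The LLM-based validation step described above (the model checks each extracted result, and engineers review only what it flags) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the record fields, the prompt wording, the JSON reply format, and the `fake_llm` stub are all assumptions made for illustration, and a real deployment would replace the stub with a call to an actual model.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExtractedResult:
    """One row extracted from a test report (field names are hypothetical)."""
    report_id: str
    parameter: str
    measured: str  # raw text as extracted, e.g. "3.31 V"
    expected: str  # raw text of the requirement, e.g. "3.3 V +/- 5%"


def build_prompt(row: ExtractedResult) -> str:
    """Assumed prompt format asking the model for a JSON verdict."""
    return (
        "You are checking a satellite electronic board test report.\n"
        f"Parameter: {row.parameter}\n"
        f"Measured: {row.measured}\n"
        f"Expected: {row.expected}\n"
        'Answer with JSON only: {"consistent": true|false, "reason": "..."}'
    )


def flag_anomalies(
    rows: List[ExtractedResult], llm: Callable[[str], str]
) -> List[ExtractedResult]:
    """Return the rows that the model does not confirm as consistent."""
    flagged = []
    for row in rows:
        reply = llm(build_prompt(row))
        try:
            consistent = json.loads(reply).get("consistent") is True
        except (json.JSONDecodeError, AttributeError):
            # Replies that are not a JSON object are sent to manual review.
            consistent = False
        if not consistent:
            flagged.append(row)
    return flagged


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: rejects any prompt containing 'FAIL'."""
    return json.dumps({"consistent": "FAIL" not in prompt, "reason": "stub"})


rows = [
    ExtractedResult("R1", "Supply voltage", "3.31 V", "3.3 V +/- 5%"),
    ExtractedResult("R2", "Output status", "FAIL", "PASS"),
]
for row in flag_anomalies(rows, fake_llm):
    print(row.report_id, row.parameter)
```

Treating any unparseable reply as an anomaly is a deliberately conservative choice in this sketch: a malformed model answer costs one extra manual review rather than letting an unchecked value through.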