Automatic knowledge-graph creation from historical documents: The Chilean dictatorship as a case study

Camila Díaz, Jocelyn Dunstan, Lorena Etcheverry, Antonia Fonck, Alejandro Grez, Domingo Mery, Juan Reutter, Hugo Rojas
2024-08-22
Abstract: We present our results on the automatic construction of a knowledge graph from historical documents related to the Chilean dictatorship period (1973-1990). Our approach consists of using LLMs to automatically recognize entities and the relations between these entities, and also to perform resolution between these sets of values. To prevent hallucination, the interaction with the LLM is grounded in a simple ontology with 4 types of entities and 7 types of relations. To evaluate our architecture, we use a gold-standard graph constructed from a small subset of the documents, and compare it to the graph obtained from our approach when processing the same set of documents. Results show that the automatic construction recognizes a good portion of the entities in the gold standard, and that the entities it does not recognize are mostly explained by the level of granularity at which the information is structured in the graph, rather than by the automatic approach missing an important entity. Looking forward, we expect this report to encourage work on similar projects focused on enhancing research in the humanities and social sciences, but we remark that better evaluation metrics are needed in order to accurately fine-tune these types of architectures.
Computers and Society
What problem does this paper attempt to address?
This paper addresses the problem of automatically constructing a knowledge graph from a large number of historical documents related to the Chilean dictatorship period (1973-1990). Specifically, the researchers use large language models (LLMs) to automatically identify entities and the relationships between them, and they reduce model hallucination by grounding the extraction in a simple ontology. They also expect the method to help integrate and analyze the information in these historical documents, thereby supporting research on this important historical period.

### Research Background

Knowledge graphs have great potential for analyzing historical documents. Constructing a knowledge graph shifts the focus from documents to entities, so users can more easily find relevant entities and the information associated with them. However, constructing a knowledge graph by hand is time-consuming and costly: all relevant documents must be read and organized to ensure that the entities and relationships are accurate.

### Main Challenges

1. **Entity Recognition**: Accurately identify all relevant entities in a large number of historical documents.
2. **Relationship Extraction**: Determine the relationships between these entities.
3. **Avoid Hallucination**: Prevent LLMs from generating inaccurate or non-existent information.
4. **Evaluate Quality**: Ensure that the quality of the automatically generated knowledge graph meets the required standard.

### Solutions

To address these challenges, the researchers propose an LLM-based method for automatically constructing a knowledge graph. The steps are as follows:

1. **Use a Simple Ontology**: Define four types of entities (individuals, events, locations, organizations) and seven types of relationships (such as the relationship between individuals and organizations, the relationship between organizations and events, etc.) to guide the LLM in entity and relationship extraction.
2. **Zero-shot Prompting**: Send specific prompts to the LLM through OpenAI's API so that it identifies entities and relationships in the documents.
3. **Entity Resolution**: Remove duplicate entities and correct possible errors.
4. **Graph Post-processing**: Remove redundant and incorrect edges (relationships), and further refine the graph structure by merging redundant nodes.

### Evaluation Methods

To verify the effectiveness of the method, the researchers used a gold-standard graph constructed by domain experts as a benchmark and compared it with the automatically generated graph. The results show that the method performs well at recognizing individuals; for organizations, events, and locations there are some deviations, but the method still captures the main information overall.

### Future Work

The researchers plan to refine the prompting strategy to improve recognition accuracy for the different types of entities and relationships. They also plan to create a labeled corpus to systematically evaluate the performance of tools such as ChatGPT in entity recognition, and to explore other possible methods, such as named-entity recognition algorithms.

In summary, this research aims to support historical research on the Chilean dictatorship period through automated means: filling information gaps, improving contextual understanding, and revealing deeper connections and patterns.
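The ontology-guided filtering, entity resolution, and gold-standard comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Entity`, `filter_by_ontology`, `resolve`, and `recall_by_type` are hypothetical, and the LLM call itself is left out. Only the two relation types named in the summary are listed (the other five are not filled in), and treating them as directed pairs is an assumption. Entity resolution here uses plain string normalization as a stand-in for the LLM-based resolution in the paper, and per-type recall is one possible way to compare against a gold-standard graph, not necessarily the metric the authors used.

```python
from dataclasses import dataclass
import unicodedata

# The four entity types named in the summary.
ENTITY_TYPES = {"individual", "event", "location", "organization"}

# Assumption: only the two relation types named in the summary are listed,
# modelled as directed (source type, target type) pairs. The remaining five
# are not specified in the text and are deliberately left out.
ALLOWED_RELATIONS = {
    ("individual", "organization"),
    ("organization", "event"),
}


@dataclass(frozen=True)
class Entity:
    name: str
    type: str


def normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())


def filter_by_ontology(entities, relations):
    """Drop entities and relations that the ontology does not allow."""
    kept_entities = [e for e in entities if e.type in ENTITY_TYPES]
    kept = set(kept_entities)
    kept_relations = [
        (src, dst)
        for src, dst in relations
        if src in kept and dst in kept
        and (src.type, dst.type) in ALLOWED_RELATIONS
    ]
    return kept_entities, kept_relations


def resolve(entities):
    """Merge entities sharing a type and normalized name (first one wins)."""
    canonical = {}
    for e in entities:
        canonical.setdefault((e.type, normalize(e.name)), e)
    return canonical


def recall_by_type(gold, predicted):
    """Fraction of gold entities of each type found in the predicted set."""
    gold_keys = {(e.type, normalize(e.name)) for e in gold}
    pred_keys = {(e.type, normalize(e.name)) for e in predicted}
    result = {}
    for entity_type in ENTITY_TYPES:
        gold_of_type = {k for k in gold_keys if k[0] == entity_type}
        if gold_of_type:
            result[entity_type] = len(gold_of_type & pred_keys) / len(gold_of_type)
    return result
```

In this sketch, an extracted entity whose type falls outside the four ontology types is discarded before it reaches the graph, which is one simple way of keeping hallucinated categories out. Exact matching on normalized names will count an entity as missed whenever the gold standard records it at a different level of granularity, which is the kind of mismatch the abstract reports.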