Matching Tabular Data to Knowledge Graph Based on Multi-level Scoring Filters for Table Entity Disambiguation

Xinhe Li, Chenghuan Jiang, Peng Wang
DOI: https://doi.org/10.1007/978-981-97-7235-3_15
2024-01-01
Abstract: Tabular data to knowledge graph matching (TDKGM) aims to assign semantic tags from knowledge graphs (KGs) to the elements of tables. It comprises three tasks: Column Type Annotation (CTA), Cell Entity Annotation (CEA), and Column Property Annotation (CPA). The task is non-trivial because metadata is often missing, incomplete, or ambiguous, which makes entity disambiguation more difficult. Previous approaches are mostly based on two representative paradigms: heuristic-based and deep learning-based methods. However, the former is less robust when tackling real-world Web tables, while the latter requires much training data and time. Consequently, we introduce table context semantics and propose a simple yet effective annotation method, MUSTED (MUlti-level Scoring filters for Table Entity Disambiguation). First, we preprocess the tabular data via table structure analysis, spell correction, and entity recall. Then, we assign scores to the candidate entities of the cells, based on the similarities between table cells and property values in KGs. After that, we determine the subject column and find property-based matches. Finally, we complete the three semantic annotation tasks based on scores from filter-centered disambiguation, without the help of non-table information (e.g., table headers and table names). Experimental results on public datasets demonstrate that MUSTED can disambiguate entity mentions and significantly outperform strong baselines in F1-score while using less time and memory.
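The scoring step described in the abstract (ranking a cell's candidate entities by how well the other cells in the same row match the candidate's property values in the KG) can be illustrated with a minimal sketch. This is not the MUSTED implementation: the similarity measure (`difflib.SequenceMatcher`), the averaging rule, the function names, and the toy candidate data are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_candidate(context_cells, property_values):
    """Average, over the row's other cells, of the best match
    against any property value of the candidate entity."""
    if not context_cells or not property_values:
        return 0.0
    best = [max(similarity(c, v) for v in property_values)
            for c in context_cells]
    return sum(best) / len(best)


def disambiguate(context_cells, candidates):
    """Return (best candidate id, all scores).

    `candidates` maps a hypothetical entity id to a list of its
    property values as strings.
    """
    scores = {cid: score_candidate(context_cells, values)
              for cid, values in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy row: the mention "Paris" plus two context cells from the same row.
# Entity ids and property values below are made up for illustration.
row_context = ["France", "Europe"]
toy_candidates = {
    "paris_fr": ["France", "Europe", "Seine"],
    "paris_tx": ["United States", "Texas", "Lamar County"],
}

best_id, all_scores = disambiguate(row_context, toy_candidates)
print(best_id, all_scores)
```

In this sketch the row context alone separates the two candidates, which mirrors the abstract's point that table headers and table names are not needed; a real system would also need entity recall, subject column detection, and property matching, which are omitted here.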