IDEL: In-Database Entity Linking with Neural Embeddings

Torsten Kilias, Alexander Löser, Felix A. Gers, Richard Koopmanschap, Ying Zhang, Martin Kersten
DOI: https://doi.org/10.48550/arXiv.1803.04884
2018-03-13
Abstract: We present a novel architecture, In-Database Entity Linking (IDEL), in which we integrate the analytics-optimized RDBMS MonetDB with neural text mining abilities. Our system design abstracts core tasks of most neural entity linking systems for MonetDB. To the best of our knowledge, this is the first de facto implemented system integrating entity linking in a database. We leverage the ability of MonetDB to support in-database analytics with user-defined functions (UDFs) implemented in Python. These functions call machine learning libraries for neural text mining, such as TensorFlow. The system achieves zero cost for data shipping and transformation by utilizing MonetDB's ability to embed Python processes in the database kernel and exchange data in NumPy arrays. IDEL represents text and relational data in a joint vector space with neural embeddings and can compensate for errors with ambiguous entity representations. For detecting matching entities, we propose a novel similarity function based on joint neural embeddings, which are learned by minimizing a pairwise contrastive ranking loss. This function utilizes high-dimensional index structures for fast retrieval of matching entities. Our first implementation and experiments using the WebNLG corpus show the effectiveness and potential of IDEL.
Databases, Computation and Language, Neural and Evolutionary Computing
What problem does this paper attempt to address?
The paper addresses entity linking (EL) between relational databases and text data. The authors propose a new architecture, IDEL (In-Database Entity Linking), which combines the analytics-optimized relational database management system (RDBMS) MonetDB with neural text mining capabilities. The system design abstracts the core tasks of most neural entity linking systems for MonetDB.

### Main problems

1. **Data migration cost**: Existing entity linking systems usually run independently of the RDBMS, so users must move data between systems. This adds technical burden and can lead to data inconsistency and higher maintenance costs.
2. **Feature engineering complexity**: Traditional entity linking methods rely on manual feature engineering, which is time-consuming and requires expert knowledge.
3. **Fuzzy matching problem**: Existing methods have low recall and precision when dealing with synonyms, homophones, context-dependent words, and spelling mistakes.

### Solutions

IDEL addresses these problems as follows:

1. **Integrating the RDBMS and entity linking tools**: IDEL brings relational data, text data, and entity linking tools into one system, with MonetDB as the infrastructure. This avoids the cost of data migration and makes the query processing and data analysis capabilities of the RDBMS available to the linking task.
2. **Automatic feature learning**: IDEL represents text and relational data with neural embeddings, which reduces the need for manual feature engineering. The system learns features in its hidden layers and so extracts useful signals from relational and text data automatically.
3. **Similarity function and high-dimensional index structure**: IDEL proposes a new similarity function based on joint neural embeddings, which are learned by minimizing a pairwise contrastive ranking loss. In addition, the system uses a high-dimensional index structure to retrieve matching entities quickly, improving the efficiency and accuracy of fuzzy matching.

### Technical details

1. **Vectorization**: Relational data and text data are converted into vector representations, with pre-trained models (such as SkipThought) generating the initial embeddings.
2. **Matching candidates**: Matching candidates are generated through SQL queries, with support for both exact and semantic matching strategies.
3. **Linking**: Matching candidates are ranked and selected to produce the final entity linking results.
4. **Retraining**: The neural network is retrained on matching candidates to improve the model and adapt it to changes in the data.

### Experimental verification

The authors conducted a preliminary evaluation on the WebNLG corpus. The results indicate that IDEL is effective across multiple entity types and a large number of manually annotated sentences.

In summary, IDEL targets three weaknesses of traditional entity linking methods: high data migration cost, complex feature engineering, and poor fuzzy matching. It offers an automated approach to entity linking between relational databases and text data that runs inside the database.
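To make the matching and training steps more concrete, the following is a minimal sketch in plain Python. It is not the IDEL implementation. It assumes cosine similarity as the similarity function, a hinge-style form of the pairwise contrastive ranking loss with an arbitrary margin of 0.2, and a brute-force scan in place of the high-dimensional index. The function names, the threshold, the entity keys, and the three-dimensional toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_ranking_loss(text_vec, positive_vec, negative_vecs, margin=0.2):
    """Hinge-style contrastive ranking loss for one text mention.

    The loss is zero once the matching entity scores at least `margin`
    higher than every non-matching entity. The margin is an assumed value.
    """
    pos = cosine(text_vec, positive_vec)
    return sum(
        max(0.0, margin - pos + cosine(text_vec, neg))
        for neg in negative_vecs
    )


def link(text_vec, entity_vecs, threshold=0.5):
    """Return the key of the most similar entity, or None if no score exceeds the threshold.

    A linear scan stands in for the high-dimensional index described above.
    """
    best_key, best_score = None, threshold
    for key, vec in entity_vecs.items():
        score = cosine(text_vec, vec)
        if score > best_score:
            best_key, best_score = key, score
    return best_key


# Toy embeddings: two database entities and one text mention (hypothetical values).
entities = {
    "entity_a": [1.0, 0.0, 0.0],
    "entity_b": [0.0, 1.0, 0.0],
}
mention = [0.9, 0.1, 0.0]

linked = link(mention, entities)
loss_correct = pairwise_ranking_loss(mention, entities["entity_a"], [entities["entity_b"]])
loss_wrong = pairwise_ranking_loss(mention, entities["entity_b"], [entities["entity_a"]])
```

In this toy setup the mention lies close to `entity_a`, so the loss is zero when `entity_a` is treated as the positive example and positive when `entity_b` is. Training would adjust the embeddings to reduce this loss over many such pairs.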