Open Domain Knowledge Extraction for Knowledge Graphs

Kun Qian, Anton Belyi, Fei Wu, Samira Khorshidi, Azadeh Nikfarjam, Rahul Khot, Yisi Sang, Katherine Luna, Xianqi Chu, Eric Choi, Yash Govind, Chloe Seivwright, Yiwen Sun, Ahmed Fakhry, Theo Rekatsinas, Ihab Ilyas, Xiaoguang Qi, Yunyao Li
2023-10-31
Abstract: The quality of a knowledge graph directly impacts the quality of downstream applications (e.g., the number of answerable questions using the graph). One ongoing challenge when building a knowledge graph is to ensure completeness and freshness of the graph's entities and facts. In this paper, we introduce ODKE, a scalable and extensible framework that sources high-quality entities and facts from the open web at scale. ODKE utilizes a wide range of extraction models and supports both streaming and batch processing at different latencies. We reflect on the challenges and design decisions made and share lessons learned when building and deploying ODKE to grow an industry-scale open domain knowledge graph.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the challenge of keeping a knowledge graph complete and up to date as it is built. Specifically, it proposes a scalable framework called ODKE (Open Domain Knowledge Extraction), which extracts high-quality entities and facts from the open web to improve the coverage and freshness of a knowledge graph. The paper points out that traditional methods of constructing knowledge graphs rely on manual review, which is time-consuming, costly, and difficult to scale to large datasets. ODKE is therefore designed as an automated framework that continuously updates the facts in a knowledge graph to maintain its completeness and timeliness.

The design of ODKE takes the following key challenges into account:

1. **Large Data Volume**: The amount of data and facts on the web is enormous and constantly growing, so the system must process web-scale data.
2. **Data and Task Diversity**: The web contains many types of data, including plain text and semi-structured data. Extracting high-quality facts from these sources requires multiple types of extractors.
3. **High Accuracy**: Information on the web may contain errors or conflicting facts, such as different statements about a person's height. Some facts also change over time, so the system must identify the most accurate and up-to-date facts.
4. **Timeliness**: Extracting new knowledge from the web promptly and incorporating it into the knowledge graph is crucial for many downstream applications.

To address these challenges, ODKE combines a range of extraction models and supports both streaming and batch processing to meet different latency requirements, which improves the system's scalability and data freshness. It also addresses issues such as multilingual support and link inference.
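The "High Accuracy" challenge above (conflicting statements about the same fact, and facts that change over time) can be illustrated with a small sketch. This is not ODKE's actual method, which this summary does not describe. The `CandidateFact` type, the `resolve_conflicts` function, the `min_confidence` threshold, and the policy itself (prefer the value backed by the most distinct sources, break ties by the most recent observation) are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CandidateFact:
    """A hypothetical extracted fact; not a type defined by the paper."""
    subject: str
    predicate: str
    value: str
    source: str          # where the fact was extracted from
    confidence: float    # assumed extractor score in [0, 1]
    observed_on: date    # when the source was crawled or published


def resolve_conflicts(candidates, min_confidence=0.5):
    """Pick one value per (subject, predicate) pair.

    Assumed policy: drop low-confidence candidates, then prefer the value
    supported by the most distinct sources, breaking ties by recency.
    """
    grouped = defaultdict(list)
    for fact in candidates:
        if fact.confidence >= min_confidence:
            grouped[(fact.subject, fact.predicate)].append(fact)

    resolved = {}
    for key, facts in grouped.items():
        sources = defaultdict(set)   # value -> distinct supporting sources
        latest = {}                  # value -> most recent observation date
        for fact in facts:
            sources[fact.value].add(fact.source)
            if fact.value not in latest or fact.observed_on > latest[fact.value]:
                latest[fact.value] = fact.observed_on
        # Rank values by (number of sources, recency) and keep the best one.
        resolved[key] = max(sources, key=lambda v: (len(sources[v]), latest[v]))
    return resolved
```

Calling `resolve_conflicts` on a list of `CandidateFact` objects returns a dictionary keyed by `(subject, predicate)`. A production system would likely weigh source reliability and treat time-varying predicates differently from static ones; the fixed tie-breaking rule here is only meant to make the problem concrete.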
In summary, the paper proposes the ODKE framework to address shortcomings in existing knowledge graph construction, particularly in handling data scale, diversity, accuracy, and timeliness, and thereby to improve the quality and usability of knowledge graphs.