Knowledge-Enhanced Relation Extraction Dataset

Yucong Lin, Hongming Xiao, Jiani Liu, Zichao Lin, Keming Lu, Feifei Wang, Wei Wei
DOI: https://doi.org/10.48550/arXiv.2210.11231
2023-04-25
Abstract: Recently, knowledge-enhanced methods leveraging auxiliary knowledge graphs have emerged in relation extraction, surpassing traditional text-based approaches. However, to the best of our knowledge, there is currently no public dataset available that encompasses both evidence sentences and knowledge graphs for knowledge-enhanced relation extraction. To address this gap, we introduce the Knowledge-Enhanced Relation Extraction Dataset (KERED). KERED annotates each sentence with a relational fact, and it provides knowledge context for entities through entity linking. Using our curated dataset, we compared contemporary relation extraction methods under two prevalent task settings: sentence-level and bag-level. The experimental results show that the knowledge graphs provided by KERED can support knowledge-enhanced relation extraction methods. We believe that KERED offers high-quality relation extraction datasets with corresponding knowledge graphs for evaluating the performance of knowledge-enhanced relation extraction methods. Our dataset is available at: https://figshare.com/projects/KERED/134459
Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is the lack of a public relation extraction (RE) dataset that pairs evidence sentences with knowledge graphs (KGs). Specifically:

1. **Problem Background**:
   - Knowledge-enhanced relation extraction methods that use auxiliary knowledge graphs have surpassed traditional text-based methods.
   - However, there is currently no public dataset that contains both evidence sentences and knowledge graphs for training and evaluating knowledge-enhanced relation extraction methods.
2. **Specific Problems**:
   - Without a standardized benchmark dataset, it is difficult for researchers to report reproducible results or to compare the performance of existing methods.
   - Researchers have usually had to construct auxiliary knowledge graphs themselves, create their own datasets, and rerun previous benchmarks to obtain a fair comparison.
3. **Solutions**:
   - The paper introduces the "Knowledge-Enhanced Relation Extraction Dataset" (KERED) to fill this gap.
   - KERED refines three widely used RE datasets (NYT10m, Wiki20m, and Wiki80) and constructs auxiliary knowledge graphs for them.
   - Through entity linking and data refinement, KERED provides high-quality relation extraction datasets and their corresponding KGs for evaluating knowledge-enhanced relation extraction methods.
4. **Contributions**:
   - Developed KERED, comprising three challenging RE datasets and their auxiliary KGs, which is expected to promote research on knowledge-enhanced relation extraction.
   - Established evaluation metrics for knowledge-enhanced relation extraction methods on KERED and used these datasets to evaluate state-of-the-art RE methods.
   - Experimental results show that information from the auxiliary KG has a positive impact on relation extraction methods.
### Formula Explanation

The formulas in the paper are mainly used to evaluate experimental results. The key formulas are:

- **Micro F1**:
  \[ F1=\frac{2\times\text{precision}\times\text{recall}}{\text{precision}+\text{recall}} \]
  where
  \[ \text{precision}=\frac{TP}{TP + FP},\quad\text{recall}=\frac{TP}{TP + FN} \]
  Here \(TP\), \(FP\), and \(FN\) are the global numbers of true positives, false positives, and false negatives, counted over all relation classes together.
- **Micro AP (Average Precision)**:
  \[ AP=\sum_{i = 2}^{n}\text{precision}_i\times(\text{recall}_i-\text{recall}_{i - 1}) \]
  where \(\text{precision}_i\) and \(\text{recall}_i\) are the global precision and recall at the \(i\)-th threshold, and \(n\) is the total number of samples.

Through these improvements and evaluations, the paper provides important resources and benchmarks for knowledge-enhanced relation extraction, promoting further development in this field.
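
As a rough illustration of the micro F1 and micro AP formulas above, the following Python sketch computes both from pooled predictions. It is not the paper's evaluation code: the function names `micro_f1` and `micro_ap` and the `(confidence, is_correct)` input format are assumptions made for this example. The AP loop also starts at the first ranked prediction and takes the recall before it as 0, a common convention that differs slightly from the index range written in the formula.

```python
from typing import List, Tuple


def micro_f1(tp: int, fp: int, fn: int) -> float:
    """Micro F1 from global counts pooled over all relation classes."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_ap(scored: List[Tuple[float, bool]], total_positives: int) -> float:
    """Average precision over predictions ranked by confidence.

    `scored` holds hypothetical (confidence, is_correct) pairs pooled over
    all classes. Each rank is treated as one threshold, and the recall
    before the first rank is assumed to be 0.
    """
    if total_positives <= 0:
        return 0.0
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    ap = 0.0
    hits = 0
    prev_recall = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        hits += int(is_correct)
        precision = hits / rank            # precision at this threshold
        recall = hits / total_positives    # recall at this threshold
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


if __name__ == "__main__":
    print(micro_f1(tp=8, fp=2, fn=2))
    print(micro_ap([(0.9, True), (0.8, False), (0.7, True)], total_positives=2))
```

Because precision is only weighted by the change in recall, ranks where the prediction is wrong contribute nothing to the sum; only correct predictions move the total.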