Extracting Summary Knowledge Graphs from Long Documents

Zeqiu Wu, Rik Koncel-Kedziorski, Mari Ostendorf, Hannaneh Hajishirzi
DOI: https://doi.org/10.48550/arXiv.2009.09162
2021-06-14
Abstract: Knowledge graphs capture entities and relations from long documents and can facilitate reasoning in many downstream applications. Extracting compact knowledge graphs containing only salient entities and relations is important but challenging for understanding and summarizing long documents. We introduce a new text-to-graph task of predicting summarized knowledge graphs from long documents. We develop a dataset of 200k document/graph pairs using automatic and human annotations. We also develop strong baselines for this task based on graph learning and text summarization, and provide quantitative and qualitative studies of their effect.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the problem of extracting a compact knowledge graph from a long document, such as a scientific paper, that represents the document's most important information. Specifically, its goals are:

1. **Identify key entities and relationships**: Find the most salient entities in a long document and the relationships between them, and build a compact knowledge graph that reflects the document's core ideas.
2. **Improve the ability to understand and summarize long documents**: Use the extracted entities and relationships to make long documents easier to understand and to summarize concisely.
3. **Handle the volume of extracted information**: On long, dense documents such as scientific papers, traditional information extraction methods may extract hundreds or thousands of entities and relationships, so the challenge becomes determining which of them are the most important and representative.

The paper proposes a new text-to-graph task: predicting a summary knowledge graph from a long document. To this end, the authors build a dataset of 200,000 document/graph pairs and develop strong baseline models based on graph learning and text summarization. They also report quantitative and qualitative studies of the effectiveness of these models.

### Formula and symbol description

- \(D\) represents the input document.
- \(T_v\) represents the set of predefined entity types.
- \(T_R\) represents the set of predefined relationship types.
- \(G=(V, E)\) represents the predicted summary knowledge graph, where:
  - \(V\) is the set of entity nodes, and each \(v_i\in V\) represents an important entity with an entity type \(t_i\in T_v\).
  - \(E\) is the set of edges, and each edge \((v_i, v_j, r_{ij}^k)\in E\) represents an important relationship from \(v_i\) to \(v_j\) with a relationship type \(r_{ij}^k\in T_R\).

### Main contributions

1. **Introduce a new task**: Propose the task of extracting a summary knowledge graph from a long document.
2. **Construct a large-scale dataset**: Develop a dataset containing 200,000 documents and their corresponding knowledge graphs.
3. **Develop baseline models**: Develop two baseline models based on text summarization and graph learning techniques, and evaluate their effectiveness.
4. **Evaluation metrics**: Design metrics for entity salience, relationship salience, and entity repetition rate to assess the quality of model output.

Through these efforts, the paper aims to encourage research on future models that better capture complex relationships in text and can be applied to a variety of downstream tasks.
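
As an illustration of the notation \(G=(V, E)\) in the symbol description, the following is a minimal Python sketch of a typed summary graph. It is not the authors' implementation. The class names (`Entity`, `Edge`, `SummaryGraph`), the type inventories in `ENTITY_TYPES` and `RELATION_TYPES`, and the example entities are all hypothetical placeholders; the actual contents of \(T_v\) and \(T_R\) are defined by the paper's dataset and are not listed in this summary.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for T_v and T_R; the real inventories come from the dataset.
ENTITY_TYPES = {"Task", "Method", "Material"}
RELATION_TYPES = {"used-for", "evaluated-on"}


@dataclass(frozen=True)
class Entity:
    """A node v_i in V with an entity type t_i in T_v."""
    name: str
    type: str


@dataclass(frozen=True)
class Edge:
    """A directed edge (v_i, v_j, r) in E with relation type r in T_R."""
    head: Entity
    tail: Entity
    relation: str


@dataclass
class SummaryGraph:
    """G = (V, E): a set of typed entities and a set of typed, directed edges."""
    entities: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add_entity(self, name: str, type_: str) -> Entity:
        if type_ not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {type_}")
        entity = Entity(name, type_)
        self.entities.add(entity)
        return entity

    def add_edge(self, head: Entity, tail: Entity, relation: str) -> Edge:
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        if head not in self.entities or tail not in self.entities:
            raise ValueError("both endpoints must already be in V")
        edge = Edge(head, tail, relation)
        self.edges.add(edge)
        return edge


# Usage: a two-node graph with one directed, typed edge (example content is made up).
graph = SummaryGraph()
method = graph.add_entity("graph neural network", "Method")
task = graph.add_entity("summary graph extraction", "Task")
graph.add_edge(method, task, "used-for")
print(len(graph.entities), len(graph.edges))
```

Storing edges as a set of `(head, tail, relation)` triples allows more than one relation type between the same ordered pair of entities, which is one plausible reading of the superscript \(k\) in \(r_{ij}^k\).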