GraphLSS: Integrating Lexical, Structural, and Semantic Features for Long Document Extractive Summarization
Margarita Bugueño, Hazem Abou Hamdan, Gerard de Melo
2024-10-26
Abstract: Heterogeneous graph neural networks have recently gained attention for long document summarization, modeling the extraction as a node classification task. Although effective, these models often require external tools or additional machine learning models to define graph components, producing highly complex and less intuitive structures. We present GraphLSS, a heterogeneous graph construction for long document extractive summarization, incorporating Lexical, Structural, and Semantic features. It defines two levels of information (words and sentences) and four types of edges (sentence semantic similarity, sentence occurrence order, word in sentence, and word semantic similarity) without any need for auxiliary learning models. Experiments on two benchmark datasets show that GraphLSS is competitive with top-performing graph-based methods, outperforming recent non-graph models. We release our code on GitHub.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses three challenges in extractive summarization of long documents:
1. **Complexity of graph structure**: Existing heterogeneous graph neural networks perform well on long-document summarization, but they usually rely on external tools or additional machine learning models to define the graph components. The resulting structures are highly complex and less intuitive.
2. **Consistency of label generation**: Studies use different strategies to generate the sentence labels for extractive summarization, which makes reported model performance hard to compare. The paper argues that this inconsistency is an important and overlooked factor.
3. **Imbalanced datasets**: In long-document extractive summarization, non-relevant sentences usually far outnumber relevant ones, which makes model training harder.
To address these challenges, the paper proposes **GraphLSS**, a heterogeneous graph construction method that combines lexical, structural, and semantic features and needs no auxiliary learning model. Its contributions are as follows:
- **New heterogeneous graph construction**: Using lexical, structural, and semantic features, GraphLSS defines two types of nodes (sentences and words) and four types of edges (sentence occurrence order, sentence semantic similarity, word in sentence, and word semantic similarity).
- **Competitive experimental results**: Experiments on two benchmark datasets (PubMed and arXiv) show that GraphLSS is competitive with top-performing graph-based methods and outperforms recent non-graph models.
- **Code sharing**: The authors have released their code on GitHub, including the label extraction and graph data creation steps, to support reproducibility.
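The graph described in the first contribution can be illustrated with a small sketch. This is not the authors' released code: the function names, the toy vectors, and the 0.8 cosine threshold are assumptions chosen for illustration, and the summary above does not say how similarity edges are actually thresholded. The sketch only shows how two node types and four edge types might be assembled from precomputed word and sentence vectors.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_graph(sentences, word_vecs, sent_vecs, threshold=0.8):
    """Assemble a toy heterogeneous graph (hypothetical helper).

    sentences: list of token lists, in document order
    word_vecs: dict mapping each word to a vector (assumed precomputed)
    sent_vecs: list of sentence vectors (assumed precomputed)
    threshold: assumed cosine cutoff for the two similarity edge types
    """
    vocab = sorted({w for s in sentences for w in s})
    n = len(sentences)
    edges = {
        # Structural: consecutive sentences are linked in reading order.
        "sent_order": [(i, i + 1) for i in range(n - 1)],
        # Semantic: sentence pairs whose vectors are close enough.
        "sent_similarity": [
            (i, j) for i, j in combinations(range(n), 2)
            if cosine(sent_vecs[i], sent_vecs[j]) >= threshold
        ],
        # Lexical: each distinct word is linked to the sentences containing it.
        "word_in_sent": [
            (w, i) for i, s in enumerate(sentences) for w in sorted(set(s))
        ],
        # Semantic: word pairs whose vectors are close enough.
        "word_similarity": [
            (a, b) for a, b in combinations(vocab, 2)
            if cosine(word_vecs[a], word_vecs[b]) >= threshold
        ],
    }
    return {"word_nodes": vocab, "sent_nodes": list(range(n)), "edges": edges}


# Toy document with made-up 2-d vectors.
sentences = [["graph", "model"], ["graph", "network"], ["summary"]]
word_vecs = {
    "graph": [1.0, 0.0],
    "model": [0.0, 1.0],
    "network": [0.9, 0.1],
    "summary": [-1.0, 0.0],
}
sent_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
graph = build_graph(sentences, word_vecs, sent_vecs)
```

In this toy input, only the first two sentences and the words "graph" and "network" are close enough to receive similarity edges. In a real pipeline the dictionary would typically be converted to the input format of a graph learning library before training a node classifier over the sentence nodes.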
Together, these contributions simplify graph construction and make the resulting structure more intuitive and interpretable. They also help the model identify relevant sentences in highly imbalanced datasets, which improves extractive summarization of long documents.
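The summary above does not describe how the class imbalance is handled during training. One common option for imbalanced sentence classification, shown here only as an assumption and not as the authors' method, is to weight each class inversely to its frequency in the loss. The helper names `class_weights` and `weighted_bce` are hypothetical; the weighting formula is the same heuristic as scikit-learn's "balanced" class weights.

```python
import math
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_bce(probs, labels, weights):
    """Mean binary cross-entropy with a per-class weight on each sentence."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        log_term = math.log(p) if y == 1 else math.log(1 - p)
        total += -weights[y] * log_term
    return total / len(labels)


# Toy document: 9 non-relevant sentences (0) and 1 relevant sentence (1).
labels = [0] * 9 + [1]
weights = class_weights(labels)

# Same-sized mistake on a relevant vs. a non-relevant sentence.
miss_relevant = weighted_bce([0.1] * 9 + [0.1], labels, weights)
false_alarm = weighted_bce([0.9] + [0.1] * 8 + [0.9], labels, weights)
```

With these weights, missing the single relevant sentence costs more than an equally confident false alarm on a non-relevant one, which pushes the classifier away from predicting "non-relevant" everywhere.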