Abstract: Topic segmentation is critical for obtaining structured documents and improving downstream tasks such as information retrieval. Owing to their ability to automatically explore clues of topic shift from abundant labeled data, recent supervised neural models have greatly advanced long document topic segmentation, but they leave the deeper relationship between coherence and topic segmentation underexplored. Therefore, this paper enhances the ability of supervised models to capture coherence from both logical structure and semantic similarity perspectives to further improve topic segmentation performance, proposing Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL). Specifically, the TSSP task is proposed to force the model to comprehend structural information by learning the original relations between adjacent sentences in a disarrayed document, which is constructed by jointly disrupting the original document at topic and sentence levels. Moreover, we utilize inter- and intra-topic information to construct contrastive samples and design the CSSL objective to ensure that sentence representations in the same topic have higher similarity, while those in different topics are less similar. Extensive experiments show that the Longformer with our approach significantly outperforms previous state-of-the-art (SOTA) methods. Our approach improves $F_1$ over the previous SOTA by 3.42 points (73.74 -> 77.16) and reduces $P_k$ by 1.11 points (15.0 -> 13.89) on WIKI-727K, and achieves an average relative reduction of 4.3% in $P_k$ on WikiSection. The average relative $P_k$ drop of 8.38% on two out-of-domain datasets also demonstrates the robustness of our approach.
What problem does this paper attempt to address?
This paper addresses how to improve topic segmentation of long documents by strengthening coherence modeling. Although existing supervised neural models have made significant progress by leveraging large amounts of labeled data, they do not fully explore the deeper relationship between coherence and topic segmentation. The paper therefore proposes two auxiliary tasks, Topic-aware Sentence Structure Prediction (TSSP) and Contrastive Semantic Similarity Learning (CSSL), which enhance the ability of supervised models to capture coherence from the perspectives of logical structure and semantic similarity, and thereby further improve topic segmentation performance.
### Background of the Paper and Problem Definition
**Topic Segmentation** refers to automatically dividing text into non-overlapping, topically coherent segments. It is crucial for improving the readability and comprehensibility of documents and also plays a key role in many downstream tasks, such as information retrieval, information extraction, and document summarization. By the definition of a topic, each sentence should relate to the central idea of the topic it belongs to, and sentences from different topics should be distinguishable. Adjacent sentences within the same topic are therefore expected to be more similar than sentences from different topics.
### Limitations of Existing Methods
- **Unsupervised Methods**: They mainly infer topic boundaries by calculating text similarity or exploring topic representations of text, but these methods usually rely on shallow features.
- **Supervised Neural Models**: They can model deeper-level semantic information and mine clues of topic transitions from labeled data, but their utilization of context information is still limited, especially when dealing with long documents.
### Main Contributions of the Paper
1. **Research on Supervised Topic Segmentation of Long Documents**: The paper confirms the necessity of using longer context information.
2. **Two New Auxiliary Tasks**:
   - **TSSP**: By constructing disarrayed, incoherent documents, it enhances the model's ability to learn structural information between sentence pairs.
   - **CSSL**: By adjusting sentence representations, it ensures that sentences within the same topic have higher semantic similarity, while sentences from different topics have lower similarity.
3. **Experimental Verification**: The two tasks significantly improve topic segmentation performance on multiple benchmark datasets, and the results on out-of-domain data also demonstrate the generalization ability of the model.
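
The document disruption behind TSSP can be illustrated with a minimal sketch. The abstract states only that the document is disrupted at both topic and sentence levels and that the model learns the original relations between adjacent sentences; the specific shuffling strategy, the three-way label scheme, and the names `disarray` and `pair_labels` below are assumptions made for illustration, not the paper's exact procedure.

```python
import random


def disarray(topics, rng):
    """Return a disarrayed document as a flat list of (topic_id, position, sentence).

    Assumption: topic blocks are shuffled as units, then sentences are
    shuffled inside each block. The paper may disrupt documents differently.
    """
    # Tag every sentence with its original topic and position so the
    # original relation between any two sentences can be recovered later.
    tagged = [
        [(t, i, sent) for i, sent in enumerate(sents)]
        for t, sents in enumerate(topics)
    ]
    rng.shuffle(tagged)          # topic-level disruption
    for block in tagged:
        rng.shuffle(block)       # sentence-level disruption
    return [item for block in tagged for item in block]


def pair_labels(flat):
    """Label each adjacent pair by its ORIGINAL relation (hypothetical scheme).

    0: sentences come from different topics
    1: same topic, but not consecutive in the original order
    2: same topic, and the second sentence originally followed the first
    """
    labels = []
    for (t1, i1, _), (t2, i2, _) in zip(flat, flat[1:]):
        if t1 != t2:
            labels.append(0)
        elif i2 == i1 + 1:
            labels.append(2)
        else:
            labels.append(1)
    return labels


# Toy document with three topics.
doc = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3"]]
shuffled = disarray(doc, random.Random(0))
targets = pair_labels(shuffled)  # one target per adjacent sentence pair
```

A classifier head trained on such targets would have to recover the original structure from sentence content alone, which is the intuition the summary attributes to TSSP.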
### Method Overview
- **Baseline Model**: Treats topic segmentation as a sentence-level sequence labeling task and uses a pre-trained language model (such as BERT) for encoding.
- **TSSP Module**: Constructs disarrayed documents through data augmentation, enabling the model to learn the structural relationships between sentence pairs.
- **CSSL Module**: Builds positive and negative sample pairs and uses contrastive learning to adjust sentence representations so that they better reflect semantic similarity.
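
The contrastive idea behind the CSSL module can be sketched as follows. The source does not give the loss formula, so this example assumes a generic InfoNCE-style objective with cosine similarity and a temperature `tau`; the function names and the choice of using all same-topic sentences as positives are illustrative assumptions, not the paper's definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(embeddings, topic_ids, tau=0.1):
    """Average InfoNCE-style loss over all anchor sentences (assumed form).

    For each anchor, same-topic sentences are positives and
    different-topic sentences are negatives. Anchors lacking either
    positives or negatives are skipped.
    """
    losses = []
    for i, (anchor, topic) in enumerate(zip(embeddings, topic_ids)):
        pos, neg = [], []
        for j, (other, other_topic) in enumerate(zip(embeddings, topic_ids)):
            if j == i:
                continue
            score = math.exp(cosine(anchor, other) / tau)
            (pos if other_topic == topic else neg).append(score)
        if pos and neg:
            losses.append(-math.log(sum(pos) / (sum(pos) + sum(neg))))
    return sum(losses) / len(losses) if losses else 0.0


# Toy 2-D sentence embeddings: two sentences per topic.
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = contrastive_loss(embs, [0, 0, 1, 1])     # topics match the clusters
misaligned = contrastive_loss(embs, [0, 1, 0, 1])  # topics cut across clusters
```

In this toy setup `aligned` is much smaller than `misaligned`, so minimizing the loss pulls same-topic representations together and pushes different-topic ones apart, which matches the behavior the summary describes.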
### Experimental Results
- **In-Domain Performance**: On the WikiSection and WIKI-727K datasets, the model combining TSSP and CSSL is significantly superior to the baseline model and other auxiliary tasks.
- **Out-of-Domain Performance**: On the WIKI-50 and Elements datasets, the model also shows good generalization ability.
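
For readers unfamiliar with the $P_k$ metric reported above, the sketch below shows one common way to compute it, along with the relative reduction used in the abstract. It assumes the standard sliding-window definition with `k` defaulting to half the average reference segment length; the paper's exact evaluation script may differ, and the function names are illustrative.

```python
def p_k(reference, hypothesis, k=None):
    """P_k error for two segmentations given as per-sentence segment ids.

    A window of width k slides over the document; an error is counted
    whenever reference and hypothesis disagree on whether the two window
    ends fall in the same segment. Lower is better.
    """
    n = len(reference)
    if k is None:
        # Count contiguous segments in the reference.
        segments = 1 + sum(reference[i] != reference[i - 1] for i in range(1, n))
        k = max(1, round(n / segments / 2))
    if n - k <= 0:
        raise ValueError("document too short for window size k")
    errors = 0
    for i in range(n - k):
        same_in_ref = reference[i] == reference[i + k]
        same_in_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_in_ref != same_in_hyp
    return errors / (n - k)


def relative_reduction(old, new):
    """Relative drop from old to new, as a fraction of old."""
    return (old - new) / old


ref = [0, 0, 0, 1, 1, 1]                  # one boundary after sentence 3
perfect = p_k(ref, ref, k=2)              # identical segmentation
no_boundary = p_k(ref, [0] * 6, k=2)      # hypothesis misses the boundary
wiki727k_drop = relative_reduction(15.0, 13.89)
```

With the numbers from the abstract, the 1.11-point absolute drop on WIKI-727K corresponds to a relative reduction of about 7.4%, which helps when comparing it with the relative figures quoted for WikiSection and the out-of-domain datasets.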
In conclusion, by introducing two auxiliary tasks, TSSP and CSSL, this paper enhances the coherence modeling ability of supervised models from the perspectives of logical structure and semantic similarity, thereby achieving a significant performance improvement in topic segmentation of long documents.