Advancing Topic Segmentation and Outline Generation in Chinese Texts: The Paragraph-level Topic Representation, Corpus, and Benchmark

Feng Jiang, Weihao Liu, Xiaomin Chu, Peifeng Li, Qiaoming Zhu, Haizhou Li
DOI: https://doi.org/10.48550/arXiv.2305.14790
2024-03-26
Abstract: Topic segmentation and outline generation strive to divide a document into coherent topic sections and generate corresponding subheadings, unveiling the discourse topic structure of a document. Compared with sentence-level topic structure, the paragraph-level topic structure can quickly grasp and understand the overall context of the document from a higher level, benefitting many downstream tasks such as summarization, discourse parsing, and information retrieval. However, the lack of large-scale, high-quality Chinese paragraph-level topic structure corpora restrained relative research and applications. To fill this gap, we build the Chinese paragraph-level topic representation, corpus, and benchmark in this paper. Firstly, we propose a hierarchical paragraph-level topic structure representation with three layers to guide the corpus construction. Then, we employ a two-stage man-machine collaborative annotation method to construct the largest Chinese Paragraph-level Topic Structure corpus (CPTS), achieving high quality. We also build several strong baselines, including ChatGPT, to validate the computability of CPTS on two fundamental tasks (topic segmentation and outline generation) and preliminarily verified its usefulness for the downstream task (discourse parsing).
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses three main problems:

1. **Lack of high-quality Chinese paragraph-level topic structure corpora**: Compared with sentence-level topic structure, Chinese paragraph-level topic structure has received little study, and no large-scale, high-quality corpus exists. This restricts related research and applications.
2. **Limited expressiveness of existing paragraph-level representations**: Most existing representations of paragraph-level topic structure follow the sentence-level approach and label topic content with keywords or phrases, which cannot fully express the rich topic information found at the paragraph level.
3. **Difficulty of building a large-scale, high-quality corpus**: Manual construction is very time-consuming and easily influenced by subjective factors. Automatic extraction can produce a large corpus, but it is difficult to ensure semantic accuracy.

To solve these problems, the authors propose the following:

- **A hierarchical paragraph-level topic structure representation**: The representation has three levels and covers not only paragraph boundaries and topic boundaries but also subheadings and document titles. Topic content is represented with complete sentences or clauses to preserve information richness.
- **A two-stage human-machine collaborative annotation method**: In the first stage, a heuristic automatic extraction method produces an initial topic structure. In the second stage, human annotators verify the extraction results to ensure corpus quality.
- **A large-scale Chinese Paragraph-level Topic Structure corpus (CPTS)**: The corpus contains approximately 14,393 documents. The two-stage annotation yields high quality (94.79% inter-annotator agreement, a Kappa value of 0.849).
- **Validation of the computability of CPTS on fundamental tasks**: Several strong baseline models (including ChatGPT) are built to validate CPTS on two fundamental tasks, topic segmentation and outline generation, and its usefulness for a downstream task (discourse parsing) is preliminarily verified.

In summary, the paper aims to fill the gap in Chinese paragraph-level topic structure research, provide a high-quality corpus, and promote research in related fields.
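
To make the three-layer representation and the two tasks more concrete, here is a minimal Python sketch. It is only one plausible reading of the summary above, not the CPTS file format or the authors' code: the class names `TopicSection` and `Document`, the method names, and the sample text are all hypothetical. The sketch models a document title, subheadings written as full clauses, and the paragraphs under each subheading, then derives the per-paragraph boundary labels a topic segmentation model might predict and the subheading list an outline generation model might produce.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicSection:
    """One topic: a subheading (hypothetical field name) plus its paragraphs."""
    subheading: str  # a full sentence or clause rather than a keyword
    paragraphs: List[str]


@dataclass
class Document:
    """Assumed three layers: title -> topic sections -> paragraphs."""
    title: str
    sections: List[TopicSection] = field(default_factory=list)

    def paragraphs(self) -> List[str]:
        """Flatten the document into its paragraph sequence."""
        return [p for s in self.sections for p in s.paragraphs]

    def boundary_labels(self) -> List[int]:
        """Return 1 for the last paragraph of each topic section, else 0."""
        labels: List[int] = []
        for s in self.sections:
            if not s.paragraphs:
                continue  # an empty section contributes no paragraphs
            labels.extend([0] * (len(s.paragraphs) - 1) + [1])
        return labels

    def outline(self) -> List[str]:
        """Return the subheadings in document order."""
        return [s.subheading for s in self.sections]


# Illustrative sample only; not taken from the corpus.
doc = Document(
    title="City opens new metro line",
    sections=[
        TopicSection("The line begins service this week", ["p1", "p2"]),
        TopicSection("Residents expect shorter commutes", ["p3", "p4", "p5"]),
    ],
)

print(doc.boundary_labels())
print(doc.outline())
```

Under this framing, topic segmentation reduces to labeling each paragraph as a section end or not, and outline generation reduces to producing one subheading per predicted section. Other encodings (for example, marking the first paragraph of each section) would work equally well.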