CNA: A Dataset for Parsing Discourse Structure on Chinese News Articles.
Zhenliang Guo, Zhen Huang, Yong Dou, Xiubin Yu, Sijie Wang, Zhongwu Chen, Xinxin Su, Xiaohang Liu
DOI: https://doi.org/10.1109/ictai56018.2022.00151
2022-01-01
Abstract: Discourse structure analysis has been shown to be useful for many artificial intelligence (AI) tasks, such as text summarization and text categorization. However, for the Chinese news domain, discourse structure analysis systems are still immature because expert-annotated datasets are lacking. In this paper, we present CNA, a Chinese news corpus containing 1155 news articles annotated by human experts, covering four domains and four news media sources. We then implement several text classification methods as baselines. Experimental results demonstrate that document-level methods achieve better performance, and we further propose a document-level neural network model with multiple sentence features, which achieves state-of-the-art performance. Finally, we analyze the content type distribution of sentences in CNA and the prediction errors our model makes on the test set. The code and dataset will be open-sourced at https://github.com/gzl98/Chinese Discourse Profiling.