CStory: A Chinese Large-scale News Storyline Dataset.
Kaijie Shi, Xiaozhi Wang, Jifan Yu, Lei Hou, Juanzi Li, Jingtong Wu, Dingyu Yong, Jinghui Xiao, Qun Liu
DOI: https://doi.org/10.1145/3511808.3557573
2022-01-01
Abstract: In today's massive news streams, storylines help readers discover related event pairs and understand how hot events evolve. Many efforts have therefore been devoted to constructing news storylines automatically. However, the development of these methods is strongly limited by the size and quality of existing storyline datasets: news storylines are expensive to annotate because the number of unlabeled relationships grows quadratically with the number of news events. To work around this difficulty, we propose a pre-processing method that filters candidate news pairs by entity co-occurrence and semantic similarity. With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset that contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences supporting the annotation judgments. We conduct extensive experiments on CStory with various algorithms and find that constructing news storylines is challenging even for pre-trained language models. Empirical analysis shows that sample imbalance significantly affects model performance, which should be a focus of future work. Our dataset is publicly available at https://github.com/THU-KEG/CStory.
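
The candidate-pair filter mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the entity extractor, the similarity model, or any thresholds. The function name `filter_candidate_pairs`, the dictionary keys `entities` and `tokens`, the threshold values, and the use of bag-of-words cosine similarity as a stand-in for semantic similarity are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_candidate_pairs(articles, min_shared_entities=1, min_similarity=0.3):
    """Return index pairs (i, j), i < j, that pass both filters.

    `articles` is a list of dicts with hypothetical keys:
      "entities": set of entity strings, "tokens": list of word tokens.
    Both thresholds are illustrative, not values from the paper.
    """
    vectors = [Counter(a["tokens"]) for a in articles]
    kept = []
    # All n * (n - 1) / 2 unordered pairs are enumerated here; the filters
    # decide which of them would be passed on for manual annotation.
    for i, j in combinations(range(len(articles)), 2):
        shared = articles[i]["entities"] & articles[j]["entities"]
        if len(shared) < min_shared_entities:
            continue  # entity co-occurrence filter
        if cosine(vectors[i], vectors[j]) >= min_similarity:
            kept.append((i, j))  # similarity filter (bag-of-words stand-in)
    return kept


# Toy data, invented for this example only.
articles = [
    {"entities": {"A", "B"}, "tokens": ["a", "b", "merger", "talks"]},
    {"entities": {"A"}, "tokens": ["a", "merger", "approved"]},
    {"entities": {"C"}, "tokens": ["storm", "coast"]},
]
pairs = filter_candidate_pairs(articles)
```

With three toy articles there are three possible pairs, and only the pair sharing an entity and enough vocabulary survives. In a realistic setting the bag-of-words vectors would presumably be replaced by sentence embeddings, and the thresholds would be tuned against annotation budget.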