MCER: A Multi-domain Dataset for Sentence-Level Chinese Ellipsis Resolution

Qi Jialu, Shao Yanqiu, Li Wei, Shen Zizhuo
DOI: https://doi.org/10.1007/978-3-031-17120-8_3
2022-01-01
Abstract: Ellipsis is a cross-linguistic phenomenon that is common in Chinese. Omitting elements of a sentence that can be recovered from context poses no difficulty for human readers, but it is a major challenge for machines in natural language understanding. To promote ellipsis-related research on Chinese, we propose an application-oriented definition of ellipsis tailored to Chinese natural language processing. We also build and release a multi-domain dataset for sentence-level Chinese ellipsis resolution that follows this definition. In addition, we define a new task, sentence-level Chinese ellipsis resolution, and model it as two subprocedures: 1) elliptic position detection; 2) ellipsis resolution. We propose several baseline methods based on pre-trained language models, as these have obtained state-of-the-art results on related tasks. To our knowledge, this is the first study that applies the extractive method for question answering to Chinese ellipsis resolution. The experimental results show that it is possible for machines to understand ellipsis under our new definition.
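As a rough illustration of the two subprocedures named in the abstract, the sketch below shows how an extractive, question-answering-style decoder could pick the elided span from the sentence itself and restore it at a detected position. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `best_span` and `restore`, the example sentence, the hard-coded elliptic position, and the hand-written scores (standing in for the start and end scores a pre-trained language model would produce) are all hypothetical.

```python
from typing import List, Tuple


def best_span(start_scores: List[float],
              end_scores: List[float],
              max_len: int = 10) -> Tuple[int, int]:
    """Pick the inclusive (start, end) token span with the highest
    start_scores[start] + end_scores[end], subject to start <= end
    and a span length of at most max_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


def restore(tokens: List[str], position: int, span: Tuple[int, int]) -> List[str]:
    """Insert a copy of tokens[start..end] before index `position`."""
    s, e = span
    return tokens[:position] + tokens[s:e + 1] + tokens[position:]


# Invented example: the subject of the second clause is elided.
tokens = ["小明", "去", "了", "商店", "，", "买", "了", "牛奶", "。"]

# Step 1 (elliptic position detection) is assumed to have returned index 5,
# i.e. something is missing just before "买". Hard-coded here.
elliptic_position = 5

# Step 2 (ellipsis resolution): hypothetical per-token scores that a
# model might assign to each token being the start or end of the elided span.
start_scores = [5.0, 0.1, 0.0, 1.0, 0.0, 0.0, 0.0, 0.5, 0.0]
end_scores = [4.0, 0.0, 0.2, 1.5, 0.0, 0.0, 0.0, 0.3, 0.0]

span = best_span(start_scores, end_scores)
restored = restore(tokens, elliptic_position, span)
print("".join(restored))  # prints 小明去了商店，小明买了牛奶。
```

Treating resolution as span extraction restricts the recovered content to text that already appears in the sentence, which fits a sentence-level setting; whether the paper's baselines decode spans in exactly this way is not stated in the abstract.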