A Chinese-Uighur Comparable Corpus

Feng Tao, Li Miao, Cao Yichao, Zeng Weihui
DOI: https://doi.org/10.11922/csdata.2019.0010.zh
2020-01-01
Abstract: Corpus construction is an important task in natural language processing. However, existing parallel corpora are too small to meet practical needs, especially in Uighur information processing. Mining Chinese-Uighur corpora from the Internet therefore plays an important role in building Chinese-Uighur bilingual resources and in promoting ethnic exchanges. Given the complexity of Uighur and the large differences in form between Chinese and Uighur, this paper designs a Chinese-Uighur comparable corpus mining system. The system has three main components: web content extraction, acquisition of candidate comparable corpora, and cross-language similarity calculation. The corpus currently contains more than 5,000 comparable Chinese and Uighur texts, mainly news and government documents. It is useful for the analysis and teaching of minority languages and for Chinese-Uighur machine translation. For ease of use, the Chinese and Uighur texts in this data set have been further processed and normalized.
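The abstract does not say how the cross-language similarity calculation is carried out. As an illustration only, the sketch below shows one common approach: map source-language tokens into the target language with a bilingual lexicon, then score document pairs by cosine similarity over token counts. The function names, the toy lexicon, the placeholder tokens and the 0.5 threshold are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def translate_tokens(tokens, lexicon):
    """Map source tokens to the target language; tokens missing from the lexicon are dropped."""
    return [lexicon[t] for t in tokens if t in lexicon]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mine_comparable_pairs(src_docs, tgt_docs, lexicon, threshold=0.5):
    """Return (src_id, tgt_id, score) for pairs scoring at or above the threshold.

    src_docs and tgt_docs map document ids to token lists. The threshold is a
    hypothetical value chosen for this sketch, not one reported by the authors.
    """
    pairs = []
    for src_id, src_tokens in src_docs.items():
        translated = translate_tokens(src_tokens, lexicon)
        for tgt_id, tgt_tokens in tgt_docs.items():
            score = cosine_similarity(translated, tgt_tokens)
            if score >= threshold:
                pairs.append((src_id, tgt_id, score))
    # Highest-scoring candidate pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    # Placeholder tokens stand in for real Uighur and Chinese words.
    toy_lexicon = {"ug_a": "zh_a", "ug_b": "zh_b"}
    uighur_docs = {"u1": ["ug_a", "ug_b", "ug_x"]}
    chinese_docs = {"z1": ["zh_a", "zh_b"], "z2": ["zh_c"]}
    for pair in mine_comparable_pairs(uighur_docs, chinese_docs, toy_lexicon):
        print(pair)
```

A practical system would also need tokenization suited to each language and a much larger lexicon; those steps are omitted here.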