NLPCC 2016 Shared Task Chinese Words Similarity Measure via Ensemble Learning based on Multiple Resources

Shutian Ma, Xiaoyong Zhang, Chengzhi Zhang
DOI: https://doi.org/10.1007/978-3-319-50496-4_79
2016-01-01
Abstract: Many algorithms for measuring Chinese word similarity have been proposed, since word similarity is a fundamental issue in many natural language processing tasks. Previous work has focused mainly on existing semantic knowledge bases or large-scale corpora. However, both knowledge bases and corpora are limited in coverage and in how quickly they are updated. Ensemble learning is therefore used to improve performance by combining similarities from several sources. This paper describes a Chinese word similarity measure that uses ensemble learning over knowledge-based and corpus-based algorithms. Specifically, the knowledge-based methods rely on TYCCL and HowNet. The two corpus-based methods compute similarities by retrieving results from web search engines and by deep learning on large-scale corpora (news and microblog). All similarities are combined through support vector regression to obtain the final similarity. Evaluation on the test dataset suggests that the TYCCL-based method performs best. However, with appropriately tuned parameters, ensemble learning can outperform all the other algorithms. In addition, deep learning on news corpora performs better than the other corpus-based methods.
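The combination step described in the abstract can be illustrated with a small sketch. The abstract does not give the kernel, the parameters, or the feature layout, so everything below is an assumption made for illustration: the feature values and gold ratings are invented, the names `train_linear_svr` and `predict` are hypothetical, and a linear support vector regressor trained by stochastic subgradient descent stands in for whatever SVR solver the authors actually used.

```python
# Hypothetical per-pair similarities in [0, 1], one column per resource:
# [TYCCL, HowNet, web search, news embeddings, microblog embeddings].
FEATURES = [
    [0.95, 0.90, 0.80, 0.85, 0.75],
    [0.10, 0.15, 0.20, 0.10, 0.25],
    [0.60, 0.55, 0.50, 0.65, 0.45],
    [0.30, 0.35, 0.40, 0.25, 0.30],
]
# Hypothetical human similarity ratings, rescaled to [0, 1].
GOLD = [0.85, 0.15, 0.55, 0.30]


def train_linear_svr(features, targets, epsilon=0.05, reg=1e-3,
                     lr=0.005, epochs=2000):
    """Fit a linear SVR by subgradient descent on the regularized
    epsilon-insensitive loss (an assumed stand-in for a kernel SVR)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, targets):
            error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            # Subgradient of the loss: zero inside the epsilon tube,
            # +1 or -1 outside it, depending on the side of the error.
            if error > epsilon:
                sign = 1.0
            elif error < -epsilon:
                sign = -1.0
            else:
                sign = 0.0
            # The reg * w term is the L2 penalty on the weights.
            weights = [w - lr * (reg * w + sign * xi)
                       for w, xi in zip(weights, x)]
            bias -= lr * sign
    return weights, bias


def predict(model, x):
    """Combine the individual similarities into one final score."""
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias


MODEL = train_linear_svr(FEATURES, GOLD)
high = predict(MODEL, FEATURES[0])  # pair rated as very similar
low = predict(MODEL, FEATURES[1])   # pair rated as dissimilar
print(round(high, 3), round(low, 3))
```

In this sketch each individual method contributes one feature, and the regressor learns how much weight to give each resource. The epsilon tube width, regularization strength, and learning rate are the kind of parameters that would need tuning, which is consistent with the abstract's remark that the ensemble only wins when its parameters are set appropriately.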