Keyword Extraction Based on Statistical Information for Cyrillic Mongolian Script

Bat-Erdene Nyandag, Ru Li, Orgil Demberel
DOI: https://doi.org/10.1109/ccdc.2017.7978889
2017-01-01
Abstract: We present a keyword extraction system for Mongolian documents based on word co-occurrence statistical information, an approach that has been used for English, Chinese, and other languages. The method extracts the most frequent words and builds a co-occurrence matrix that records how often each word co-occurs with each frequent word. The degree of bias between each word and the set of frequent words is measured using the chi-square (χ²) statistic. In addition, the weights of the words and of the frequent words are measured using word frequency-inverse word frequency (WF-IWF). Words with high χ² values and high WF-IWF values are therefore likely to be keywords. The χ² method adopted in this study is compared with a WF-IWF-based method that has been tested for Mongolian. Two different documents were used to evaluate system performance and to compare the effectiveness of the two methods. The results show that the χ² method performs better than WF-IWF.
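The two scoring schemes named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything here is an assumption: the χ² score uses one common co-occurrence formulation, in which the observed sentence-level co-occurrence of a word with each frequent word is compared against an expected value derived from sentence lengths, and WF-IWF is taken to be a word's relative frequency in the document multiplied by the log of the corpus size over the word's corpus count. The function names, the `top_n` parameter, and the toy tokens are hypothetical, and tokenization of Cyrillic Mongolian text is assumed to have been done beforehand.

```python
import math
from collections import Counter


def chi_square_scores(sentences, top_n=10):
    """Score each word by how biased its co-occurrence with frequent words is.

    sentences: list of token lists (one list per sentence).
    Assumed formulation: chi2(w) = sum over frequent g of
    (cooc(w, g) - n_w * p_g)^2 / (n_w * p_g).
    """
    freq = Counter(w for s in sentences for w in s)
    frequent = [w for w, _ in freq.most_common(top_n)]
    frequent_set = set(frequent)
    total_terms = sum(len(s) for s in sentences)

    cooc = {w: Counter() for w in freq}  # cooc[w][g]: sentences containing both
    n = Counter()                        # n[w]: terms in sentences containing w
    for s in sentences:
        unique = set(s)
        for w in unique:
            n[w] += len(s)
            for g in unique & frequent_set:
                if g != w:
                    cooc[w][g] += 1

    # p[g]: expected share of co-occurrence with frequent word g
    p = {g: n[g] / total_terms for g in frequent}

    scores = {}
    for w in freq:
        chi2 = 0.0
        for g in frequent:
            if g == w:
                continue
            expected = n[w] * p[g]
            chi2 += (cooc[w][g] - expected) ** 2 / expected
        scores[w] = chi2
    return scores


def wf_iwf_scores(doc_tokens, corpus_tokens):
    """Assumed WF-IWF: (count in doc / doc length) * log(corpus size / corpus count)."""
    doc_freq = Counter(doc_tokens)
    corpus_freq = Counter(corpus_tokens)
    doc_len = len(doc_tokens)
    corpus_len = len(corpus_tokens)
    return {
        w: (c / doc_len) * math.log(corpus_len / corpus_freq[w])
        for w, c in doc_freq.items()
        if corpus_freq[w] > 0  # skip words missing from the reference corpus
    }


if __name__ == "__main__":
    # Toy, already-tokenized sentences (hypothetical data).
    toy = [["a", "b", "c"], ["a", "b"], ["a", "d"]]
    ranked = sorted(chi_square_scores(toy, top_n=2).items(),
                    key=lambda kv: kv[1], reverse=True)
    print(ranked)
```

Under either scheme, candidate keywords would be the highest-scoring words. The sketch leaves out stop-word filtering, stemming, and clustering of frequent words, any of which a full system might apply.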