Domain Term Extraction Method Based on Hierarchical Combination Strategy for Chinese Web Documents
Yangyi Dong, Weihua Li, Hui Yu
DOI: https://doi.org/10.3969/j.issn.1000-2758.2017.04.026
2017-01-01
Abstract: Chinese domain term extraction is an important part of text knowledge mining. Traditionally, domain terms were extracted manually, which is time-consuming and laborious. Current automatic approaches to Chinese domain term extraction fall into three groups: dictionary-based, rule-based, and statistics-based methods. Because of the complexity of Chinese natural language, these automatic methods have limitations: domain-specific user dictionaries and rules are updated slowly, and text features are not taken into account, which leads to poor extraction performance. To address these problems, this paper presents a Chinese domain term extraction method that combines text features with statistics. After a coarse-grained screening of the Chinese words in a document, the method considers the part of speech, word length, and boundary text features of the candidate terms, constructs information entropy and TF-IDF statistics, and calculates a comprehensive weight. Candidates whose weight exceeds a set threshold are extracted as the final domain terms. The experimental results show that the method achieves good precision, recall, and F-measure on the test corpus.
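The pipeline outlined in the abstract (coarse screening, text features, information entropy, TF-IDF, combined weight, threshold) can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything specific here is an assumption: the part-of-speech scores in `POS_SCORE`, the linear combination with coefficients `alpha`, `beta`, `gamma`, the length cap of 4, the smoothed IDF, and the use of the smaller of the left and right neighbor entropies as the boundary feature. Word segmentation and part-of-speech tagging are assumed to have been done beforehand, so the input is a list of `(word, pos)` pairs.

```python
import math
from collections import Counter

# Hypothetical part-of-speech scores; the abstract does not list actual values.
POS_SCORE = {"n": 1.0, "v": 0.5}


def entropy(items):
    """Shannon entropy (natural log) of the empirical distribution of items."""
    total = len(items)
    if total == 0:
        return 0.0
    counts = Counter(items)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def tf_idf(term, doc, corpus):
    """Term frequency in doc times a smoothed inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0
    return tf * idf


def boundary_entropy(term, doc):
    """Smaller of the left and right neighbor entropies of term in doc."""
    left = [doc[i - 1] for i, t in enumerate(doc) if t == term and i > 0]
    right = [doc[i + 1] for i, t in enumerate(doc)
             if t == term and i + 1 < len(doc)]
    return min(entropy(left), entropy(right))


def term_weight(term, pos, doc, corpus, alpha=0.5, beta=0.3, gamma=0.2):
    """Assumed linear combination of statistics and text features."""
    # Text feature: part-of-speech score scaled by word length (capped at 4).
    feature = POS_SCORE.get(pos, 0.0) * min(len(term), 4) / 4
    return (alpha * tf_idf(term, doc, corpus)
            + beta * boundary_entropy(term, doc)
            + gamma * feature)


def extract_terms(tagged_doc, corpus, threshold):
    """Return candidate words whose combined weight exceeds threshold."""
    doc = [word for word, _ in tagged_doc]
    # Coarse screening: keep words with a scored part of speech and length >= 2.
    candidates = {(w, p) for w, p in tagged_doc
                  if p in POS_SCORE and len(w) >= 2}
    scored = {w: term_weight(w, p, doc, corpus) for w, p in candidates}
    return sorted(w for w, s in scored.items() if s > threshold)
```

A call such as `extract_terms(tagged_doc, corpus, threshold=0.3)` would return the surviving candidates in sorted order, where `corpus` is a list of token lists used only for the document-frequency count. The threshold value is illustrative and would need tuning on a labeled corpus.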