ON AN IMPROVED NAVE BAYESIAN KEYWORD EXTRACTION ALGORITHM

Jinbo Wang, Lianzhi Wang, Wanlin Gao, Jian Yu
DOI: https://doi.org/10.3969/j.issn.1000-386x.2014.02.047
2014-01-01
Abstract: To improve keyword extraction accuracy, we propose a naive Bayesian keyword extraction algorithm based on improved statistical features of words and phrases. Compound words are first recognised from the co-occurrence frequency of the words that precede and follow identical words in the text. The algorithm takes the word length, part of speech, position and TF-IDF value of each word or phrase as its features, and refines how word length, TF-IDF and word frequency are computed so that longer terms with higher TF-IDF values receive a higher probability. When counting word frequency, it accounts for containment relationships between terms, that is, one term containing or being contained in another. A naive Bayesian model is then trained on texts with annotated keywords to estimate the occurrence probability of each feature, and these probabilities are used to extract keywords from a text. In the experiments, the keywords extracted by the proposed algorithm show higher precision and readability than those produced by traditional word-frequency-based and C4.5 decision-tree-based keyword extraction algorithms.
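The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each candidate term has already been reduced to four discrete features (length, part of speech, position, TF-IDF), and the bucket labels ("long", "high", "title" and so on), the function names and the Laplace smoothing constant `alpha` are all hypothetical choices made for illustration. The compound recognition and the containment-aware frequency counting from the paper are not modelled here.

```python
import math
from collections import Counter, defaultdict


def train(samples, alpha=1.0):
    """Count class and feature-value frequencies from labelled candidates.

    samples: list of (features, is_keyword) pairs, where features is a dict
    such as {"length": "long", "pos": "noun", ...}. Both classes are assumed
    to appear at least once. alpha is an assumed Laplace smoothing constant.
    """
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    feature_values = defaultdict(set)      # feature name -> values seen in training
    for features, label in samples:
        class_counts[label] += 1
        for name, value in features.items():
            feature_counts[(label, name)][value] += 1
            feature_values[name].add(value)
    return class_counts, feature_counts, feature_values, alpha


def log_score(model, features, label):
    """Log prior plus the sum of smoothed log likelihoods for one class."""
    class_counts, feature_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    score = math.log(class_counts[label] / total)
    for name, value in features.items():
        count = feature_counts[(label, name)][value]
        denom = class_counts[label] + alpha * len(feature_values[name])
        score += math.log((count + alpha) / denom)
    return score


def keyword_probability(model, features):
    """Posterior probability that a candidate is a keyword."""
    pos = log_score(model, features, True)
    neg = log_score(model, features, False)
    top = max(pos, neg)  # subtract the max before exponentiating, for stability
    p, n = math.exp(pos - top), math.exp(neg - top)
    return p / (p + n)


# Toy training data (invented for illustration, not taken from the paper).
samples = [
    ({"length": "long", "pos": "noun", "position": "title", "tfidf": "high"}, True),
    ({"length": "long", "pos": "noun", "position": "body", "tfidf": "high"}, True),
    ({"length": "short", "pos": "verb", "position": "body", "tfidf": "low"}, False),
    ({"length": "short", "pos": "noun", "position": "body", "tfidf": "low"}, False),
]
model = train(samples)
p_keyword = keyword_probability(
    model, {"length": "long", "pos": "noun", "position": "title", "tfidf": "high"}
)
p_other = keyword_probability(
    model, {"length": "short", "pos": "verb", "position": "body", "tfidf": "low"}
)
```

In this sketch, candidates would be ranked by `keyword_probability` and the top-scoring terms kept as keywords. Bucketing the continuous features is one simple way to favour longer terms with higher TF-IDF values, though the paper may weight them differently.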