Query Based Chinese Phrase Extraction for Site Search
Jingfang Xu, Shaozhi Ye, Xing Li
DOI: https://doi.org/10.1007/978-3-540-30480-7_14
2004-01-01
Abstract: Word segmentation (WS) is one of the major issues of information processing in character-based languages, because these languages have no explicit word boundaries. Moreover, a phrase, that is, a combination of multiple continuous words, is usually the minimum meaningful unit. Although much work has been done on WS, little has been explored in site web search to mine site-specific knowledge from user query logs for both more accurate WS and better retrieval performance. This paper proposes a novel, statistics-based method to extract phrases from user query logs. The extracted phrases, combined with a general, static dictionary, form a dynamic, site-specific dictionary. According to this dictionary, web documents are segmented into phrases and words, which are kept as separate index terms to build a phrase-enhanced index for site search. The experimental results show that our approach greatly improves retrieval performance. It also helps to detect many out-of-vocabulary words, such as site-specific phrases, newly created words, and names of people and locations, which are difficult to process with a general, static dictionary.
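The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not state which statistic the authors use, how they segment text, or how they index phrases, so everything below is an assumption for illustration: a plain co-occurrence count with a hypothetical `min_count` threshold stands in for the paper's statistical method, greedy forward maximum matching stands in for its segmenter, and the function names (`forward_max_match`, `extract_phrases`, `index_terms`) are invented here.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len=8):
    """Greedy longest-match segmentation (assumed segmenter, not the paper's).

    Characters not covered by any dictionary entry become single-character tokens.
    """
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if j - i == 1 or text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def extract_phrases(queries, base_dict, min_count=2):
    """Collect adjacent word pairs that recur in the query log.

    The raw-count threshold is a placeholder for the paper's statistic.
    """
    counts = Counter()
    for query in queries:
        words = forward_max_match(query, base_dict)
        for left, right in zip(words, words[1:]):
            counts[left + right] += 1
    return {phrase for phrase, n in counts.items() if n >= min_count}


def index_terms(doc, base_dict, phrases):
    """Segment with the site-specific dictionary; emit phrases and their words."""
    terms = []
    for token in forward_max_match(doc, base_dict | phrases):
        terms.append(token)
        if token in phrases:
            # Assumed design: also index the words that make up the phrase.
            terms.extend(forward_max_match(token, base_dict))
    return terms


# Toy data, invented for this example.
base_dict = {"清华", "大学", "图书馆", "招生"}
query_log = ["清华大学", "清华大学招生", "大学图书馆"]

phrases = extract_phrases(query_log, base_dict)
print(phrases)
print(index_terms("清华大学图书馆", base_dict, phrases))
```

On this toy log, only "清华大学" occurs as an adjacent pair at least twice, so it is the only extracted phrase, and the sample document is indexed under the phrase as well as its two component words. A real query log would need a stronger association measure than a raw count to avoid promoting pairs of merely frequent words.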