Chinese Information Retrieval: Using Characters Or Words?

Jian-yun Nie, Fuji Ren
DOI: https://doi.org/10.1016/S0306-4573(98)00051-X
IF: 7.466
1999-01-01
Information Processing & Management
Abstract: Several experimental studies have been conducted to compare words and n-grams with respect to their performance in Chinese Information Retrieval (IR). These studies claim that n-grams (in particular bigrams) perform as well as, or even better than, words. In this paper, we propose a relaxed segmentation process for Chinese which extracts not only the longest words, but also all the shorter words they contain. Special rules are also designed to recognize and normalize special words such as proper names and nominal pre-determiners. Our experiments show that IR based on this segmentation gives a slightly higher effectiveness than bigrams. In addition, it requires less time and space for document and query processing. We also tested combining words with bigrams in IR and using top-ranked documents for query expansion. These techniques proved to be effective. © 1999 Elsevier Science Ltd. All rights reserved.
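The two indexing units the abstract compares can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a greedy forward longest-match segmenter over a small hypothetical lexicon, then adds every shorter lexicon word contained in each longest word, and separately produces overlapping character bigrams. The function names, the `max_len` parameter, and the sample lexicon are all assumptions made for illustration, and the paper's special rules for proper names and nominal pre-determiners are not modeled.

```python
def longest_match_segment(text, lexicon, max_len=4):
    """Greedy forward longest matching (an assumed baseline segmenter).

    Characters not covered by any lexicon word become one-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def relaxed_terms(text, lexicon, max_len=4):
    """Index terms: each longest word plus every shorter lexicon word inside it."""
    terms = []
    for word in longest_match_segment(text, lexicon, max_len):
        terms.append(word)
        for start in range(len(word)):
            for end in range(start + 1, len(word) + 1):
                sub = word[start:end]
                if sub != word and sub in lexicon:
                    terms.append(sub)
    return terms


def char_bigrams(text):
    """Overlapping character bigrams, the n-gram alternative to words."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Hypothetical lexicon and sample string, for illustration only.
lexicon = {"中文", "信息", "检索", "信息检索"}
text = "中文信息检索"
word_terms = relaxed_terms(text, lexicon)
bigram_terms = char_bigrams(text)
```

In this toy case the word-based index holds four terms, including both the compound 信息检索 and its parts 信息 and 检索, while the bigram index holds five terms, two of which (文信, 息检) straddle word boundaries and are not words. This hints at why a word-based index might need less space, although the actual savings reported in the paper depend on its own lexicon and collection.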