A New Full-Text Indexing Model with Low Space Overhead for Chinese Text Retrieval

Shuigeng Zhou,Jihong Guan
DOI: https://doi.org/10.1007/s00799-004-0089-5
2004-01-01
International Journal on Digital Libraries
Abstract:Text retrieval systems require an index to allow efficient retrieval of documents at the cost of some storage overhead. This paper proposes a novel full-text indexing model for Chinese text retrieval based on the concept of adjacency matrix of directed graph. Using this indexing model, on one hand, retrieval systems need to keep only the indexing data, instead of the indexing data and the original text data as the traditional retrieval systems always do. On the other hand, occurrences of index term are identified by labels of the so-called s-strings where the index term appears, rather than by its positions as in traditional indexing models. Consequently, system space cost as a whole can be reduced drastically while retrieval efficiency is maintained satisfactory. Experiments over several real-world Chinese text collections are carried out to demonstrate the effectiveness and efficiency of this model. In addition to Chinese, The proposed indexing model is also effective and efficient for text retrieval of other Oriental languages, such as Japanese and Korean. It is especially useful for digital library application areas where storage resource is very limited (e.g., e-books and CD-based text retrieval systems).
What problem does this paper attempt to address?