Abstract: Chinese is an ideographic, character-based language, and the words in Chinese text are not delimited by spaces. Chinese documents therefore cannot be indexed without a proper segmentation algorithm. Many Chinese segmentation algorithms have been proposed, but traditional ones cannot operate without a large dictionary or a large corpus of training data. The Web has now become the largest available corpus and is ideal for Chinese segmentation. Although search engines do not segment texts into proper words, they maintain huge databases of documents and of the frequencies of character sequences in those documents. These databases are important potential resources for segmentation. In this paper, we propose a segmentation algorithm that mines web data with the help of search engines. It is the first unified segmentation algorithm for the Chinese language as used in different geographical areas. Experiments were conducted on the datasets of a recent Chinese segmentation competition. The results show that our algorithm outperforms traditional algorithms in terms of precision and recall. Moreover, it deals effectively with segmentation ambiguity, new word (unknown word) detection, and stop words.
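To illustrate how frequencies of character sequences could drive segmentation, here is a minimal sketch. It is not the algorithm proposed in the paper, whose details the abstract does not give. It assumes a plain dictionary (`TOY_COUNTS`, with invented numbers) stands in for the hit counts a search engine might report, and it uses a simple dynamic program to pick the split with the highest total log-frequency. The names `score` and `segment`, the `max_len` limit, and the penalty for unseen sequences are all hypothetical choices made for this example.

```python
import math

# Invented counts standing in for search-engine frequencies of
# character sequences (an assumption made for this illustration only).
TOY_COUNTS = {
    "研究": 900,
    "生命": 800,
    "起源": 700,
    "研究生": 300,
    "命": 50,
    "的": 5000,
}


def score(seq, counts, total):
    """Log relative frequency of a candidate word.

    Sequences with no recorded count get a heavy per-character penalty
    (a hypothetical choice) so they are used only as a last resort.
    """
    count = counts.get(seq, 0)
    if count == 0:
        return -20.0 * len(seq)
    return math.log(count / total)


def segment(text, counts, max_len=4):
    """Split text into the word sequence with the highest total score."""
    total = sum(counts.values())
    n = len(text)
    best = [0.0] + [float("-inf")] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)                # back[i]: start of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            candidate = best[start] + score(text[start:end], counts, total)
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


if __name__ == "__main__":
    print(segment("研究生命的起源", TOY_COUNTS))
```

The input "研究生命的起源" is ambiguous: it could begin with "研究" (research) or "研究生" (graduate student). With the toy counts above, the frequency-based score prefers the reading that begins with "研究", because the alternative leaves the rare single character "命" as a word.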

A No-Word-Segmentation Hierarchical Clustering Approach to Chinese Web Search Results

Density-Based Clustering Algorithm for Hybrid Coding Detection in Search Engines

A Phrase-Based Method For Hierarchical Clustering Of Web Snippets

Learning to Cluster Web Search Results

An online clustering algorithm for Chinese web snippets based on Generalized Suffix Array

Chinese Word Segmentation Evaluation Methodology Based on Web Search Engines

Hierarchically Classifying Chinese Web Documents Without Dictionary Support And Segmentation Procedure

Towards Unified Chinese Segmentation Algorithm

PCCS: A FAST CLUSTERING AND CLASSIFICATION METHOD FOR WEB DOCUMENT

Incorporate Web Search Technology to Solve Out-of-Vocabulary Words in Chinese Word Segmentation

Clustering Web Search Results Using Semantic Information

Hierarchical Clustering of WWW Image Search Results Using Visual, Textual and Link Information

On Combining Link and Contents Information for Web Page Clustering

Chinese Word Segmentation with Heterogeneous Graph Neural Network

C4-2: Combining Link and Contents in Clustering Web Search Results to Improve Information Interpretation

An efficient user-oriented clustering of web search results

Chinese Word Similarity Computing Based on Semantic Tree

Query Segmentation for Relevance Ranking in Web Search

Query Result Clustering For Object-Level Search

Suffix Tree Based Label Generation Method for Web Search Results Clustering

Web Documents Clustering with Interest Links