Abstract: With the growing availability of digitized text data, both public and private, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than those obtained using outputs of a supervised segmentation method.

Research on Chinese Keywords Extraction Based on Characters Sequence Annotation

Research on Keywords Indexing for Chinese Bibliography Based on Word Roles Annotation

Exploring Simultaneous Keyword and Key Sentence Extraction

Chinese Keyword Extraction Based on N-Gram and Word Co-occurrence

Chinese Keyword Extraction Based on Word Platform

Chinese Keyword Extraction Algorithm Based on Neighbour Words

Researches on Full Text Retrieval

An Application Of Lexical Semantics Of Chinese In Webpage Keyword Extraction Algorithm

News-oriented Automatic Chinese Keyword Indexing

On the Unsupervised Analysis of Domain-Specific Chinese Texts

Statistical Analyses on Chinese Ancient Books for Information Retrieval

Metadata Extraction System for Chinese Books

Automatic Keyword Extraction Based on Phrase Network

Automatic keyphrase extraction from Chinese news documents

Empirical Study on Character Level Neural Network Classifier for Chinese Text

SemBRS: A Semantic Analysis Based Book Retrieval Approach

Chinese Name Entity Extraction System Based on a Hybrid Model

Keyword Extraction Based on Tf/idf for Chinese News Document

A Multi-oriented Chinese Keyword Spotter Guided by Text Line Detection

Statistical Learning and Analyses of Chinese Ancient Books for Information Retrieval

Word extraction based on semantic constraints in Chinese word-formation