Abstract: Nowadays, the volume of text documents on the internet, in e-mail, and on web pages is growing rapidly, and these documents are stored in electronic databases. This makes them difficult to organize and browse. To address this problem, document preprocessing, term selection, attribute reduction, and preserving the relationships between important terms using background knowledge (WordNet) become important steps in data mining. This paper proceeds in stages. First, documents are preprocessed: stop words are removed, stemming is performed with the Porter stemmer algorithm, the WordNet thesaurus is applied to maintain relationships between important terms, and global unique words and frequent word sets are generated. Second, a data matrix is formed. Third, terms are extracted from the documents using the term selection approaches tf-idf, tf-df, and tf2, based on their minimum threshold values. In addition, the terms of every document are preprocessed, and the frequency of each term within the document is counted for representation. The purpose of this approach is to reduce the number of attributes and to find the most effective term selection method using WordNet for better clustering accuracy. Experiments are evaluated on Reuters Transcription Subsets (wheat, trade, money, grain, and ship), Reuters 21578, Classic 30, 20 Newsgroups (atheism), 20 Newsgroups (Hardware), 20 Newsgroups (Computer Graphics), etc.
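
The stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word list is a tiny placeholder, `crude_stem` is a hypothetical stand-in for the Porter stemmer, the WordNet step is omitted, and the tf-idf weight is assumed to be the common form tf × log(N / df), which the abstract does not specify. The tf-df and tf2 variants are left out because their definitions are not given here.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on"}


def crude_stem(word):
    # Hypothetical placeholder for the Porter stemmer:
    # strips a few common suffixes only.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Stage 1: lowercase, tokenize, drop stop words, stem.
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tf_idf_matrix(docs):
    # Stage 2: build a document-by-term matrix of tf-idf weights.
    processed = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in processed for t in doc})  # global unique words
    n = len(processed)
    df = Counter(t for doc in processed for t in set(doc))  # document frequency
    matrix = []
    for doc in processed:
        tf = Counter(doc)  # term frequency within this document
        matrix.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, matrix


def select_terms(vocab, matrix, min_score):
    # Stage 3: keep terms whose best tf-idf weight reaches the threshold.
    return [
        t for j, t in enumerate(vocab)
        if max(row[j] for row in matrix) >= min_score
    ]


docs = [
    "wheat prices are rising in the grain market",
    "ships carry grain and wheat",
    "trade in money markets",
]
vocab, matrix = tf_idf_matrix(docs)
selected = select_terms(vocab, matrix, min_score=1.0)
print(selected)
```

In this toy corpus, terms that occur in two of the three documents receive a low weight and fall below the threshold, which is the attribute-reduction effect the abstract aims for. The threshold value of 1.0 is arbitrary and chosen only for illustration.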

Segmented document classification: problem and solution

Web Information Segmentation Method Based on DOM Structure Tree

CNN Based Page Object Detection in Document Images

Hierarchically Classifying Chinese Web Documents Without Dictionary Support And Segmentation Procedure

Document Image Segmentation Using Gabor Wavelet and Kernel-based Methods

Chinese Document Categorization without Dictionary Support and Segmentation Processing

Self-Switching Classification Framework for Titled Documents

Page Segmentation of Chinese Newspapers

Multi-documents Automatic Abstracting Based on Text Clustering and Semantic Analysis

A Framework for Titled Document Categorization with Modified Multinomial Naivebayes Classifier

A CHINESE DOCUMENT CATEGORIZATION SYSTEM WITHOUT DICTIONARY SUPPORT AND SEGMENTATION PROCESSING

Word Segmentation for Chinese Judicial Documents

Chinese Documents Classification Based on N-Grams

Towards Unified Chinese Segmentation Algorithm

Chinese Documents Categorization Based on N-gram Information

Using multiple discriminant analysis approach for linear text segmentation

A hybrid Chinese word segmentation model for quality management-related texts based on transfer learning

A Semantic approach for effective document clustering using WordNet

Long Text Classification with Segmentation

Research of Chinese Word Segmentation on Medical Documents

Domain-Aware Word Segmentation for Chinese Language: A Document-Level Context-Aware Model