Abstract: An effective text representation scheme largely determines the performance of a text categorization system. However, traditional schemes, which assume that terms are independent and rely on term frequency (TF) and document frequency (DF), capture too little information about a document and therefore perform poorly. To overcome this limitation, we investigate the relationships between different terms that share the same class tendency, and how to measure the importance of a term that is repeated within a document. We propose a group of novel term weighting factors that enhance the category contribution of each term. Then, based on a novel strategy for generating passages from a document, we present two schemes to improve text representation: the weighted co-contributions of different terms corresponding to the class tendency, and the weighted co-contributions of each term across different passages. The first scheme works in a dimensionality reduction mode, while the second runs in the conventional way. Experiments with a support vector machine (SVM) classifier on four benchmark corpora show that the proposed schemes consistently outperform conventional methods in both efficiency and accuracy. Further analysis also confirms some promising directions for future work.
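
For orientation, the conventional TF/DF-based weighting that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of one common TF×IDF variant, not the weighting factors proposed in the paper; the function name `tf_idf`, the toy documents, and the choice of raw counts with an unsmoothed logarithm are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by raw term frequency times log inverse document frequency.

    docs: list of token lists. Returns one {term: weight} dict per document.
    Hypothetical helper for illustration; each term is treated independently.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy corpus (assumed data, not taken from the benchmark corpora).
docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "dog"]]
weights = tf_idf(docs)
```

Each weight depends only on counts of that one term, so a term occurring in every document receives weight zero regardless of the classes involved. This is the term-independence assumption that the abstract identifies as a limitation.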

A comparative study on text representation schemes in text categorization

Efficient Representation of Text with Multiple Perspectives

Study on the Influences of Text Categorization Performance Based on Corpus Information Measurement

A Comparative Study on Feature Weight in Text Categorization

Experimental Study On Representing Units In Chinese Text Categorization

Efficient text representation via weighted co-contributions of terms on class tendency

Improving Text Categorization Using the Importance of Words in Different Categories

Aggressive Dimensionality Reduction With Reinforcement Local Feature Selection For Text Categorization

Text Categorization with Lee Model

A Text Categorization Method Based on Features Clustering

Text Representations for Text Categorization: A Case Study in Biomedical Domain

Term Selection and Weighting Approach Based on Key Words in Text Categorization

A Study On Feature Weighting In Chinese Text Categorization

A Comparative Study of Tf*Idf, Lsi and Multi-Words for Text Classification

Improved Comprehensive Measurement Feature Selection Method for Text Categorization

Text Categorization Based On Term Co-Occurrence Concept

A Novel Term Weighting Scheme with Distributional Coefficient for Text Categorization with Support Vector Machine

An Improved Feature Weighting Strategy in Chinese Text Categorization

Research on Algorithm of Text Feature Selection and Weighting Based on Category

Learning Effective Features for Chinese Text Categorization

Term Weighting Scheme with Enhanced Category Contribution for Text Categorization