Abstract: Many approaches to automatic classification begin with a prescribed set of features. For Chinese aspect classification, the features are normally prescribed as several integrated linguistic feature sets involving temporal, lexical aspectual, or grammatical features. The number of features often grows gradually as designers refine the conditions for classification, until the features eventually have to be optimized to eliminate those that are useless or contradictory. The features for Chinese aspect classification are difficult to optimize because they are discrete, unlike those in other classification tasks. This study proposes a model-based approach to optimizing the features for Chinese aspect classification, illustrated with ZHE aspect markers, by estimating, processing, and testing the correlations between the features. As preparation for building the model, dummy variables are first adopted to represent the discrete Chinese ZHE aspect features. The correlations among the features are then estimated with contingency tables. Highly correlated variables are further combined using Principal Component Analysis. Finally, the performance of the original and the optimized features is verified empirically with logistic models. The 26 optimized feature sets, reduced from the original 40, perform better in comparisons made before and after the optimization. Model-based feature selection approaches, although used extensively in economics, have rarely been applied to NLP for Chinese until now. The approach sheds new light on feature selection methods in NLP and has implications for generating rules that map Chinese ZHE aspects to their target English categories before automatic translation into English.
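The first two steps described in the abstract, dummy-variable encoding of discrete features and correlation estimation from a contingency table, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy feature values, and the use of Cramér's V as the association measure are all assumptions made for illustration, since the abstract does not state which statistic was computed from the contingency tables.

```python
from collections import Counter
from math import sqrt


def dummy_encode(values):
    """One-hot encode a discrete feature, dropping the first level as reference.

    Returns the kept level names and one 0/1 row per observation.
    """
    levels = sorted(set(values))
    kept = levels[1:]  # the first level serves as the reference category
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return kept, rows


def contingency_table(xs, ys):
    """Cross-tabulate two discrete features into a table of counts."""
    row_levels = sorted(set(xs))
    col_levels = sorted(set(ys))
    counts = Counter(zip(xs, ys))
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]


def cramers_v(table):
    """Cramér's V (assumed measure): 0 = no association, 1 = perfect association."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical feature values, invented purely for illustration.
    verb_class = ["state", "state", "activity", "activity"]
    has_time_adverb = ["yes", "yes", "no", "no"]
    print(dummy_encode(verb_class))
    print(cramers_v(contingency_table(verb_class, has_time_adverb)))
```

Feature pairs whose association exceeds a chosen threshold would be the candidates for combination in the later Principal Component Analysis step; the threshold itself is a design choice that the abstract does not specify.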

A Model for Linguistic Knowledge Discovery from Large-Scale Corpuses Based on Rough Set Techniques

RESEARCH ON A CHINESE LANGUAGE MODEL BASED ON MULTI KNOWLEDGE SOURCES AND ITS IMPLEMENTATION

Mining Pinyin-to-character Conversion Rules from Large-Scale Corpus: a Rough Set Approach

KICE: A Knowledge Consolidation and Expansion Framework for Relation Extraction.

Reasoning Makes Good Annotators : an Automatic Task-specific Rules Distilling Framework for Low-resource Relation Extraction

Automatic Learning and Refinement Algorithm for Chinese Base Chunk Rules

Applying rough sets to feature extraction in POS tagging

A Model-based Feature Optimization Approach to Chinese Language Processing.

A Hybrid Chinese Language Model based on a Combination of Ontology with Statistical Method

Generalization-based Discovery of Spatial Association Rules with Linguistic Cloud Models

A rule-general abductive learning by rough sets

Linguistic knowledge representation and automatic acquisition based on a combination of ontology with statistical method

Using Kullback-Leibler Divergence Language Models to Find Experts in Enterprise Corpora

Chinese WSD based on features obtaining with shallow parsing

A New Algorithm for the Acquisition of Knowledge from Scientific Literature in Specific Fields Based on Natural Language Comprehension.

A Hybrid Language Model Based on Statistics and Linguistic Rules

Rule extraction based on linguistic-valued intuitionistic fuzzy layered concept lattice

A WSD Model for Corpus Construction

Automatic Recognition of Chinese Scientific and Technological Terms Using Integrated Linguistic Knowledge

A Filter-APOSD approach for feature selection and linguistic knowledge discovery