Abstract: Automatic grammatical error detection for Chinese has long been a major challenge for NLP researchers, mostly because of the flexible and irregular ways in which the language is expressed. Strictly speaking, there is no evidence that Chinese, especially spoken Chinese, follows a set of formal and strict grammar rules, which makes the language hard for foreign learners to master. The CFL shared task provides a platform for researchers to develop automatic engines that detect grammatical errors, based on a number of manually annotated spoken Chinese sentences. This paper introduces HITSZ’s system for this year’s Chinese grammatical error diagnosis (CGED) task. As in last year’s task, we focus mainly on the error detection level and the error type identification level, and do little at the position level. All our models use supervised machine learning methods trained only on the given training corpus, with no heuristic rules and no other reference materials (except for previous years’ data). Among the three runs we submitted, the one using the ensemble classifier Random Feature Subspace (HITSZ_Run1) performed best, with a best F1 of 0.6648 at the detection level and 0.2675 at the identification level.

Chinese Grammatical Error Diagnosis Using Ensemble Learning