Chinese Abbreviation Identification Using Abbreviation-Template Features

Xu Sun, Houfeng Wang
2006-01-01
Abstract: Chinese abbreviations are frequently used without being defined, which creates considerable difficulty for NLP. In this study, the definition-independent abbreviation identification problem is proposed and resolved as a classification task in which abbreviation candidates are classified as either ‘abbreviation’ or ‘non-abbreviation’ according to the posterior probability. To identify new abbreviations from existing ones, we add generalization capability to the abbreviation lexicon by replacing words with word classes, thereby creating abbreviation-templates. An SVM classifier is then trained on abbreviation-template features together with context information. Evaluation on a raw Chinese corpus shows encouraging performance. Our experiments further demonstrate improvements from integrating extended word clustering (designed to enable joint learning of word classes), morphological analysis, substring analysis and person name identification. To our knowledge, this is the first definition-independent machine learning approach for Chinese abbreviation identification.