Automatic Identification of Chinese Dialect Based on the Data from Chinese Pinyin Input Method

ZHANG Yan, ZHANG Yang, SUN Maosong
DOI: https://doi.org/10.3969/j.issn.1003-0077.2013.05.004
2013-01-01
Abstract: The study of a dialect comprises the study of its phonetics, vocabulary and grammar, and the first step is to recognize the dialect vocabulary. To date, Chinese dialect words have been collected mainly by experts, which is time-consuming and labor-intensive. With the development of information technology, people communicate widely over the network, so input method data contains a vast amount of vocabulary together with geographical information, which can help discover dialect words automatically. However, very few studies in the literature have exploited input method data to investigate dialects systematically. This paper therefore analyzes the behavior of Chinese input method users and, on that basis, proposes a way to discover regional dialect vocabulary automatically. Specifically, the paper extracts two representative features of dialect words from Chinese input method data and uses different combinations of these two features to recognize dialect words. Finally, extensive experiments are performed to evaluate the impact of the feature combinations on dialect word recognition.