Research on the Automatic Word Segmentation of The Book of Songs under Multi-dimensional Domain Knowledge
Wang Shanshan, Wang Dongbo, Huang Shuiqing, He Lin
DOI: https://doi.org/10.3772/j.issn.1000-0135.2018.02.007
2018-01-01
Abstract: The Book of Songs is the earliest anthology of poetry in China and one of the thirteen classics of the Confucian tradition. It is ranked first among the ancient canonical Five Classics, which comprise the Yijing ("Classic of Changes"), the Shujing ("Classic of History"), The Book of Songs, the Collection of Rituals, and the Chunqiu ("Spring and Autumn Annals"). The content of The Book of Songs is rich, reflecting all aspects of social life in the Zhou Dynasty, such as labor and love, war and corvée, oppression and rebellion, customs and marriage, ancestor worship and banquets, and even astronomy, geomorphology, animals, and plants. It is a mirror of Zhou Dynasty society, known as The Life Encyclopedia of Ancient Society. Moreover, The Book of Songs served as a textbook of ancient Chinese political ethics, aesthetic education, and naturalism. With the extensive application of humanities computing, this paper combines the Sinological Index Series with the domain knowledge of the Mao Shi Index, and studies the automatic word segmentation of The Book of Songs using machine learning. Based on a manually segmented corpus of The Book of Songs, the Guang Yun was combined with statistical analysis to derive 23 sets of feature templates that fuse different kinds of feature knowledge, and machine learning segmentation models were then trained with these templates. The performance of each segmentation model is analyzed. Lexical features are found to have the greatest influence on segmentation quality for The Book of Songs, and the F value (the harmonic mean of precision and recall) of the segmentation model reaches up to 97.42%. Finally, the paper uses the domain glossary of the Mao Shi Index to post-process the output of the best-performing segmentation model by correcting long words, and obtains a word corpus of The Book of Songs that incorporates the expert vocabulary knowledge of the Mao Shi Index.
This article integrates multi-dimensional domain knowledge to realize the automatic segmentation of The Book of Songs, which provides a reference for related research on Pre-Qin poetry and may inform the study of automatic word segmentation of other Pre-Qin classics. The word corpus of The Book of Songs, as part of a word corpus of the Pre-Qin classics, supports further knowledge mining of the Pre-Qin classics.
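The two quantitative steps mentioned in the abstract, scoring a segmentation by its F value and correcting long words against a glossary, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not describe the paper's evaluation script or its correction rule, so the function names (`to_spans`, `prf`, `merge_long_words`), the greedy longest-match merging strategy, the `max_len` limit, the example segmentation of the line, and the one-entry glossary are all hypothetical rather than taken from the paper or the Mao Shi Index.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F value (their harmonic mean)."""
    g, p = to_spans(gold), to_spans(pred)
    correct = len(g & p)  # words whose boundaries match exactly
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


def merge_long_words(words: List[str], glossary: Set[str],
                     max_len: int = 4) -> List[str]:
    """Greedily merge adjacent tokens whose concatenation is a glossary entry.

    Hypothetical correction rule: at each position, try the longest run of
    tokens first and merge it if the joined string is in the glossary.
    """
    out: List[str] = []
    i = 0
    while i < len(words):
        merged = False
        for j in range(len(words), i + 1, -1):  # runs of at least two tokens
            cand = "".join(words[i:j])
            if len(cand) <= max_len and cand in glossary:
                out.append(cand)
                i = j
                merged = True
                break
        if not merged:
            out.append(words[i])
            i += 1
    return out


# Illustrative data only: an assumed reference segmentation of one line,
# a model output that wrongly splits one word, and a toy glossary.
gold = ["關關", "雎鳩", "在", "河", "之", "洲"]
pred = ["關關", "雎", "鳩", "在", "河", "之", "洲"]
glossary = {"雎鳩"}

before = prf(gold, pred)
corrected = merge_long_words(pred, glossary)
after = prf(gold, corrected)
print(before, corrected, after)
```

In this toy case the over-split word is restored by the glossary lookup, so the F value rises after correction. A real glossary-based correction would also need a policy for cases where a merge conflicts with a correct shorter segmentation, which this sketch does not address.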