Disyllabic Chinese Word Extraction Based on Character Thesaurus and Semantic Constraints in Word-Formation

Maosong Sun, Dongliang Xu, Benjamin Ka-Yin T'sou, Huaming Lu
DOI: https://doi.org/10.1007/978-3-540-87391-4_20
2008-01-01
Abstract: This paper presents a novel approach to Chinese disyllabic word extraction based on semantic information of characters. Two thesauri of Chinese characters, one manually crafted and one machine-generated, are constructed. A Chinese wordlist of 63,738 two-character words is used, together with the character thesauri, to learn semantic constraints between characters in Chinese word-formation, resulting in two types of semantic-tag-based HMM. Experiments show that: (1) both schemes outperform their character-based counterpart; (2) the machine-generated thesaurus outperforms the hand-crafted one to some extent in word extraction; and (3) a proper combination of semantic-tag-based and character-based methods could benefit word extraction.
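The idea of a semantic-tag-based HMM over two-character candidates can be illustrated with a minimal sketch. This is not the authors' implementation: the toy thesaurus, the tag names (`NATURE`, `ARTIFACT`), the toy wordlist, the even spreading of counts over tag pairs, and the add-alpha smoothing are all assumptions made for illustration. The sketch estimates tag-start, tag-transition, and character-emission probabilities from a wordlist, then scores a candidate pair by summing over the tag pairs its characters allow.

```python
from collections import Counter
from itertools import product

# Hypothetical toy thesaurus: character -> set of semantic tags (not from the paper).
THESAURUS = {
    "山": {"NATURE"}, "水": {"NATURE"}, "火": {"NATURE"},
    "车": {"ARTIFACT"}, "船": {"ARTIFACT"},
}

# Hypothetical toy wordlist of two-character words.
WORDLIST = ["山水", "火车", "水车", "火山"]


def train(wordlist, thesaurus):
    """Collect tag-level counts from a list of two-character words."""
    start, trans, emit, tag_total = Counter(), Counter(), Counter(), Counter()
    for word in wordlist:
        c1, c2 = word
        pairs = list(product(thesaurus[c1], thesaurus[c2]))
        w = 1.0 / len(pairs)  # assumption: spread each word evenly over its tag pairs
        for t1, t2 in pairs:
            start[t1] += w            # tag of the first character
            trans[(t1, t2)] += w      # first tag -> second tag
            emit[(t1, c1)] += w       # tag emits character
            emit[(t2, c2)] += w
            tag_total[t1] += w
            tag_total[t2] += w
    return start, trans, emit, tag_total


def score(pair, model, thesaurus, alpha=0.1):
    """Smoothed two-state HMM probability of a character pair, summed over tag pairs."""
    start, trans, emit, tag_total = model
    n_tags = len({t for tags in thesaurus.values() for t in tags})
    n_chars = len(thesaurus)
    n_words = sum(start.values())
    c1, c2 = pair
    total = 0.0
    for t1, t2 in product(thesaurus[c1], thesaurus[c2]):
        p_start = (start[t1] + alpha) / (n_words + alpha * n_tags)
        p_trans = (trans[(t1, t2)] + alpha) / (start[t1] + alpha * n_tags)
        p_e1 = (emit[(t1, c1)] + alpha) / (tag_total[t1] + alpha * n_chars)
        p_e2 = (emit[(t2, c2)] + alpha) / (tag_total[t2] + alpha * n_chars)
        total += p_start * p_e1 * p_trans * p_e2
    return total


model = train(WORDLIST, THESAURUS)
print(score("山车", model, THESAURUS), score("车山", model, THESAURUS))
```

Under these toy assumptions, an unseen pair whose tag sequence matches a pattern attested in the wordlist (here `NATURE` followed by `ARTIFACT`) scores higher than the reversed pair, which is the kind of generalization a purely character-based model cannot make for unseen character pairs.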