Character-Level Chinese Dependency Parsing via Modeling Latent Intra-Word Structure

Yang Hou, Zhenghua Li
2024-06-06
Abstract: Revealing the syntactic structure of sentences in Chinese poses significant challenges for word-level parsers due to the absence of clear word boundaries. To facilitate a transition from word-level to character-level Chinese dependency parsing, this paper proposes modeling latent internal structures within words. In this way, each word-level dependency tree is interpreted as a forest of character-level trees. A constrained Eisner algorithm is implemented to ensure the compatibility of character-level trees, guaranteeing a single root for intra-word structures and establishing inter-word dependencies between these roots. Experiments on Chinese treebanks demonstrate the superiority of our method over both the pipeline framework and previous joint models. A detailed analysis reveals that a coarse-to-fine parsing strategy empowers the model to predict more linguistically plausible intra-word structures.
Computation and Language
What problem does this paper attempt to address?
The paper mainly addresses two key issues in Chinese dependency parsing:

1. **Challenges posed by the lack of explicit word boundaries**: Because Chinese text has no clear word boundaries, traditional dependency parsers rely on word-level treebanks and require the text to be segmented before parsing. This adds a pipeline stage and makes the parsing results sensitive to segmentation errors.
2. **Transition from word-level to character-level**: To overcome this problem, researchers have attempted to shift to character-level Chinese dependency parsing. Because character-level Chinese treebanks are scarce, word-level trees must be converted into character-level trees. Previous methods either required manual annotation of internal word structures or used simplified rules to define pseudo internal structures. The former is time-consuming, and the latter fails to accurately represent the syntactic roles of characters.

To address these issues, the paper proposes modeling latent internal word structures for character-level Chinese dependency parsing. The method implicitly represents all possible internal structures within each word and introduces a constrained Eisner algorithm to ensure that the generated character-level trees are compatible with the word-level trees. In addition, a coarse-to-fine parsing strategy is proposed to improve parsing accuracy and to produce internal word structures that better conform to linguistic principles.

Experimental results show that this method outperforms pipeline frameworks and previous joint models on Chinese treebanks. Further analysis shows that the proposed constraints matter for both parsing performance and tree integrity, and an examination of the distribution of predicted internal word structures confirms that the method can infer complex internal word structures.
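
The compatibility requirement described here (each word has exactly one root character, and inter-word dependencies connect only those roots) can be illustrated with a small checker. This is a minimal sketch, not the paper's constrained Eisner algorithm: it only verifies whether a given character-level tree is consistent with a word-level tree. The function names, the 0-indexed head encoding with `-1` for the root, the half-open word spans, and the example sentence with its head assignments are all assumptions made for illustration. Projectivity, which Eisner-style parsing enforces, is not checked.

```python
from typing import List, Optional, Tuple


def word_roots(char_heads: List[int],
               word_spans: List[Tuple[int, int]]) -> List[Optional[int]]:
    """Return the root character of each word, or None if the word does not
    have exactly one character whose head lies outside the word."""
    roots: List[Optional[int]] = []
    for start, end in word_spans:
        # A head of -1 (sentence root) also counts as "outside the word".
        outside = [i for i in range(start, end)
                   if not (start <= char_heads[i] < end)]
        roots.append(outside[0] if len(outside) == 1 else None)
    return roots


def is_compatible(char_heads: List[int],
                  word_spans: List[Tuple[int, int]],
                  word_heads: List[int]) -> bool:
    """Check that a character-level tree matches a word-level tree.

    char_heads[i]: head character of character i (-1 for the sentence root).
    word_spans[w]: half-open character span (start, end) of word w.
    word_heads[w]: head word of word w (-1 for the sentence root).
    """
    roots = word_roots(char_heads, word_spans)
    if any(r is None for r in roots):
        return False  # some word has zero or several root characters

    for w, (start, end) in enumerate(word_spans):
        # Every character must reach the word root by following heads;
        # otherwise the intra-word structure contains a cycle.
        for i in range(start, end):
            node, steps = i, 0
            while node != roots[w] and steps < end - start:
                node = char_heads[node]
                steps += 1
            if node != roots[w]:
                return False

        # The word root must attach to the root character of the head word.
        expected = -1 if word_heads[w] == -1 else roots[word_heads[w]]
        if char_heads[roots[w]] != expected:
            return False
    return True


# Hypothetical example: 我 / 喜欢 / 苹果, characters indexed 0..4.
spans = [(0, 1), (1, 3), (3, 5)]
w_heads = [1, -1, 1]                 # 我 <- 喜欢 (root) -> 苹果
good = [1, -1, 1, 4, 1]              # word roots: 我, 喜, 果
two_roots = [1, -1, 1, 1, 1]         # both 苹 and 果 point outside 苹果
wrong_attach = [2, -1, 1, 4, 1]      # 我 attaches to 欢, not the root 喜

print(is_compatible(good, spans, w_heads))
print(is_compatible(two_roots, spans, w_heads))
print(is_compatible(wrong_attach, spans, w_heads))
```

In this encoding, many different character-level trees pass the check for the same word-level tree, which is the sense in which a word-level tree corresponds to a forest of character-level trees.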