Sentence Alignment for Ancient and Modern Chinese Parallel Corpus

Ying Liu, Nan Wang
DOI: https://doi.org/10.1007/978-3-642-34240-0_54
2012-01-01
Abstract: This paper describes a statistical method for aligning sentences in the ShiJi ancient and modern Chinese parallel corpus. The statistical model (a log-linear model) uses sentence length, alignment mode, and co-occurring Hanzi characters as features. A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of the lengths of the two sentences, the change of alignment mode, and the variance of the number of co-occurring Hanzi characters. The precision of sentence alignment on the test corpus is 96.2%. Furthermore, we discuss the influence of the different features and how to combine them to improve precision and recall.
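
To make the idea of a log-linear score over the three feature types concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the feature definitions, the length ratio, the mode priors, the weights, and all function names are hypothetical, and the example sentences are toy strings rather than ShiJi text.

```python
import math

# Hypothetical prior probabilities for alignment modes, keyed by
# (number of ancient sentences, number of modern sentences).
MODE_PRIOR = {(1, 1): 0.85, (1, 2): 0.07, (2, 1): 0.05, (2, 2): 0.03}

# Hypothetical feature weights; a real system would tune these on held-out data.
WEIGHTS = {"length": 1.0, "mode": 1.0, "cooc": 2.0}


def is_hanzi(ch):
    """True if ch lies in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def features(ancient, modern, ratio=1.5):
    """Compute three illustrative features for one candidate alignment.

    ancient, modern: lists of sentence strings grouped by the candidate.
    ratio: assumed average modern/ancient length ratio (hypothetical value).
    """
    a_chars = [c for c in "".join(ancient) if is_hanzi(c)]
    m_chars = [c for c in "".join(modern) if is_hanzi(c)]
    len_a, len_m = len(a_chars), len(m_chars)

    # Scaled length difference: penalize deviation from the expected ratio.
    length = -abs(len_m - ratio * len_a) / math.sqrt(len_a + 1)
    # Alignment mode: log prior of the (n ancient, m modern) pattern.
    mode = math.log(MODE_PRIOR.get((len(ancient), len(modern)), 1e-3))
    # Co-occurring Hanzi: distinct shared characters, normalized to [0, 1].
    shared = set(a_chars) & set(m_chars)
    cooc = len(shared) / max(1, min(len_a, len_m))

    return {"length": length, "mode": mode, "cooc": cooc}


def log_score(ancient, modern):
    """Unnormalized log-linear score: weighted sum of feature values."""
    f = features(ancient, modern)
    return sum(WEIGHTS[name] * value for name, value in f.items())


def candidate_probabilities(candidates):
    """Normalize scores over competing candidates with a softmax."""
    scores = [log_score(a, m) for a, m in candidates]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    candidates = [
        (["学而时习之"], ["学习并且时常温习它"]),  # shares several characters
        (["学而时习之"], ["今天天气很好"]),        # shares none
    ]
    for (a, m), p in zip(candidates, candidate_probabilities(candidates)):
        print(a, m, round(p, 3))
```

With these toy inputs, both candidates have the same length and mode features, so the first candidate should receive the higher probability purely because of its shared Hanzi characters. Choosing the best sequence of such candidates over a whole document would additionally require a search procedure such as dynamic programming, which this sketch omits.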