Towards Better Translations from Classical to Modern Chinese: A New Dataset and a New Method.
Zongyuan Jiang, Jiapeng Wang, Jiahuan Cao, Xue Gao, Lianwen Jin
DOI: https://doi.org/10.1007/978-3-031-44693-1_31
2023-01-01
Abstract: Classical Chinese (Ancient Chinese) is the written language that was used in ancient China, and it has been an important carrier of Chinese culture for thousands of years. Numerous ideas in modern disciplines, including mathematics, medicine, and engineering, have been influenced by or derived from it, which demonstrates the need to understand, inherit, and disseminate it. Consequently, there is an urgent need to develop neural machine translation to facilitate the comprehension of classical Chinese sentences. In this paper, we introduce a high-quality and comprehensive dataset called C2MChn, consisting of about 615K sentence pairs for translation between classical and modern Chinese. To the best of our knowledge, this is the first dataset covering a wide range of domains, including history books, Buddhist classics, and Confucian classics. Furthermore, based on an analysis of classical and modern Chinese, we propose a simple yet effective method named Syntax-Semantics Awareness Transformer (SSAT). It is capable of leveraging both syntactic and semantic information, which are indispensable for better translation of classical Chinese. Experiments show that our model achieves better BLEU scores than several state-of-the-art methods as well as two general translation engines, the Microsoft and Baidu APIs. The dataset and related resources will be released at: https://github.com/Zongyuan-Jiang/C2MChn