WCC-EC 2.0: Enhancing Neural Machine Translation with a 1.6M+ Web-Crawled English-Chinese Parallel Corpus

Jinyi Zhang, Ke Su, Ye Tian, Tadahiro Matsumoto
DOI: https://doi.org/10.3390/electronics13071381
IF: 2.9
2024-04-06
Electronics
Abstract: This research introduces WCC-EC 2.0 (Web-Crawled Corpus—English and Chinese), a comprehensive parallel corpus designed for enhancing Neural Machine Translation (NMT), featuring over 1.6 million English-Chinese sentence pairs meticulously gathered via web crawling. This corpus, extracted through an advanced web crawler, showcases the vast linguistic diversity and richness of English and Chinese, uniquely spanning the rarely covered news and music domains. Our methodical approach in web crawling and corpus assembly, coupled with rigorous experiments and manual evaluations, demonstrated its superiority by achieving high BLEU scores, marking significant strides in translation accuracy and model resilience. Its inclusion of these specific areas adds significant value, providing a unique dataset that enriches the scope for NMT research and development. With the rise of NMT technology, WCC-EC 2.0 emerges not only as an invaluable resource for researchers and developers, but also as a pivotal tool for improving translation accuracy, training more resilient models, and promoting interlingual communication.
Categories: Engineering, Electrical & Electronic; Computer Science, Information Systems; Physics, Applied
What problem does this paper attempt to address?
The paper addresses the shortage of English-Chinese parallel data for Neural Machine Translation (NMT), particularly in the news and music domains. The authors construct WCC-EC 2.0, a parallel corpus of over 1.6 million English-Chinese sentence pairs collected from the web by crawling. It extends the previous version, WCC-EC 1.0, which contained approximately 340,000 English-Chinese sentence pairs in the news domain, by adding about 1.3 million sentence pairs in the lyrics domain. The added data makes the corpus more diverse and is intended to improve the performance of NMT systems, especially on colloquial expressions, slang, and polysemous words.

The main contributions of the paper are:

1. **Corpus Construction**: Creation of WCC-EC 2.0, a large English-Chinese parallel corpus that combines news and lyrics data, broadening its linguistic diversity and applicability and filling the gap in music-domain corpora.
2. **Quality Evaluation System**: Establishment of a manual evaluation system that is combined with automatic metrics (such as BLEU scores) to assess translation quality at a finer grain. The manual evaluation judges translations on natural fluency, completeness, and the use of everyday language.

Through the construction, quality inspection, and evaluation of WCC-EC 2.0, the researchers aim to provide a valuable resource for the NMT research community and to advance translation accuracy, model robustness, and cross-language communication.
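
As a quick sanity check on the corpus sizes quoted in this summary, the short Python sketch below adds up the two domains and computes each one's share. The counts are the rounded figures from the text (approximately 340,000 news pairs and about 1.3 million lyrics pairs), not exact totals from the paper, and the helper name `corpus_breakdown` is hypothetical, introduced only for illustration.

```python
# Rounded pair counts as stated in the summary (assumed, not exact totals).
NEWS_PAIRS = 340_000      # WCC-EC 1.0, news domain
LYRICS_PAIRS = 1_300_000  # added in WCC-EC 2.0, lyrics domain


def corpus_breakdown(domain_counts):
    """Return the total pair count and each domain's share of that total."""
    total = sum(domain_counts.values())
    shares = {domain: count / total for domain, count in domain_counts.items()}
    return total, shares


total, shares = corpus_breakdown({"news": NEWS_PAIRS, "lyrics": LYRICS_PAIRS})
print(f"total pairs: about {total:,}")
for domain, share in shares.items():
    print(f"{domain}: {share:.1%}")
```

With these rounded inputs the total comes to about 1.64 million pairs, consistent with the "over 1.6 million" figure, and lyrics make up roughly four fifths of the corpus.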