WCC-EC 2.0: Enhancing Neural Machine Translation with a 1.6M+ Web-Crawled English-Chinese Parallel Corpus

Jinyi Zhang, Ke Su, Ye Tian, Tadahiro Matsumoto

DOI: https://doi.org/10.3390/electronics13071381

IF: 2.9

2024-04-06

Electronics

Abstract: This research introduces WCC-EC 2.0 (Web-Crawled Corpus: English and Chinese), a comprehensive parallel corpus designed for enhancing Neural Machine Translation (NMT), featuring over 1.6 million English-Chinese sentence pairs meticulously gathered via web crawling. This corpus, extracted through an advanced web crawler, showcases the vast linguistic diversity and richness of English and Chinese, uniquely spanning the rarely covered news and music domains. Our methodical approach in web crawling and corpus assembly, coupled with rigorous experiments and manual evaluations, demonstrated its superiority by achieving high BLEU scores, marking significant strides in translation accuracy and model resilience. Its inclusion of these specific areas adds significant value, providing a unique dataset that enriches the scope for NMT research and development. With the rise of NMT technology, WCC-EC 2.0 emerges not only as an invaluable resource for researchers and developers, but also as a pivotal tool for improving translation accuracy, training more resilient models, and promoting interlingual communication.

engineering, electrical & electronic; computer science, information systems; physics, applied

What problem does this paper attempt to address?

The paper addresses the shortage of English-Chinese parallel data for Neural Machine Translation (NMT), particularly in the news and music domains. The research team constructed WCC-EC 2.0, a parallel corpus of over 1.6 million English-Chinese sentence pairs collected from the web by crawling. It extends the previous version, WCC-EC 1.0, which contained approximately 340,000 English-Chinese sentence pairs from the news domain, by adding about 1.3 million sentence pairs from the lyrics domain. The aim is to improve the performance of NMT systems, especially on colloquial expressions, slang, and polysemous words.

The main contributions of the paper include:

1. **Corpus Construction**: Creation of WCC-EC 2.0, a large English-Chinese parallel corpus that combines news and lyrics data, broadening linguistic diversity and applicability and filling the gap in music-domain corpora.
2. **Quality Evaluation System**: Establishment of a manual evaluation system, used alongside automatic metrics (such as BLEU scores), to assess translation quality at a finer grain. The system judges translations on natural fluency, completeness, and the use of everyday language.

Through the construction, quality inspection, and evaluation of WCC-EC 2.0, the researchers aim to provide a valuable resource for the NMT research community and to advance translation accuracy, model robustness, and cross-language communication.
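To make the automatic metric mentioned above more concrete, here is a minimal sketch of a BLEU-style score in Python. It is a simplified, unsmoothed, single-reference, sentence-level variant written for illustration only; the summary does not say which BLEU implementation or tokenization the authors used. The function names `ngrams` and `bleu` are hypothetical, and splitting Chinese text into characters is an assumption made here, not a detail taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed single-reference BLEU with uniform weights, in [0, 1].

    Illustrative sketch only: real evaluations usually report
    corpus-level BLEU computed with an established toolkit.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clipped matches: each candidate n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


# Assumption: Chinese is tokenized by character for this example.
reference_zh = list("今天天气很好")
candidate_zh = list("今天天气不错")
score = bleu(candidate_zh, reference_zh)
```

Because this sketch applies no smoothing, a short candidate with no matching 4-gram scores zero, which is one reason sentence-level BLEU is usually paired with manual evaluation of the kind the paper describes.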

WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation

UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation

JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus

NEJM-enzh: A Parallel Corpus for English-Chinese Translation in the Biomedical Domain

JParaCrawl v3.0: A Large-scale English-Japanese Parallel Corpus

Automatically Building Large-Scale Named Entity Recognition Corpora from Chinese Wikipedia

Development of Translation Database based on Chinese-English parallel corpora

EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation

Corpus-based Research on English-Chinese Translation Teaching Combining Vocabulary Learning and Practice

A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

Automatic construction of English/Chinese parallel corpora

A large English–Thai parallel corpus from the web and machine-generated text

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web

Sentence Alignment with Parallel Documents Facilitates Biomedical Machine Translation

SubCharacter Chinese-English Neural Machine Translation with Wubi encoding

ChineseWebText: Large-scale High-quality Chinese Web Text Extracted with Effective Evaluation Model

Bilingual Corpus Mining and Multistage Fine-Tuning for Improving Machine Translation of Lecture Transcripts

WeChat Neural Machine Translation Systems for WMT20

English-Chinese Machine Translation Based on Transfer Learning and Chinese-English Corpus

scb-mt-en-th-2020: A Large English-Thai Parallel Corpus