Jointly Learning Bilingual Word Embeddings and Alignments

Song Zhenqiao, Zheng Xiaoqing, Huang Xuanjing
DOI: https://doi.org/10.1007/s10590-021-09283-z
2021-01-01
Machine Translation
Abstract: Learning bilingual word embeddings is much easier when parallel corpora come with explicit, accurate word alignments. In most cases, however, parallel corpora provide only pairs of sentences that are semantically equivalent at the sentence level. Although algorithms have been proposed to obtain word alignments, good alignments remain hard to achieve. In this study, we propose Bilingual Word Embeddings with Soft Alignment (BWESA), which learns bilingual word representations from parallel corpora without explicit word-level alignment information. At the same time, the method learns to make ‘soft’ alignments between words by approximating, for each word in a sentence, a distribution that estimates how likely the word is aligned to each word in the parallel translation. Unlike previous methods, which typically rely on a predetermined word alignment, our learning strategy draws similar words (chosen by the continuously improving word alignment) closer together in the shared vector space during training. This study is among the first to learn bilingual word alignments and embeddings jointly. The proposed method was evaluated on two cross-lingual tasks (cross-lingual document classification and word translation) and achieved state-of-the-art or comparable results on both.
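The idea of a ‘soft’ alignment can be illustrated with a minimal sketch. The abstract does not give the model's actual formulation, so everything below is an assumption made for illustration: the function names `soft_alignment` and `expected_target` are hypothetical, the toy vectors are invented, and the dot-product score followed by a softmax is one common way to turn similarities into a distribution, not necessarily the one used in BWESA.

```python
import math

def soft_alignment(source_vecs, target_vecs):
    """For each source word vector, return a probability distribution
    over the target words (assumed scoring: dot product + softmax)."""
    alignments = []
    for s in source_vecs:
        # Similarity of this source word to every target word.
        scores = [sum(a * b for a, b in zip(s, t)) for t in target_vecs]
        # Numerically stable softmax over the target positions.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        alignments.append([e / total for e in exps])
    return alignments

def expected_target(alignment_row, target_vecs):
    """Alignment-weighted average of the target word vectors."""
    dim = len(target_vecs[0])
    return [sum(p * t[d] for p, t in zip(alignment_row, target_vecs))
            for d in range(dim)]

# Toy 2-dimensional embeddings (invented for illustration only).
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

A = soft_alignment(source, target)   # A[i][j]: weight of target j for source i
ctx = expected_target(A[0], target)  # soft "translation" of source word 0
```

Each row of `A` sums to 1, so it can be read as an estimate of how likely a source word is aligned to each word in the translation. In a joint training scheme of the kind the abstract describes, such a distribution would be re-estimated as the embeddings change, so the alignment and the embeddings improve together.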