Abstract: Bilingual word embedding has been shown to be helpful for Statistical Machine Translation (SMT). However, most existing methods suffer from two drawbacks. First, they build word embeddings only from simple contexts, such as an entire document or a fixed-size sliding window, and ignore latent useful information in the selected context. Second, the word sense, rather than the word, should be the minimal semantic unit, yet most existing methods still use word-level representations. To overcome these drawbacks, this article presents a novel Graph-Based Bilingual Word Embedding (GBWE) method that projects bilingual word senses into a multidimensional semantic space. First, a bilingual word co-occurrence graph is constructed using the co-occurrence counts and pointwise mutual information between words. Then, maximum complete subgraphs (cliques), which serve as the minimal unit for bilingual sense representation, are dynamically extracted according to the contextual information. Subsequently, correspondence analysis, principal component analysis, and neural networks are used to reduce the clique-word matrix to lower dimensions and build the embedding model. Without contextual information, the proposed GBWE can be applied to lexical translation. Given contextual information, GBWE provides a dynamic solution for bilingual word representations, which can be applied to phrase translation and generation. Empirical results show that GBWE can enhance the performance of lexical translation, as well as of Chinese/French-to-English and Chinese-to-Japanese phrase-based SMT tasks (IWSLT, NTCIR, NIST, and WAT).
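
The graph-construction and clique-extraction steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the `en:`/`fr:` prefixes, the `min_pmi` threshold, and all function names are assumptions made for illustration. Clique extraction here uses the basic Bron-Kerbosch recursion and ignores context, and the dimensionality-reduction step (correspondence analysis, PCA, or a neural network) is omitted.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy parallel corpus: (English tokens, French tokens).
PAIRS = [
    (["bank", "money"], ["banque", "argent"]),
    (["bank", "river"], ["rive", "fleuve"]),
    (["tree"], ["arbre"]),
]


def build_graph(pairs, min_pmi=0.0):
    """Link two words if they co-occur in a sentence pair with PMI > min_pmi."""
    n = len(pairs)
    word_count, pair_count = Counter(), Counter()
    for src, tgt in pairs:
        # Language prefixes keep identical spellings in both languages apart.
        words = sorted({f"en:{w}" for w in src} | {f"fr:{w}" for w in tgt})
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    graph = {w: set() for w in word_count}
    for (a, b), c in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ), estimated over sentence pairs.
        pmi = math.log((c / n) / ((word_count[a] / n) * (word_count[b] / n)))
        if pmi > min_pmi:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))  # r cannot be extended any further
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}  # v is fully explored
            x = x | {v}  # exclude v from later branches

    expand(set(), set(graph), set())
    return sorted(cliques, key=sorted)


def clique_word_matrix(cliques, vocab):
    """Binary matrix: one row per clique, one column per word."""
    return [[1 if w in c else 0 for w in vocab] for c in cliques]


graph = build_graph(PAIRS)
cliques = maximal_cliques(graph)
vocab = sorted(graph)
matrix = clique_word_matrix(cliques, vocab)
for clique in cliques:
    print(sorted(clique))
```

In this toy corpus `en:bank` falls into two cliques, one with the financial words and one with the river words, which is the sense-level separation the clique unit is meant to capture. The rows of `matrix` are what a reduction method would then project into a lower-dimensional space.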
