Abstract: Cross-Lingual Word Embeddings (CLWEs) are a key component for transferring linguistic information learnt in higher-resource settings to lower-resource ones. Recent research in cross-lingual representation learning has focused on offline mapping approaches due to their simplicity, computational efficiency, and ability to work with minimal parallel resources. However, these approaches crucially depend on the assumption that the embedding spaces are approximately isomorphic, i.e., that they share a similar geometric structure. This assumption does not hold in practice, leading to poorer performance on low-resource and distant language pairs. In this paper, we introduce a framework to learn CLWEs for low-resource pairs, without assuming isometry, via joint exploitation of a related higher-resource language. We first pre-align the low-resource and related language embedding spaces using offline methods to mitigate the assumption of isometry. Following this, we use joint training methods to develop CLWEs for the related language and the target embedding space. Finally, we remap the pre-aligned low-resource space and the target space to generate the final CLWEs. We show consistent gains over current methods in both quality and degree of isomorphism, as measured by bilingual lexicon induction (BLI) and eigenvalue similarity respectively, across several language pairs: {Nepali, Finnish, Romanian, Gujarati, Hungarian}-English. Lastly, our analysis points to the relatedness of the related language, as well as the amount of related-language data available, as key factors in determining the quality of the embeddings achieved.
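
As an illustration of what an offline mapping step can look like, the following is a minimal sketch using orthogonal Procrustes, one common offline method. The abstract does not say which offline method is used, so the choice of Procrustes is an assumption, and the function name `procrustes_map` and the synthetic data are hypothetical. Given row-aligned source and target vectors for a seed dictionary, it finds the orthogonal matrix that best maps one space onto the other.

```python
import numpy as np


def procrustes_map(X, Y):
    """Return the orthogonal W minimising ||X @ W - Y||_F.

    X, Y: (n_pairs, dim) arrays whose rows are embeddings of
    dictionary-aligned source and target words (hypothetical inputs).
    """
    # Closed-form solution: W = U @ Vt, where X^T Y = U S Vt.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic example: the "source" space is an exact rotation of the
# "target" space, i.e. the two spaces are perfectly isometric.
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 8))                  # target-side vectors
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthogonal matrix
X = Y @ Q.T                                       # source-side vectors, so X @ Q == Y

W = procrustes_map(X, Y)
mapped = X @ W                                    # source vectors in the target space
```

Because the learned map is constrained to be orthogonal, it can only rotate or reflect the source space. In this synthetic case the mapping is recovered exactly; when the two spaces do not share a similar geometric structure, no orthogonal map aligns them well, which is the limitation the abstract describes.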

Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment

Word Alignment by Fine-tuning Embeddings on Parallel Corpora

Jointly Learning Bilingual Word Embeddings and Alignments

Isomorphic Cross-lingual Embeddings for Low-Resource Languages

Word Translation Without Parallel Data

Pre-trained Word Embedding Based Parallel Text Augmentation Technique for Low-Resource NMT in Favor of Morphologically Rich Languages

Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations

Obtaining Parallel Sentences in Low-Resource Language Pairs with Minimal Supervision

Beyond Offline Mapping: Learning Cross Lingual Word Embeddings through Context Anchoring

Multilingual acoustic word embedding models for processing zero-resource languages

A Three-Pronged Approach to Cross-Lingual Adaptation with Multilingual LLMs

Massively Parallel Cross-Lingual Learning in Low-Resource Target Language Translation

Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition

Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context

Unsupervised Cross-Lingual Sentence Representation Learning via Linguistic Isomorphism

Learning Multilingual Sentence Embeddings From Monolingual Corpus

Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment

Towards Multi-Sense Cross-Lingual Alignment of Contextual Embeddings

Learning Tibetan-Chinese Cross-Lingual Word Embeddings

Improving Word Embeddings via Combining with Complementary Languages