Enhancing Multilingual Universal Sentence Embeddings by Monolingual Contrastive Learning

Fangwei Ou, Jinan Xu
DOI: https://doi.org/10.1109/icnlp60986.2024.10692736
2024-01-01
Abstract: Contrastive learning has recently been shown to be an effective method for learning sentence embeddings. However, it has mainly been applied to English sentence embeddings and has seen less use for multilingual sentence embeddings. Meanwhile, existing multilingual and cross-lingual models rely on large amounts of parallel corpora during pre-training or fine-tuning. In this work, we propose CoSCSE, a contrastive learning framework based on code-switched monolingual corpora, which efficiently learns multilingual sentence embeddings without relying on hard-to-obtain parallel corpora. Specifically, we divide CoSCSE into two variants according to whether the monolingual training corpus carries supervision signals. In unsupervised CoSCSE, we extend SimCSE and mSimCSE to training scenarios with multiple monolingual corpora. In supervised CoSCSE, a code-switching augmentation strategy is applied to the monolingual corpora. We conduct experiments on an extended version of the multilingual and cross-lingual semantic textual similarity (STS) 2017 task. Experimental results show that CoSCSE is competitive with models that use large amounts of parallel corpora, and achieves absolute improvements of 2.3% and 1.9% over mSimCSE in the unsupervised and supervised scenarios, respectively, verifying the effectiveness of our method.
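The two ingredients named in the abstract, code-switching augmentation and a contrastive objective, can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionary `TOY_DICT`, the helper names `code_switch` and `info_nce`, the replacement ratio, and the temperature of 0.05 are all assumptions made for illustration, and the loss is the generic in-batch InfoNCE form used by SimCSE-style methods rather than anything confirmed by the abstract.

```python
import math
import random

# Hypothetical toy English -> German dictionary (an assumption, not from the paper).
TOY_DICT = {"cat": "Katze", "sits": "sitzt", "mat": "Matte"}


def code_switch(sentence, dictionary, ratio=0.5, rng=None):
    """Replace each token found in `dictionary` with its translation
    with probability `ratio`; other tokens are left unchanged."""
    rng = rng or random.Random(0)
    out = []
    for tok in sentence.split():
        if tok in dictionary and rng.random() < ratio:
            out.append(dictionary[tok])
        else:
            out.append(tok)
    return " ".join(out)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """In-batch contrastive loss: positives[i] is the positive for
    anchors[i]; every other row of `positives` acts as a negative."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)
```

In a setup of this kind, a sentence and its code-switched copy would be encoded by the same model and passed to `info_nce` as an anchor/positive pair, so that the encoder is pushed to place the mixed-language variant close to the original.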