Abstract: Measuring the semantic similarity between two sentences (Semantic Textual Similarity, STS) is fundamental to many NLP applications. Despite remarkable results in supervised settings with adequate labeling, little attention has been paid to this task in low-resource languages with insufficient labeling. Existing approaches mostly rely on machine translation (MT) to translate sentences into a rich-resource language. These approaches either introduce language biases or are impractical in industrial applications, where spoken-language scenarios are more common and strict efficiency is required. In this work, we propose a multilingual framework that tackles the STS task in a low-resource language, e.g. Spanish, Arabic, Indonesian, or Thai, by utilizing the abundant annotated data available in a rich-resource language, e.g. English. Our approach extends a basic monolingual STS framework with a shared multilingual encoder, pretrained on a translation task, in order to incorporate rich-resource language data. Because the multilingual encoder is shared, one sentence can have multiple representations, one for each target translation language, and these are combined in an ensemble model to improve similarity evaluation. We demonstrate the advantage of our method over other state-of-the-art approaches on the SemEval STS task, where it significantly improves on the non-MT method, and in an online industrial product, where the MT method fails to beat the baseline while our approach still yields consistent improvements.
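
The ensemble idea in the abstract can be illustrated with a minimal sketch. It assumes that each sentence has already been encoded into one vector per target translation language, and it combines the per-language cosine similarities by plain averaging. The averaging step, the function names `cosine` and `ensemble_similarity`, and the toy vectors are all assumptions made for illustration; the abstract does not specify how the ensemble model combines the representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)


def ensemble_similarity(reps_a, reps_b):
    """Average the cosine similarity over shared target languages.

    reps_a and reps_b map a target-language code to the sentence
    representation produced for that target language (hypothetical
    inputs; a real system would obtain them from the shared encoder).
    Averaging is an assumed stand-in for the paper's ensemble model.
    """
    shared = sorted(set(reps_a) & set(reps_b))
    if not shared:
        raise ValueError("no shared target languages")
    scores = [cosine(reps_a[lang], reps_b[lang]) for lang in shared]
    return sum(scores) / len(scores)


# Toy vectors, not real encoder outputs.
reps_a = {"en": [1.0, 0.0], "es": [0.0, 2.0]}
reps_b = {"en": [1.0, 0.0], "es": [1.0, 0.0]}
score = ensemble_similarity(reps_a, reps_b)
```

In this toy case the two sentences agree under the `en` representation and are orthogonal under the `es` one, so the averaged score falls between the two per-language scores. A learned combination could weight the languages differently.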

Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models

Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs

SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training

Beyond Shared Vocabulary: Increasing Representational Word Similarities across Languages for Multilingual Machine Translation

Using Word Embedding for Cross-Language Plagiarism Detection

Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages

Embedding Word Similarity with Neural Machine Translation

Deduplicating Training Data Makes Language Models Better

Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems

Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax

Improving Multilingual Semantic Textual Similarity with Shared Sentence Encoder for Low-resource Languages

Privacy-Preserving Data Deduplication for Enhancing Federated Learning of Language Models (Extended Version)

An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

Multilingual Sentence-Level Semantic Search using Meta-Distillation Learning

MPN: Leveraging Multilingual Patch Neuron for Cross-lingual Model Editing

OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement

A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models

Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs

A Deep Neural Network Approach To Parallel Sentence Extraction

Synonymous Entity Expansion Based Information De-duplication

Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features