Abstract:Sentence representation from vanilla BERT models does not work well on sentence similarity tasks. Sentence-BERT models specifically trained on STS or NLI datasets are shown to provide state-of-the-art performance. However, building these models for low-resource languages is not straightforward due to the lack of these specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi. The vanilla BERT models trained using this simple strategy outperform the multilingual LaBSE trained using a complex training strategy. These models are evaluated on downstream text classification and similarity tasks. We evaluate these models on real text classification datasets to show embeddings obtained from synthetic data training are generalizable to real datasets as well and thus represent an effective training strategy for low-resource languages. We also provide a comparative analysis of sentence embeddings from fast text models, multilingual BERT models (mBERT, IndicBERT, xlm-RoBERTa, MuRIL), multilingual sentence embedding models (LASER, LaBSE), and monolingual BERT models based on L3Cube-MahaBERT and HindBERT. We release L3Cube-MahaSBERT and HindSBERT, the state-of-the-art sentence-BERT models for Marathi and Hindi respectively. Our work also serves as a guide to building low-resource sentence embedding models.

Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings

DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew

The Knesset Corpus: An Annotated Corpus of Hebrew Parliamentary Proceedings

Introducing DictaLM -- A Large Generative Language Model for Modern Hebrew

Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All

Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language

IsraParlTweet: The Israeli Parliamentary and Twitter Resource

Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities

MenakBERT -- Hebrew Diacriticizer

AlephBERT:A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With

A Language Modeling Approach to Diacritic-Free Hebrew TTS

Offensive Hebrew Corpus and Detection using BERT

D-Nikud: Enhancing Hebrew Diacritization with LSTM and Pretrained Models

HeBERT & HebEMO: a Hebrew BERT Model and a Tool for Polarity Analysis and Emotion Recognition

DeepParliament: A Legal domain Benchmark & Dataset for Parliament Bills Prediction

EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora

Political DEBATE: Efficient Zero-shot and Few-shot Classifiers for Political Text

ASR Bundestag: A Large-Scale political debate dataset in German

HeRo: RoBERTa and Longformer Hebrew Language Models

Training Language Models to Win Debates with Self-Play Improves Judge Accuracy

L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi