Abstract: Multilingual pre-trained language models have achieved impressive results on most natural language processing tasks. However, their performance is limited by model capacity and by the under-representation of some languages in the pre-training data, especially low-resource languages. This has led to tailored pre-trained language models, which are pre-trained on large amounts of monolingual data or on domain-specific corpora. Nevertheless, compared with relying on multiple monolingual models, multilingual models offer the advantage of multilinguality, such as generalization across cross-lingual resources. To combine the advantages of multilingual and monolingual models, we propose KDDA, a framework that transfers knowledge from monolingual models into a single multilingual model with the aim of improving sentence representation for Vietnamese. KDDA employs a teacher-student framework and cross-lingual transfer: it takes knowledge from two monolingual models (the teachers) and transfers it into a unified multilingual model (the student). Since the representations of the teachers and the student lie in disparate semantic spaces, we measure the discrepancy between their distributions using Sinkhorn divergence, an optimal transport distance. We conduct experiments on two Vietnamese natural language understanding tasks: machine reading comprehension and natural language inference. Experimental results show that our model outperforms other state-of-the-art models and yields competitive performance.
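To make the discrepancy measure concrete, the following is a minimal NumPy sketch of a debiased Sinkhorn divergence between two batches of sentence embeddings. It is not the paper's implementation: the squared Euclidean cost, the uniform weights over sentences, the regularization strength `eps`, the iteration count, and the function names `entropic_ot` and `sinkhorn_divergence` are all assumptions made for illustration, and the teacher and student embeddings are assumed to share one dimensionality.

```python
import numpy as np


def _logsumexp(z, axis):
    # Numerically stable log(sum(exp(z))) along one axis.
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(z - m), axis=axis))


def entropic_ot(x, y, eps=0.1, n_iters=200):
    """Entropy-regularized OT cost between two point clouds (dual estimate).

    Assumptions (not from the paper): uniform weights on the rows of x and y,
    cost C_ij = 0.5 * ||x_i - y_j||^2, and a fixed number of log-domain
    Sinkhorn iterations instead of a convergence check.
    """
    n, m = x.shape[0], y.shape[0]
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    cost = 0.5 * np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)

    f = np.zeros(n)  # dual potential attached to x
    g = np.zeros(m)  # dual potential attached to y
    for _ in range(n_iters):
        f = -eps * _logsumexp((g[None, :] - cost) / eps + log_b[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - cost) / eps + log_a[:, None], axis=0)

    # Dual value <a, f> + <b, g>; the weights are uniform, so this is two means.
    return float(np.mean(f) + np.mean(g))


def sinkhorn_divergence(x, y, eps=0.1, n_iters=200):
    # Debiased form: OT(x, y) - 0.5 * OT(x, x) - 0.5 * OT(y, y).
    return (
        entropic_ot(x, y, eps, n_iters)
        - 0.5 * entropic_ot(x, x, eps, n_iters)
        - 0.5 * entropic_ot(y, y, eps, n_iters)
    )


# Toy stand-ins for teacher and student sentence embeddings (hypothetical data).
rng = np.random.default_rng(0)
teacher_emb = rng.normal(size=(8, 4))
student_emb = teacher_emb + 0.5 * rng.normal(size=(8, 4))
print(sinkhorn_divergence(teacher_emb, student_emb))
```

In a training setup, a quantity like this would typically be computed in an automatic-differentiation framework so that it can act as a loss term on the student; the NumPy version above only illustrates the computation.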

Improving sentence representation for Vietnamese natural language understanding using optimal transport