Abstract: The advancement of Large Language Models (LLMs) has significantly transformed natural language processing, but the focus on English-centric models has left a noticeable research gap for other languages, including Vietnamese. To address this gap, this paper presents vi-mistral-x, a Large Language Model designed expressly for the Vietnamese language. The model is based on the Mistral architecture, which incorporates grouped-query attention and sliding window attention, and it introduces an additional phase of continual pre-training adapted specifically to Vietnamese. This phase enhances the model's ability to understand complex language nuances and to generate accurate, context-aware Vietnamese text, marking a significant step forward in Vietnamese language understanding and generation. In comprehensive testing on various benchmarks, vi-mistral-x outperforms existing Vietnamese LLMs in several key areas, including text classification, question answering, and text generation. In particular, on the Vietnamese Multitask Language Understanding (VMLU) benchmark, vi-mistral-x sets a new standard, significantly outperforming other available models. This paper highlights the critical role of continual pre-training in advancing language-specific LLMs and opens new avenues for the development of multilingual models. We aim for vi-mistral-x not only to be an important asset for processing Vietnamese but also to encourage further work on large language models for underrepresented languages.

Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding

MaLLaM -- Malaysia Large Language Model

Multi-Lingual Malaysian Embedding: Leveraging Large Language Models for Semantic Representations

MMMModal -- Multi-Images Multi-Audio Multi-turn Multi-Modal

Vi-Mistral-X: Building a Vietnamese Language Model with Advanced Continual Pre-training

BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains

Fine-tuning Large Language Models for Adaptive Machine Translation

Larger-Scale Transformers for Multilingual Masked Language Modeling

MaLA-500: Massive Language Adaptation of Large Language Models

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications

Personal Intelligence System UniLM: Hybrid On-Device Small Language Model and Server-Based Large Language Model for Malay Nusantara

Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough

PersianMind: A Cross-Lingual Persian-English Large Language Model

Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs

MistralBSM: Leveraging Mistral-7B for Vehicular Networks Misbehavior Detection

Performance of Recent Large Language Models for a Low-Resourced Language

Meltemi: The first open Large Language Model for Greek

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic

NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models

Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages