VinaLLaMA: LLaMA-based Vietnamese Foundation Model

Quan Nguyen, Huy Pham, Dung Dao
2023-12-18
Abstract: In this technical report, we present VinaLLaMA, an open-weight, state-of-the-art (SOTA) Large Language Model for the Vietnamese language, built upon LLaMA-2 with an additional 800 billion trained tokens. VinaLLaMA not only demonstrates fluency in Vietnamese but also exhibits a profound understanding of Vietnamese culture, making it a truly indigenous model. VinaLLaMA-7B-chat, trained on 1 million high-quality synthetic samples, achieves SOTA results on key benchmarks, including VLSP, VMLU, and Vicuna Benchmark Vietnamese, marking a significant advancement in the Vietnamese AI landscape and offering a versatile resource for various applications.
Computation and Language
What problem does this paper attempt to address?
The main goal of this paper is to introduce VinaLLaMA, an open-weight, state-of-the-art large language model (LLM) designed specifically for Vietnamese. VinaLLaMA is built on LLaMA-2 and further trained on an additional 800 billion tokens. It is not only fluent in Vietnamese but also shows a deep understanding of Vietnamese culture, making it a truly localized model. Specifically, the paper addresses the following issues:

1. **Language Inclusivity**: Pre-training on a large amount of Vietnamese data addresses the shortcomings of existing English-centric models in handling Vietnamese content.
2. **Cultural Understanding**: The model's improved understanding of Vietnamese culture makes it more suitable for localized applications.
3. **Benchmark Performance**: The model achieves state-of-the-art results on multiple Vietnamese benchmarks, including VLSP, VMLU, and Vicuna Benchmark Vietnamese.
4. **Bilingual Capability**: The paper develops a bilingual model that handles both Vietnamese and English, broadening its application scenarios.

With these improvements, VinaLLaMA not only processes Vietnamese more capably but also, on certain tasks, performs on par with or even surpasses ChatGPT-3.5-Turbo. The paper also describes the model's training process, dataset composition, and evaluation methods in detail, offering a useful resource for subsequent research.