Abstract: Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate text following natural language instructions. However, the development of LLMs has focused primarily on high-resource languages such as English, which limits their applicability and research in other languages. We therefore present PolyLM, a multilingual LLM trained on 640 billion (B) tokens and available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, covering multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
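
As a rough illustration of the curriculum described in the abstract, the following minimal sketch samples the language of each training example so that the non-English share is 30% in the first stage and 60% in the final stage. Only those two percentages come from the abstract; the stage names, the per-example sampling scheme, and the helper functions `sample_batch_languages` and `observed_share` are hypothetical and are not taken from the PolyLM implementation.

```python
import random

# Non-English share per pre-training stage. The two percentages come from
# the abstract; the stage names are assumptions made for this sketch.
NON_ENGLISH_SHARE = {"first": 0.30, "final": 0.60}


def sample_batch_languages(stage, batch_size, rng):
    """Label each slot in a batch as 'en' or 'non_en' for the given stage.

    Hypothetical helper: each slot is drawn independently, so the share is
    matched in expectation rather than exactly.
    """
    share = NON_ENGLISH_SHARE[stage]
    return ["non_en" if rng.random() < share else "en" for _ in range(batch_size)]


def observed_share(labels):
    """Return the fraction of 'non_en' labels in a batch."""
    return sum(1 for label in labels if label == "non_en") / len(labels)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
first = sample_batch_languages("first", 10_000, rng)
final = sample_batch_languages("final", 10_000, rng)
```

Calling `observed_share(first)` and `observed_share(final)` should give values close to 0.30 and 0.60. A real data pipeline would more likely mix pre-tokenized shards by weight, but the stage-dependent proportion is the same idea.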

PolyLM: An Open Source Polyglot Large Language Model