Xmodel-LM Technical Report

Yichuan Wang, Yang Liu, Yu Yan, Qun Wang, Xucheng Huang, Ling Jiang
2024-06-26
Abstract: We introduce Xmodel-LM, a compact and efficient 1.1B language model pre-trained on around 2 trillion tokens. Trained on our self-built dataset (Xdata), which balances Chinese and English corpora based on downstream task optimization, Xmodel-LM exhibits remarkable performance despite its smaller size. It notably surpasses existing open-source language models of similar scale. Our model checkpoints and code are publicly accessible on GitHub at https://github.com/XiaoduoAILab/XmodelLM.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper introduces Xmodel-LM, a compact and efficient language model with 1.1 billion parameters, pre-trained on approximately 2 trillion tokens. Despite its small size, Xmodel-LM performs competitively on multiple natural language processing benchmarks and surpasses existing open-source models of similar scale on certain metrics. The paper describes the data sources (a balanced mix of Chinese and English corpora), the data processing pipeline (cleaning, deduplication, and sampling), the design of a custom tokenizer, the model architecture (similar to LLaMA, combining techniques such as rotary positional embeddings and RMSNorm), and the training process (distributed data parallelism together with optimization techniques that improve efficiency). The model is evaluated on a range of common-sense reasoning and problem-solving tasks. Because it is small and efficient, it shows potential for practical applications.
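The two architectural components named above, RMSNorm and rotary positional embeddings, can be illustrated with a minimal pure-Python sketch. This is not the Xmodel-LM implementation: the function names `rms_norm` and `rotary_embed`, the `eps` and `base` defaults, and the interleaved pairing of dimensions are assumptions chosen for illustration, following the standard published formulations of both techniques.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a per-dimension gain.

    Illustrative only: `eps` and the list-based interface are assumptions.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rotary_embed(x, position, base=10000.0):
    """Rotate consecutive pairs (x[i], x[i+1]) by a position-dependent angle.

    Uses the interleaved pairing; pair k rotates by position * base**(-2k/d).
    Requires an even-length vector. `base` is an assumed default.
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("rotary_embed expects an even number of dimensions")
    out = list(x)
    for i in range(0, d, 2):  # i = 2k, so the exponent -i/d equals -2k/d
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


hidden = [3.0, 4.0, 0.0, 1.0]
gain = [1.0, 1.0, 1.0, 1.0]

normed = rms_norm(hidden, gain)
rotated = rotary_embed(hidden, position=5)
```

After `rms_norm` with a unit gain, the mean of the squared outputs is approximately 1. `rotary_embed` is a pure rotation, so it leaves the vector's Euclidean norm unchanged, and at position 0 it returns the input as-is.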