GEB-1.3B: Open Lightweight Large Language Model

Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu
2024-06-14
Abstract: Recently developed large language models (LLMs) such as ChatGPT, Claude, and Llama have demonstrated impressive abilities and even surpass human-level performance on several tasks. Despite their success, the resource-intensive demands of these models, which require significant computational power for both training and inference, limit their deployment to high-performance servers. Additionally, the extensive calculation requirements of the models often lead to increased latency in response times. With the increasing need for LLMs to operate efficiently on CPUs, research on lightweight models optimized for CPU inference has emerged. In this work, we introduce GEB-1.3B, a lightweight LLM trained on 550 billion tokens in Chinese and English. We employ novel training techniques, including RoPE, Group-Query-Attention, and FlashAttention-2, to accelerate training while maintaining model performance. Additionally, we fine-tune the model using 10 million samples of instruction data to enhance alignment. GEB-1.3B exhibits outstanding performance on general benchmarks such as MMLU, C-Eval, and CMMLU, outperforming comparable models such as MindLLM-1.3B and TinyLLaMA-1.1B. Notably, the FP32 version of GEB-1.3B achieves commendable inference times on CPUs, with ongoing efforts to further enhance speed through advanced quantization techniques. The release of GEB-1.3B as an open-source model marks a significant contribution to the development of lightweight LLMs, promising to foster further research and innovation in the field.
Computation and Language
What problem does this paper attempt to address?
This paper introduces GEB-1.3B, a lightweight large language model designed to address the high computational cost of existing LLMs by reducing latency and running efficiently on CPUs. GEB-1.3B has 1.3 billion parameters and is trained on 550 billion tokens of Chinese and English text. It uses techniques such as RoPE, Group-Query-Attention, and FlashAttention-2 to accelerate training, and is fine-tuned on 10 million instruction samples to improve alignment. The paper reports that GEB-1.3B performs well on general benchmarks such as MMLU, C-Eval, and CMMLU, surpassing similarly sized models such as MindLLM-1.3B and TinyLLaMA-1.1B. The FP32 version already achieves commendable inference times on CPUs, and the researchers plan to improve speed further through quantization. The paper also covers toxicity evaluation and CPU inference speed, reporting advantages for GEB-1.3B over larger models in both respects. In short, the paper's main objective is an efficient, lightweight language model that can run on a wide range of devices and support research and applications across fields.
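As a rough illustration of why a 1.3-billion-parameter model is a plausible CPU target and why quantization is attractive, the sketch below estimates weight-only memory at several precisions. The parameter count and the FP32 baseline come from the summary; the FP16, INT8, and INT4 entries are illustrative assumptions, not precisions the paper commits to, and the helper name `weight_memory_gb` is hypothetical. The estimate ignores activations, the KV cache, and any quantization metadata.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1_300_000_000  # 1.3B parameters, as stated in the summary

# FP32 is the precision reported for CPU inference; the remaining
# entries are assumed here purely for comparison.
PRECISIONS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

footprints = {
    name: weight_memory_gb(NUM_PARAMS, bits)
    for name, bits in PRECISIONS.items()
}

for name, gb in footprints.items():
    print(f"{name}: {gb:.2f} GB")
```

Under these assumptions the FP32 weights alone occupy about 5.2 GB, and each halving of the bit width halves that figure, which is the main reason quantization can help on memory-bound CPU inference.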