GEB-1.3B: Open Lightweight Large Language Model

Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu

2024-06-14

Abstract: Recently developed large language models (LLMs) such as ChatGPT, Claude, and Llama have demonstrated impressive abilities, even surpassing human-level performance on several tasks. Despite their success, these models require significant computational power for both training and inference, which limits their deployment to high-performance servers. Their heavy computation also often increases response latency. With the growing need for LLMs to run efficiently on CPUs, research on lightweight models optimized for CPU inference has emerged. In this work, we introduce GEB-1.3B, a lightweight LLM trained on 550 billion tokens in Chinese and English. We employ novel training techniques, including RoPE, Grouped-Query Attention, and FlashAttention-2, to accelerate training while maintaining model performance. Additionally, we fine-tune the model using 10 million samples of instruction data to enhance alignment. GEB-1.3B exhibits outstanding performance on general benchmarks such as MMLU, C-Eval, and CMMLU, outperforming comparable models such as MindLLM-1.3B and TinyLLaMA-1.1B. Notably, the FP32 version of GEB-1.3B achieves commendable inference times on CPUs, with ongoing efforts to further enhance speed through advanced quantization techniques. The release of GEB-1.3B as an open-source model marks a significant contribution to the development of lightweight LLMs, promising to foster further research and innovation in the field.

Computation and Language

What problem does this paper attempt to address?

This paper introduces GEB-1.3B, a lightweight large language model intended to address the high computational cost of existing LLMs by reducing latency and improving inference efficiency on CPUs. GEB-1.3B has 1.3 billion parameters and is trained on 550 billion tokens of Chinese and English text. It uses techniques such as RoPE, Grouped-Query Attention, and FlashAttention-2 to accelerate training, and is fine-tuned on 10 million instruction samples to better align it with human conversation patterns. The paper reports that GEB-1.3B performs well on general benchmarks such as MMLU, C-Eval, and CMMLU, surpassing similarly sized models such as MindLLM-1.3B and TinyLLaMA-1.1B. Although inference with the FP32 version is already fast on CPUs, the researchers plan to improve speed further through quantization. The paper also covers a toxicity evaluation of the model and its inference speed in CPU environments, presenting both as advantages of GEB-1.3B over larger models. Overall, the paper's main objective is to develop an efficient, lightweight language model that can run on a wide range of devices and support research and applications across fields.
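
As a rough illustration of why numeric precision matters for CPU deployment, the sketch below estimates the storage needed for the weights alone of a 1.3-billion-parameter model at FP32 and at a few lower precisions. The parameter count is assumed to be exactly 1.3e9 (taken from the model name), the lower-precision formats are illustrative assumptions rather than the quantization schemes the authors plan to use, and activations, KV cache, and runtime overhead are ignored.

```python
# Back-of-the-envelope weight-memory estimate for a 1.3B-parameter model.
# Assumptions (not from the paper): exactly 1.3e9 parameters, and only the
# weights are counted. Real memory use during inference would be higher.

PARAMS = 1.3e9  # nominal parameter count implied by the name "GEB-1.3B"

# Bytes needed to store one parameter at each precision (illustrative set).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return weight storage in gigabytes, using 1 GB = 1e9 bytes."""
    return params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(PARAMS, size) for name, size in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name}: {gb:.2f} GB")
```

Under these assumptions the FP32 weights take about 5.2 GB, and each halving of precision halves that figure, which is one reason quantization is a natural next step for CPU inference.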

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications

EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models

JetMoE: Reaching Llama2 Performance with 0.1M Dollars

GLM-130B: An Open Bilingual Pre-trained Model

ELMS: Elasticized Large Language Models On Mobile Devices

Understanding LLMs: A Comprehensive Overview from Training to Inference

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

FLM-101B: An Open LLM and How to Train It with $100K Budget

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment

OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning

Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

PermLLM: Private Inference of Large Language Models within 3 Seconds under WAN

An Empirical Analysis and Resource Footprint Study of Deploying Large Language Models on Edge Devices

Me LLaMA: Foundation Large Language Models for Medical Applications