52B to 1T: Lessons Learned via Tele-FLM Series

Xiang Li, Yiqun Yao, Xin Jiang, Xuezhi Fang, Chao Wang, Xinzhang Liu, Zihan Wang, Yu Zhao, Xin Wang, Yuyao Huang, Shuangyong Song, Yongxiang Li, Zheng Zhang, Bo Zhao, Aixin Sun, Yequan Wang, Zhongjiang He, Zhongyuan Wang, Xuelong Li, Tiejun Huang
2024-07-03
Abstract: Large Language Models (LLMs) represent a significant stride toward Artificial General Intelligence. As scaling laws underscore the potential of increasing model sizes, the academic community has intensified its investigations into LLMs with capacities exceeding 50 billion parameters. This technical report builds on our prior work with Tele-FLM (also known as FLM-2), a publicly available 52-billion-parameter model. We delve into two primary areas: we first discuss our observation of Supervised Fine-tuning (SFT) on Tele-FLM-52B, which supports the "less is more" approach for SFT data construction; second, we demonstrate our experiments and analyses on the best practices for progressively growing a model from 52 billion to 102 billion, and subsequently to 1 trillion parameters. We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper mainly addresses two issues:

1. **Exploration of Supervised Fine-tuning Strategies**: The researchers find that supervised fine-tuning (SFT) of a large language model can achieve good results with a relatively small but high-quality dataset. Specifically, they fine-tuned the 52-billion-parameter Tele-FLM model on a small amount of data covering mathematical problems, coding tasks, and multi-turn dialogues, and achieved performance similar to, or even better than, that obtained with larger datasets. This indicates that a small amount of instruction data is enough to draw out the strong capabilities of the base model, especially on conventional language understanding and generation tasks.
2. **Method of Gradually Increasing Model Size**: The paper also details the process of progressively growing the model from 52 billion parameters to 102 billion, and then to 1 trillion, while keeping the model's function consistent and training effective. The researchers used a technique called "Function-Preserving Growth," which lets the model retain the knowledge learned in previous stages while the number of parameters increases. In this way, they expanded the model from 52 billion to 1 trillion parameters, and they plan to open-source the final 1-trillion-parameter checkpoint, Tele-FLM-1T, to facilitate further research and development.

In summary, this paper explores how to use a small amount of high-quality data to improve the performance of large-scale language models, and it proposes a method for gradually increasing model size in order to overcome resource limitations and make the training of ultra-large-scale models feasible.
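To make the idea of function-preserving growth concrete, here is a minimal sketch of one well-known width-growth operator: duplicate hidden units, then split their outgoing weights so the output is unchanged. This is an illustration under stated assumptions, not the operator used for Tele-FLM, which this summary does not describe. The toy two-layer ReLU MLP and the names `mlp` and `widen` are hypothetical.

```python
import numpy as np

def mlp(x, w1, w2):
    """Toy two-layer ReLU MLP: x -> relu(x @ w1) @ w2."""
    return np.maximum(x @ w1, 0.0) @ w2

def widen(w1, w2, new_hidden, rng):
    """Grow the hidden width without changing the function the MLP computes.

    Illustrative assumption: new units are copies of randomly chosen existing
    units, and each unit's outgoing weights are divided by its copy count.
    """
    old_hidden = w1.shape[1]
    assert new_hidden >= old_hidden
    # Every added unit copies one randomly chosen existing unit.
    extra = rng.integers(0, old_hidden, size=new_hidden - old_hidden)
    mapping = np.concatenate([np.arange(old_hidden), extra])
    # Number of copies of each original unit after growth.
    counts = np.bincount(mapping, minlength=old_hidden)
    w1_new = w1[:, mapping]                             # copy incoming weights
    w2_new = w2[mapping, :] / counts[mapping][:, None]  # split outgoing weights
    return w1_new, w2_new

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w1 = rng.normal(size=(8, 16))
w2 = rng.normal(size=(16, 8))

w1_big, w2_big = widen(w1, w2, new_hidden=40, rng=rng)
max_diff = float(np.abs(mlp(x, w1, w2) - mlp(x, w1_big, w2_big)).max())
print(w1_big.shape, w2_big.shape, max_diff)
```

Because the copies of a unit produce identical activations and their outgoing weights sum to the original weight, the widened network matches the original up to floating-point rounding. Training can then continue from the larger model without losing what the smaller one learned.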