Aquila2 Technical Report

Bo-Wen Zhang, Liangdong Wang, Jijie Li, Shuhao Gu, Xinya Wu, Zhengduo Zhang, Boyan Gao, Yulong Ao, Guang Liu
2024-08-14
Abstract: This paper introduces the Aquila2 series, which comprises a wide range of bilingual models with parameter sizes of 7, 34, and 70 billion. These models are trained based on an innovative framework named HeuriMentor (HM), which offers real-time insights into model convergence and enhances the training process and data management. The HM System, comprising the Adaptive Training Engine (ATE), Training State Monitor (TSM), and Data Management Unit (DMU), allows for precise monitoring of the model's training progress and enables efficient optimization of data distribution, thereby enhancing training effectiveness. Extensive evaluations show that the Aquila2 model series performs comparably well on both English and Chinese benchmarks. Specifically, Aquila2-34B demonstrates only a slight decrease in performance when quantized to Int4. Furthermore, we have made our training code (https://github.com/FlagOpen/FlagScale) and model weights (https://github.com/FlagAI-Open/Aquila2) publicly available to support ongoing research and the development of applications.
Computation and Language
What problem does this paper attempt to address?
The paper introduces the Aquila2 series of models and the HeuriMentor framework used to train them. The Aquila2 series consists of large bilingual (Chinese and English) language models with 7 billion, 34 billion, and 70 billion parameters. The models are designed to improve training efficiency and model performance, especially when the composition of the training data changes during training.

### Main Contributions

1. **HeuriMentor Framework**: A framework that optimizes the training process and data management by monitoring model convergence in real time. It consists of three components:
   - **Adaptive Training Engine (ATE)**: Updates the data mixture used for model training based on the latest data sources.
   - **Training State Monitor (TSM)**: Evaluates, in real time, the state of the model trained by the ATE.
   - **Data Management Unit (DMU)**: Collects and organizes data from the internet and partners for model training.
2. **Aquila2 Model Series**: Models trained with the HeuriMentor framework perform well on both English and Chinese benchmarks. In particular, Aquila2-34B shows very little performance degradation when quantized to Int4 precision.

### Problems Addressed

- **Improving Training Efficiency**: Traditional training methods struggle to adapt to changes in data composition or to the integration of new data. The HeuriMentor framework addresses this by dynamically adjusting the data mixture, thereby improving training efficiency.
- **Resource-Intensive Training**: Training large language models typically requires substantial time and computational resources. The ATE, TSM, and DMU components are intended to let the Aquila2 models use these resources more efficiently.
- **Performance Optimization**: Experiments show that even with reduced training data, Aquila2-34B maintains good performance, exceeding baseline models in average score across 21 datasets.

### Training Configuration and Technical Details

- **Model Architecture**: The Aquila2 series adopts Grouped Query Attention and Rotary Position Embedding, which together improve model efficiency and capture positional patterns in sequential data.
- **Training Strategies**: Mixed-precision training, data parallelism, and tensor parallelism are employed, combined with a distributed optimizer, to enhance training efficiency.
- **Data Management**: Carefully designed data management strategies ensure that the model learns from high-quality, diverse datasets while also taking data security into account.

In summary, this research aims to improve the training efficiency and performance of large-scale language models by introducing the HeuriMentor framework and the corresponding training techniques and strategies, with a particular focus on effectively managing and utilizing training data.
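
To make the Int4 result mentioned above more concrete, the following is a minimal sketch of symmetric 4-bit quantization applied to a small weight vector. It is an illustration only: this summary does not say which quantization scheme was applied to Aquila2-34B, and the function names `quantize_int4` and `dequantize_int4`, as well as the single per-vector scale, are assumptions made for this example.

```python
def quantize_int4(weights):
    """Map floats to signed 4-bit integer codes in [-8, 7] with one shared scale.

    Hypothetical helper for illustration; not taken from the Aquila2 code base.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to +/-7 so that no value needs clipping.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int4(codes, scale):
    """Recover approximate float weights from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.42, -0.13, 0.05, -0.70, 0.31]
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)

# The rounding error of each weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

In this sketch every weight is reconstructed to within `scale / 2`. Practical Int4 schemes typically keep a separate scale for each small group of weights rather than one per vector, which tightens this bound further.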