OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch

Juntao Li, Zecheng Tang, Yuyang Ding, Pinzheng Wang, Pei Guo, Wangjie You, Dan Qiao, Wenliang Chen, Guohong Fu, Qiaoming Zhu, Guodong Zhou, Min Zhang
2023-10-02
Abstract: Large language models (LLMs) with billions of parameters have demonstrated outstanding performance on various natural language processing tasks. This report presents OpenBA, an open-sourced 15B bilingual asymmetric seq2seq model, to contribute an LLM variant to the Chinese-oriented open-source model community. We enhance OpenBA with effective and efficient techniques and adopt a three-stage training strategy to train the model from scratch. Our solution achieves very competitive performance with only 380B tokens: it is better than LLaMA-70B on the BELEBELE benchmark, BLOOM-176B on the MMLU benchmark, and GLM-130B on the C-Eval (hard) benchmark. This report provides the main details needed to pre-train an analogous model, including pre-training data processing, Bilingual Flan data collection, the empirical observations that inspire our model architecture design, the training objectives of the different stages, and other enhancement techniques. Additionally, we provide the fine-tuning details of OpenBA on four downstream tasks. We have refactored our code to follow the design principles of the Huggingface Transformers Library, making it more convenient for developers to use, and released checkpoints of different training stages at https://huggingface.co/openBA. More details of our project are available at https://github.com/OpenNLG/openBA.git.
Computation and Language
What problem does this paper attempt to address?
The paper proposes OpenBA, an open-source, bilingual, asymmetric sequence-to-sequence (seq2seq) model with 15 billion parameters that is pre-trained from scratch. OpenBA aims to fill a gap in the open-source community: high-quality large language models for Chinese, especially ones built on the encoder-decoder architecture. Specifically, the study covers the following points:

1. **Model Contribution**: OpenBA targets Chinese-language use and adopts an asymmetric encoder-decoder design, which is intended to improve performance on generative tasks.
2. **Dataset Construction**: The paper details how the pre-training data was collected and processed. This includes balanced English and Chinese text, plus a bilingual Flan dataset (BiFlan) covering many types of instructions and tasks, meant to improve the model's performance across tasks.
3. **Model Training**: The paper describes a three-stage training strategy: unsupervised pre-training with the UL2 objective, length-adaptive training, and bilingual Flan training. The stages are designed to improve the model step by step, especially on downstream tasks.
4. **Performance Evaluation**: Evaluations on multiple benchmarks (BELEBELE, MMLU, and C-Eval (hard) are named in the abstract) are used to demonstrate the model's effectiveness, particularly on multilingual understanding and generative tasks.
5. **Downstream Task Adaptation**: The paper also reports fine-tuning results for OpenBA on four downstream tasks: bilingual multi-turn dialogue, code generation, instruction generation, and tool retrieval.
In summary, the paper contributes a bilingual (Chinese-English) encoder-decoder language model to the open-source community, with an emphasis on Chinese, and gives developers and researchers released checkpoints and a documented training recipe to build on for a range of natural language processing tasks.
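To make the first training stage more concrete, here is a minimal sketch of the kind of denoising examples a UL2-style objective trains on: span corruption (a few short spans are replaced by sentinels in the encoder input and must be generated by the decoder) and a prefix-LM split (the decoder continues a given prefix). Everything specific in it is an assumption for illustration and does not come from the paper or the OpenBA code: the function names, the `<extra_id_k>` sentinel format (borrowed from the T5 convention), and the corruption rate and span length.

```python
import random


def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Build one span-corruption example (hypothetical helper, not OpenBA's code).

    Returns (encoder_input, decoder_target) as lists of string tokens.
    """
    rng = random.Random(seed)
    n = len(tokens)
    num_to_mask = max(1, round(n * corruption_rate))
    masked = [False] * n
    masked_count = 0
    attempts = 0
    # Place non-overlapping, non-adjacent spans until the mask budget is used.
    while masked_count < num_to_mask and attempts < 10 * n:
        attempts += 1
        span_len = max(1, min(mean_span_len, num_to_mask - masked_count))
        start = rng.randrange(0, n - span_len + 1)
        lo, hi = max(0, start - 1), min(n, start + span_len + 1)
        if any(masked[lo:hi]):
            continue  # would touch an existing span; try another position
        for i in range(start, start + span_len):
            masked[i] = True
        masked_count += span_len

    enc, dec = [], []
    sentinel_id = 0
    i = 0
    while i < n:
        if masked[i]:
            sentinel = f"<extra_id_{sentinel_id}>"  # assumed sentinel format
            sentinel_id += 1
            enc.append(sentinel)   # encoder sees only the placeholder
            dec.append(sentinel)   # decoder emits placeholder, then the span
            while i < n and masked[i]:
                dec.append(tokens[i])
                i += 1
        else:
            enc.append(tokens[i])
            i += 1
    return enc, dec


def prefix_lm(tokens, split_ratio=0.5):
    """Build one prefix-LM example: the decoder must continue the prefix."""
    cut = max(1, int(len(tokens) * split_ratio))
    sentinel = "<extra_id_0>"
    return tokens[:cut] + [sentinel], [sentinel] + tokens[cut:]


def reconstruct(enc, dec):
    """Invert either transformation; useful as a sanity check."""
    spans, current = {}, None
    for tok in dec:
        if tok.startswith("<extra_id_"):
            current = tok
            spans[current] = []
        else:
            spans[current].append(tok)
    out = []
    for tok in enc:
        out.extend(spans[tok] if tok in spans else [tok])
    return out


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
    enc, dec = span_corrupt(text.split(), corruption_rate=0.3, seed=1)
    print("encoder input :", " ".join(enc))
    print("decoder target:", " ".join(dec))
```

In a seq2seq model the encoder reads the corrupted input and the decoder is trained to produce the target, which is one reason an encoder-decoder architecture pairs naturally with this family of objectives. The exact denoiser mixture and settings used for OpenBA are described in the paper itself and are not reproduced here.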