Abstract: Large language models (LLMs) with billions of parameters have demonstrated outstanding performance on various natural language processing tasks. This report presents OpenBA, an open-sourced 15B bilingual asymmetric seq2seq model, to contribute an LLM variant to the Chinese-oriented open-source model community. We enhance OpenBA with effective and efficient techniques and adopt a three-stage training strategy to train the model from scratch. Our solution achieves very competitive performance with only 380B tokens: it outperforms LLaMA-70B on the BELEBELE benchmark, BLOOM-176B on the MMLU benchmark, and GLM-130B on the C-Eval (hard) benchmark. This report provides the main details needed to pre-train an analogous model, including pre-training data processing, Bilingual Flan data collection, the empirical observations that inspire our model architecture design, the training objectives of the different stages, and other enhancement techniques. We also provide the fine-tuning details of OpenBA on four downstream tasks. We have refactored our code to follow the design principles of the Huggingface Transformers Library, making it more convenient for developers to use, and released checkpoints of different training stages at https://huggingface.co/openBA. More details of our project are available at https://github.com/OpenNLG/openBA.git.

What problem does this paper attempt to address?

The main goal of this paper is to propose a new open-source, bilingual asymmetric sequence-to-sequence (seq2seq) model named OpenBA, which has 15 billion parameters and is pre-trained from scratch. OpenBA aims to fill the gap in the current open-source community for high-quality large language models in the Chinese domain, especially models based on the encoder-decoder architecture. Specifically, this study addresses the following key issues:

1. **Model Contribution**: As a large language model, OpenBA particularly emphasizes its application capabilities in the Chinese domain and its asymmetric encoder-decoder structural design, which helps improve the ability to perform generative tasks.
2. **Dataset Construction**: The paper details how the data used for pre-training was collected and processed, including balanced English and Chinese text data, as well as a bilingual Flan dataset (BiFlan) that contains various types of instructions and tasks, aimed at enhancing the model's performance across different tasks.
3. **Model Training**: The paper describes a three-stage training strategy, including unsupervised pre-training (UL2), length-adaptive training, and bilingual Flan training. These stages are designed to gradually optimize the model's performance, especially in downstream tasks.
4. **Performance Evaluation**: Through evaluations on multiple benchmarks, the effectiveness and superiority of the OpenBA model are demonstrated, particularly in multilingual understanding and generative tasks.
5. **Downstream Task Adaptation**: The paper further showcases the fine-tuning effects of the OpenBA model on four specific downstream tasks, including bilingual multi-turn dialogue, code generation, instruction generation, and tool retrieval.

In summary, this paper aims to promote research and development in the field of natural language processing by providing a powerful bilingual language model, particularly in the Chinese domain, while offering a robust tool for developers and researchers to support various natural language processing tasks.
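
To make the UL2-style pre-training stage mentioned above more concrete, here is a minimal sketch of span-corruption denoising, the kind of objective an encoder-decoder model is typically trained on. It is an illustration under stated assumptions, not OpenBA's actual implementation: the function name `corrupt_spans`, the sentinel format `<S_i>`, and the example spans are all hypothetical, and the summary above does not give the span lengths or corruption rates OpenBA uses.

```python
from typing import List, Sequence, Tuple


def corrupt_spans(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each (start, length) span with a sentinel token.

    Returns (encoder_input, decoder_target). The sentinel format "<S_i>"
    is a hypothetical placeholder, not OpenBA's real vocabulary.
    """
    encoder_input: List[str] = []
    decoder_target: List[str] = []
    cursor = 0
    for sentinel_id, (start, length) in enumerate(sorted(spans)):
        if start < cursor or length <= 0 or start + length > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<S_{sentinel_id}>"
        # Keep the untouched tokens, then mark where the span was removed.
        encoder_input.extend(tokens[cursor:start])
        encoder_input.append(sentinel)
        # The decoder must reproduce the removed tokens after the same sentinel.
        decoder_target.append(sentinel)
        decoder_target.extend(tokens[start:start + length])
        cursor = start + length
    encoder_input.extend(tokens[cursor:])
    return encoder_input, decoder_target


tokens = "the quick brown fox jumps over the lazy dog".split()

# Short scattered spans: regular span corruption.
enc, dec = corrupt_spans(tokens, [(1, 2), (6, 2)])
print(enc)  # ['the', '<S_0>', 'fox', 'jumps', 'over', '<S_1>', 'dog']
print(dec)  # ['<S_0>', 'quick', 'brown', '<S_1>', 'the', 'lazy']

# One span running to the end: a prefix-LM style (sequential) denoising example.
prefix_enc, prefix_dec = corrupt_spans(tokens, [(4, len(tokens) - 4)])
print(prefix_enc)  # ['the', 'quick', 'brown', 'fox', '<S_0>']
print(prefix_dec)  # ['<S_0>', 'jumps', 'over', 'the', 'lazy', 'dog']
```

The spans are passed in explicitly so the example stays deterministic; a real data pipeline would sample span positions and lengths at random and operate on token IDs rather than strings.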

OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch

OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series

Bailong: Bilingual Transfer Learning based on QLoRA and Zip-tie Embedding

OpenChat: Advancing Open-source Language Models with Mixed-Quality Data

Baichuan 2: Open Large-scale Language Models

YuLan: An Open-source Large Language Model

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

PolyLM: An Open Source Polyglot Large Language Model

Panda LLM: Training Data and Evaluation for Open-Sourced Chinese Instruction-Following Large Language Models

BigTranslate: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages

SeqGPT: An Out-of-the-box Large Language Model for Open Domain Sequence Understanding

Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca

LOLA -- An Open-Source Massively Multilingual Large Language Model

BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment

Tele-FLM Technical Report

GLM-130B: An Open Bilingual Pre-trained Model

Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

YAYI 2: Multilingual Open-Source Large Language Models