Abstract: With the advancement of large language model technology, these models have shown capabilities that come close to those of human beings across various tasks. This achievement has garnered significant interest from companies and scientific research institutions, leading to substantial investment in the research and development of these models. While numerous large models have emerged during this period, the majority have been trained primarily on English data. Although they exhibit decent performance in other languages, such as Chinese, their potential remains limited by factors such as vocabulary design and training corpus, so they cannot fully express their capabilities in Chinese. To address this issue, we introduce JIANG (the Chinese pinyin for ginger), a model specifically designed for the Chinese language. We have gathered a substantial Chinese corpus to train the model and have also optimized its structure. Extensive experimental results demonstrate the excellent performance of our model.

What problem does this paper attempt to address?

The main goal of this paper is to introduce JIANG, a large-scale language model designed specifically for Chinese. Most current large-scale language models are trained primarily on English corpora and therefore perform suboptimally on Chinese tasks; the authors aim to address this limitation by building a model focused on Chinese. Specifically, the paper addresses the following key issues:

1. **Language model optimized for Chinese**: Most available large language models are trained primarily on English datasets, which limits their performance on Chinese tasks. A model optimized specifically for Chinese is therefore needed.
2. **High-quality Chinese corpus**: To train a high-performance Chinese model, the authors collected a large amount of Chinese text, including internet texts, Wikipedia, and financial data, and applied strict quality control to it.
3. **Optimization of model structure and training techniques**: JIANG adopts a network design based on the Transformer architecture and modifies it in several ways, such as partially removing bias terms in fully connected layers, using RMSNorm layers, and introducing gating mechanisms, to improve the model's performance and generalization ability.
4. **Experimental validation**: The paper provides detailed experimental results demonstrating JIANG's performance on multiple Chinese natural language processing tasks, especially inference tasks, where it shows a significant advantage over other models.

In summary, this research aims to advance Chinese natural language processing by developing a high-quality language model designed specifically for Chinese.
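The structural changes named in the summary (RMSNorm, bias-free fully connected layers, and a gating mechanism) can be illustrated with a minimal pure-Python sketch. This is not JIANG's actual implementation: the summary does not specify the form of the gate or which bias terms are removed, so the SiLU-gated feed-forward block below, the removal of all biases in it, and every function and parameter name are assumptions chosen for illustration.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a per-feature gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def matvec(w, x):
    """Bias-free linear layer: y = W x, with no '+ b' term."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v):
    """SiLU activation, v * sigmoid(v). Assumed here; the source names no activation."""
    return v / (1.0 + math.exp(-v))


def gated_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block (assumed form): W_down (silu(W_gate x) * W_up x)."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Usage: normalize a 2-dimensional input, then pass it through the gated block.
x = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
w_gate = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]  # hypothetical 3x2 weights
w_up = [[1.0, 1.0], [0.5, 0.5], [-1.0, 1.0]]    # hypothetical 3x2 weights
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]     # hypothetical 2x3 weights
y = gated_ffn(x, w_gate, w_up, w_down)
```

In this sketch the gate branch decides, feature by feature, how much of the `w_up` projection passes through to the output, and dropping the bias vectors removes a small number of parameters from each layer.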

JIANG: Chinese Open Foundation Language Model

YuLan: An Open-source Large Language Model

Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence

Panda LLM: Training Data and Evaluation for Open-Sourced Chinese Instruction-Following Large Language Models

YAYI 2: Multilingual Open-Source Large Language Models

Yi: Open Foundation Models by 01.AI

FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models

An Improved Traditional Chinese Evaluation Suite for Foundation Model

Skywork: A More Open Bilingual Foundation Model

ZhongJing: A Locally Deployed Large Language Model for Traditional Chinese Medicine and Corresponding Evaluation Methodology: A large language model fine-tuned on data from the field of Traditional Chinese Medicine and a new evaluation method called TCMEval are proposed

C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models

Qwen Technical Report

WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models

Baichuan 2: Open Large-scale Language Models

Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models

Xmodel-1.5: An 1B-scale Multilingual LLM

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

JoyHallo: Digital human model for Mandarin

LangGFM: A Large Language Model Alone Can be a Powerful Graph Foundation Model