Abstract: Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlining PAGnol. We fit a scaling law for compute for the French language, and compare it with its English counterpart. We find the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstract summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large are made publicly available.

What problem does this paper attempt to address?

The main problems this paper attempts to address are:

1. **Development and Optimization of Large-scale French Pretrained Models**
   - The paper introduces PAGnol, a series of French language generation models based on the GPT architecture. Using scaling laws, the authors efficiently trained PAGnol-XL (1.5 billion parameters) with the same compute budget as CamemBERT, a model 13 times smaller.
   - PAGnol-XL is currently the largest non-sparse French language model, and the authors plan to explore larger and more capable versions in the future.
2. **Selection and Processing of Pretraining Datasets**
   - The authors emphasize that the quality of the pretraining dataset strongly affects model output. Common datasets such as OSCAR can lead to low-quality, offensive generated text, so they chose CCNet, a more heavily filtered dataset, for pretraining.
3. **Model Performance Evaluation**
   - The authors evaluated PAGnol on discriminative tasks (such as FLUE) and generative tasks (such as FQuAD and OrangeSum summarization), and compared it with other state-of-the-art French and multilingual models. PAGnol reaches a new state of the art on the summarization task.
4. **Application of Scaling Laws**
   - The authors use scaling laws to choose training settings that make the best use of the available compute. PAGnol-XL was trained with a budget of only 3 PF-days, the same as CamemBERT's.
5. **Research on Prompt Tuning**
   - The authors explored prompt tuning on PAGnol, that is, inserting randomly initialized vectors into the input sequence and optimizing their values while keeping the pretrained model weights frozen. Although this works well on the smaller models, performance degrades on PAGnol-XL, and the authors suspect a bug in the implementation.

Through these studies, the authors aim to advance French natural language processing, especially performance on generative tasks.
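To make the compute figures in point 4 concrete, the following is a minimal back-of-the-envelope sketch, not the authors' code. It assumes the common approximation that training a dense transformer costs roughly C ≈ 6·N·D FLOPs (N parameters, D training tokens); the paper's fitted scaling law may differ, and the function names here are hypothetical. The token count it yields is only an illustration of what a 3 PF-day budget implies under that assumption, not a figure reported by the paper.

```python
# Rough compute arithmetic under the ASSUMED approximation C ~= 6 * N * D.
# Function names are illustrative; none of this comes from the PAGnol codebase.

# One PF-day: 1e15 floating-point operations per second, sustained for one day.
PF_DAY_FLOPS = 1e15 * 86_400


def pf_days_to_flops(pf_days: float) -> float:
    """Convert a budget in PF-days to a total FLOP count."""
    return pf_days * PF_DAY_FLOPS


def tokens_for_budget(pf_days: float, n_params: float) -> float:
    """Tokens D affordable for a model of n_params under C ~= 6 * N * D."""
    return pf_days_to_flops(pf_days) / (6.0 * n_params)


if __name__ == "__main__":
    budget_pf_days = 3.0   # budget stated in the summary above
    n_params = 1.5e9       # PAGnol-XL size stated in the abstract
    tokens = tokens_for_budget(budget_pf_days, n_params)
    print(f"~{tokens / 1e9:.1f}B tokens under the 6*N*D assumption")
```

Under this assumption, a fixed budget trades model size against training tokens: a model 13 times smaller could be trained on about 13 times as many tokens for the same number of PF-days.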

PAGnol: An Extra-Large French Generative Model
