A Survey of Large Language Models for European Languages

Wazir Ali, Sampo Pyysalo
2024-08-28
Abstract: Large Language Models (LLMs) have gained significant attention since the release of ChatGPT due to their high performance on a wide range of natural language tasks. LLMs learn to understand and generate language by training billions of model parameters on vast volumes of text data. Despite being a relatively new field, LLM research is rapidly advancing in various directions. In this paper, we present an overview of LLM families, including LLaMA, PaLM, GPT, and MoE, and the methods developed to create and enhance LLMs for official European Union (EU) languages. We provide a comprehensive summary of common monolingual and multilingual datasets used for pretraining large language models.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the development and application of large language models (LLMs) for the official languages of the European Union (EU). Specifically:

- **Overview of LLM families**: The paper surveys LLM families, including LLaMA, PaLM, GPT, and MoE, and the methods used to create and enhance these models.
- **Pre-training datasets**: It summarizes the common monolingual and multilingual datasets used for pre-training LLMs, with a particular focus on datasets for official EU languages.
- **Classification of language resources**: Based on the availability of language resources (such as large-scale unannotated corpora, annotated datasets, and the language tools required for NLP tasks), EU languages are classified as low-resource, medium-resource, or high-resource.

Through these efforts, the paper aims to fill a research gap regarding the development and resources of LLMs for European languages and to provide a comprehensive reference for future researchers.