Abstract: Pre-trained language models have recently advanced the field of natural language processing (NLP). The introduction of Bidirectional Encoder Representations from Transformers (BERT) and its optimized version, RoBERTa, has had a significant impact and increased the relevance of pre-trained models. Research in this field initially focused on English data, followed by models trained on multilingual text corpora. However, current research shows that multilingual models are inferior to monolingual models. To date, no monolingual German RoBERTa model has been published; we introduce one in this work (GottBERT). The German portion of the OSCAR data set was used as the text corpus. In an evaluation, we compare its performance on the two Named Entity Recognition (NER) tasks CoNLL 2003 and GermEval 2014, as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD, with existing monolingual German BERT models and two multilingual ones. GottBERT was pre-trained following the original RoBERTa model using fairseq. All downstream tasks were trained using hyperparameter presets taken from the benchmark of German BERT. The experiments were set up using FARM. Performance was measured by the $F_{1}$ score. GottBERT was successfully pre-trained on a 256-core TPU pod using the RoBERTa BASE architecture. Even without extensive hyperparameter optimization, GottBERT already outperformed all other tested German and multilingual models in all NER tasks and in one text classification task. To support the German NLP field, we publish GottBERT under the AGPLv3 license.

GottBERT: a pure German Language Model
