Abstract: With recent advancements in the area of Natural Language Processing, the focus is slowly shifting from a purely English-centric view towards more language-specific solutions, including German. Text summarization systems, which transform long input documents into compressed and more digestible summary texts, are especially practical for businesses that need to analyze their growing amounts of textual data. In this work, we assess the particular landscape of German abstractive text summarization and investigate why practically useful solutions for abstractive text summarization are still absent in industry. Our focus is two-fold, analyzing a) training resources, and b) publicly available summarization systems. We show that popular existing datasets exhibit crucial flaws in their assumptions about the original sources, which frequently lead to detrimental effects on system generalization and to evaluation biases. We confirm that for the most popular training dataset, MLSUM, over 50% of the training set is unsuitable for abstractive summarization purposes. Furthermore, available systems frequently fail to compare to simple baselines, and ignore more effective and efficient extractive summarization approaches. We attribute poor evaluation quality to a variety of factors, which are investigated in more detail in this work: a lack of high-quality (and diverse) gold data considered for training, understudied (and untreated) positional biases in some of the existing datasets, and the lack of easily accessible and streamlined pre-processing strategies or analysis tools. We provide a comprehensive assessment of available models on the cleaned datasets, and find that evaluating on the cleaned data can lead to a reduction of more than 20 ROUGE-1 points.
The code for dataset filtering and reproducing results can be found online at https://github.com/dennlinger/summaries

On the State of German (Abstractive) Text Summarization
