RoGPT2: Romanian GPT2 for Text Generation

Mihai Alexandru Niculescu, Stefan Ruseti, Mihai Dascalu
DOI: https://doi.org/10.1109/ictai52525.2021.00183
2021-11-01
Abstract: Text generation is one of the most important and challenging tasks in NLP, where models have shown a significant performance increase in recent years. However, most generative models are available only for English, whereas low-resource languages like Romanian have no available alternatives. As such, we introduce RoGPT2, a Romanian version of the GPT2 model, trained on the largest corpus available for the Romanian language. Three versions of the model were trained, namely base (124M parameters), medium (354M parameters), and large (774M parameters). Six tasks from the LiRo benchmark were selected to test the performance and limitations of our model versus BERT-Base models for Romanian (RoBERT, BERT-ro-base, and RoDiBERT). RoGPT2 achieves similar or even better performance, except on zero-shot cross-lingual question answering. RoGPT2 also obtains state-of-the-art results for grammar error correction (RoGEC) using the RONACC corpus, which supports the model's capability to generate grammatically correct text (F0.5 = 69.01). In addition, we introduce two use cases in which we showcase the different versions and explore the extent to which RoGPT2 is able to continue Romanian news articles. After fine-tuning, the model generated fairly long texts that account for the context of the news.
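The abstract reports F0.5 without defining it. As a minimal sketch, assuming the standard F-beta definition commonly used for grammatical error correction (the paper's exact scorer is not described in this abstract), the metric combines precision and recall with beta = 0.5, so precision counts more than recall. The function name `f_beta` and the sample precision/recall values are hypothetical and are not taken from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Return the F-beta score; beta < 1 weights precision more than recall."""
    # Guard against division by zero when both inputs are zero.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values for illustration only, not results from the paper.
print(round(f_beta(0.8, 0.5), 4))
```

With beta = 0.5, a system that proposes fewer but more accurate corrections scores higher than one with the same recall and lower precision, which is one reason this weighting is often preferred for error-correction tasks.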