Universal Language Model Fine-Tuning with Subword Tokenization for Polish

Piotr Czapla, Jeremy Howard, Marcin Kardas
DOI: https://doi.org/10.48550/arXiv.1810.10222
2018-10-24
Abstract: Universal Language Model for Fine-tuning [arXiv:1801.06146] (ULMFiT) is one of the first NLP methods for efficient inductive transfer learning. Unsupervised pretraining results in improvements on many NLP tasks for English. In this paper, we describe a new method that uses subword tokenization to adapt ULMFiT to languages with high inflection. Our approach results in a new state-of-the-art for the Polish language, taking first place in Task 3 of PolEval'18. After further training, our final model outperformed the second-best model by 35%. We have open-sourced our pretrained models and code.
Computation and Language, Machine Learning