HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

György Orosz, Zsolt Szántó, Péter Berkecz, Gergő Szabó, Richárd Farkas
DOI: https://doi.org/10.48550/arXiv.2201.01956
2022-01-11
Abstract: Although there are a couple of open-source language processing pipelines available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should consist of close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition and word embeddings. Industrial text processing applications must also satisfy non-functional software quality requirements; moreover, frameworks supporting multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing toolkit. The presented tool provides components for the most important basic linguistic analysis tasks. It is open-source and is available under a permissive license. Our system is built upon spaCy's NLP components, resulting in an easily usable, fast yet accurate application. Experiments confirm that HuSpaCy has high accuracy while maintaining resource-efficient prediction capabilities.
Computation and Language, Machine Learning