Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

György Orosz, Gergő Szabó, Péter Berkecz, Zsolt Szántó, Richárd Farkas
DOI: https://doi.org/10.1007/978-3-031-40498-6_6
2023-08-24
Abstract: This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. The models are implemented in the spaCy framework and extend the HuSpaCy toolkit with several improvements to its architecture. In contrast to existing NLP tools for Hungarian, each of our pipelines covers all basic text processing steps, including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition, with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools, and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the problem of improving Hungarian text processing tools so that they balance resource efficiency and accuracy. It introduces a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance in all fundamental text processing steps (tokenization, sentence boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing, and named entity recognition) while keeping resource consumption low. The main contributions of the paper are:

1. **Model Improvements**: New neural architectures and subword embeddings significantly improve lemmatization and dependency parsing accuracy.
2. **Multi-task Learning**: A multi-task learning framework trains the models on multiple tasks simultaneously, enhancing overall performance.
3. **Resource Efficiency**: Models of different sizes, including transformer-based language models, let users balance operational costs against accuracy.
4. **Open Source and Reproducibility**: All experimental results are reproducible, and the models are freely available under a permissive license, facilitating research and application.

Through these improvements, the paper demonstrates the efficiency and accuracy of the new Hungarian text processing pipelines in practical applications, especially in resource-constrained industrial environments.
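
Because the pipelines are built on spaCy, the processing steps listed above map onto standard spaCy `Doc` and `Token` attributes. The following is a minimal sketch of how those annotations might be read out, not the authors' own code. The helper names (`token_rows`, `entity_spans`, `sentence_texts`, `run_demo`) are hypothetical, and the model name `hu_core_news_lg` is an assumption that does not appear in the text above; substitute whichever HuSpaCy model is installed.

```python
from typing import Any, Dict, List, Tuple


def token_rows(doc: Any) -> List[Dict[str, str]]:
    """Collect per-token annotations from a spaCy Doc (or any object
    exposing the same attributes): surface form, lemma, part-of-speech
    tag, morphological features, dependency label and syntactic head."""
    return [
        {
            "text": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "morph": str(token.morph),
            "dep": token.dep_,
            "head": token.head.text,
        }
        for token in doc
    ]


def entity_spans(doc: Any) -> List[Tuple[str, str]]:
    """Return (text, label) pairs for the named entities in the Doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def sentence_texts(doc: Any) -> List[str]:
    """Return the text of each detected sentence."""
    return [sent.text for sent in doc.sents]


def run_demo(model_name: str = "hu_core_news_lg") -> None:
    """Run a full pipeline on a short Hungarian text and print the results.

    Assumption: spaCy and a Hungarian model with this name are installed.
    spaCy is imported lazily so the helpers above work without it.
    """
    import spacy

    nlp = spacy.load(model_name)
    doc = nlp("Budapest Magyarország fővárosa. A Duna két partján fekszik.")
    for row in token_rows(doc):
        print(row)
    print(entity_spans(doc))
    print(sentence_texts(doc))
```

Calling `run_demo()` after installing a Hungarian model prints one row per token, followed by the recognized entities and the sentence segmentation. The helpers only read attributes, so they work unchanged across model sizes, which is convenient when comparing the cost and accuracy trade-off the paper describes.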