Advancing Hungarian Text Processing with HuSpaCy: Efficient and Accurate NLP Pipelines

György Orosz, Gergő Szabó, Péter Berkecz, Zsolt Szántó, Richárd Farkas

DOI: https://doi.org/10.1007/978-3-031-40498-6_6

2023-08-24

Abstract: This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature all basic text processing steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the problem of building Hungarian text processing tools that balance resource efficiency and accuracy. It introduces a series of industrial-grade models for Hungarian that achieve near state-of-the-art performance in all fundamental text processing steps (tokenization, sentence boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing, and named entity recognition) while keeping resource consumption low.

The main contributions of the paper are:

1. **Model Improvements**: Significant gains in lemmatization and dependency parsing accuracy through new neural architectures and subword embeddings.
2. **Multi-task Learning**: A multi-task learning framework that trains the models on several tasks simultaneously, improving overall performance.
3. **Resource Efficiency**: Models of different sizes, including transformer-based language models, so that users can trade operational cost against accuracy.
4. **Open Source and Reproducibility**: All experimental results are reproducible, and the models are freely available under a permissive license, which facilitates research and application.

Through these improvements, the paper demonstrates the efficiency and accuracy of the new Hungarian text processing pipelines in practical applications, especially in resource-constrained industrial environments.
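As a rough illustration of what these processing steps look like from the user's side, the following minimal sketch reads the annotations that a spaCy pipeline attaches to a document. It is not taken from the paper: the helper names `load_pipeline`, `token_rows` and `entity_spans` are hypothetical, and the model package name `hu_core_news_lg` is an assumption that should be checked against the HuSpaCy documentation. Only standard spaCy token and span attributes are used.

```python
from typing import Any, Dict, Iterable, List, Tuple


def load_pipeline(model_name: str = "hu_core_news_lg"):
    """Load a spaCy pipeline by package name.

    ASSUMPTION: "hu_core_news_lg" is an illustrative model name that does not
    appear in the text above. The package must be installed beforehand.
    """
    import spacy  # imported lazily so the helpers below work without spaCy

    return spacy.load(model_name)


def token_rows(doc: Iterable[Any]) -> List[Dict[str, str]]:
    """Collect the per-token annotations of a spaCy Doc.

    Covers the outputs of tokenization (text), lemmatization (lemma),
    part-of-speech tagging (pos), morphological tagging (morph) and
    dependency parsing (dep, head).
    """
    return [
        {
            "text": token.text,
            "lemma": token.lemma_,
            "pos": token.pos_,
            "morph": str(token.morph),
            "dep": token.dep_,
            "head": token.head.text,
        }
        for token in doc
    ]


def entity_spans(doc: Any) -> List[Tuple[str, str]]:
    """Return (text, label) pairs produced by named entity recognition."""
    return [(ent.text, ent.label_) for ent in doc.ents]
```

With a model installed, one would call `nlp = load_pipeline()` and `doc = nlp(text)`, then pass `doc` to `token_rows` and `entity_spans`. Sentence boundaries are available by iterating over `doc.sents`.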

HuSpaCy: an industrial-strength Hungarian natural language processing toolkit

LatinCy: Synthetic Trained Pipelines for Latin NLP

Design and implementation of an open source Greek POS Tagger and Entity Recognizer using spaCy

Approaches to sentiment analysis of Hungarian political news at the sentence level

ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing

BEA-Base: A Benchmark for ASR of Spontaneous Hungarian

Developing neural machine translation models for Hungarian-English

From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization

Syntax-based data augmentation for Hungarian-English machine translation

Biomedical and clinical English model packages for the Stanza Python NLP library

Neural machine translation for Hungarian

MAKING USE OF A ‘SPACY’ MODULE IN THE NATURAL LANGUAGE PROCESSING

Revisiting Supertagging for Faster HPSG Parsing

A State-of-the-Art Morphosyntactic Parser and Lemmatizer for Ancient Greek

Spaiche: Extending State-of-the-Art ASR Models to Swiss German Dialects

Exploiting limited data for parsing

Textflows: an open science NLP evaluation approach

Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language

What's Wrong with Hebrew NLP? And How to Make it Right

Growing Networks – Modelling the Growth of Word Association Networks for Hungarian and English