Abstract: Text categorization remains a formidable challenge in information retrieval, especially for low-resource languages such as Italian. This paper addresses the categorization of Italian news articles and the complexities arising from the language's structure and writing style. The methodology consists of preprocessing the text, generating word embeddings, combining them through feature engineering into document vectors, and training a classifier on those vectors. Each dataset is partitioned into a training set, used to fit the model, and a test set, used to assess its performance on unseen data. We assessed fifteen classifiers for categorizing Italian news articles, together with eight word embedding models and three approaches for combining word embeddings into document vectors. We compared established models such as Word2Vec and FastText with six novel Italian models pre-trained on native datasets. A significant highlight of our work is the introduction of an Italian GloVe model, previously unavailable for the Italian language. The datasets selected for testing are DICE, a dataset of 10,395 crime news articles extracted from an Italian newspaper, and RCV2-it, a collection of 28,405 Italian news stories released by the multinational media company Reuters Ltd. The best F-scores achieved in the tests were 84% and 93%. The results underscore the efficacy of the Support Vector Classification algorithm, while also revealing the inefficacy of the Gaussian Naive Bayes, Bernoulli Naive Bayes, and Decision Tree models for text categorization. The comparison of the word embedding models showed that Word2Vec and GloVe perform better than FastText.
The broader impact of this paper lies not only in advancing text categorization methodologies for Italian documents but also in enriching the linguistic landscape by releasing six novel Italian word embedding models.
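The pipeline summarized in the abstract (preprocessing, word embeddings, pooling into document vectors, classification) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the three-dimensional embedding values and the example sentences are invented, the abstract does not name the three combination approaches (mean, sum, and max pooling are shown only as common choices), and a simple nearest-centroid classifier stands in for the Support Vector Classification model so that the example needs only the standard library.

```python
import math
import re
from collections import defaultdict

# Hypothetical toy embeddings (3 dimensions). A real setup would load
# pre-trained Word2Vec, GloVe, or FastText vectors instead.
EMBEDDINGS = {
    "rapina": [0.9, 0.1, 0.0],
    "furto": [0.8, 0.2, 0.1],
    "polizia": [0.7, 0.3, 0.0],
    "borsa": [0.1, 0.9, 0.2],
    "mercato": [0.0, 0.8, 0.3],
    "azioni": [0.2, 0.9, 0.1],
}
DIM = 3


def tokenize(text):
    """Preprocessing: lowercase the text and keep runs of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def document_vector(tokens, pooling="mean"):
    """Combine the embeddings of known tokens into one document vector."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM  # no known word: fall back to the zero vector
    columns = list(zip(*vectors))
    if pooling == "mean":
        return [sum(c) / len(c) for c in columns]
    if pooling == "sum":
        return [sum(c) for c in columns]
    if pooling == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown pooling: {pooling}")


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(texts, labels, pooling="mean"):
    """Stand-in classifier: store the average document vector per label."""
    grouped = defaultdict(list)
    for text, label in zip(texts, labels):
        grouped[label].append(document_vector(tokenize(text), pooling))
    return {
        label: [sum(c) / len(c) for c in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(text, centroids, pooling="mean"):
    """Assign the label whose centroid is most similar to the document."""
    vec = document_vector(tokenize(text), pooling)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Invented training examples, one list entry per article.
train_texts = [
    "Rapina in centro, la polizia indaga",
    "Furto in una villa",
    "La borsa chiude in rialzo",
    "Il mercato delle azioni",
]
train_labels = ["crime", "crime", "finance", "finance"]

centroids = train_centroids(train_texts, train_labels)
print(predict("La polizia arresta l'autore del furto", centroids))
```

Changing the `pooling` argument in both `train_centroids` and `predict` switches the combination strategy without touching the rest of the pipeline, which is the kind of comparison the abstract describes across its three approaches.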

Non-Standard Words as Features for Text Categorization

Normalization of Non-Standard Words in Croatian Texts

CLDA: Feature Selection for Text Categorization Based on Constrained LDA

Albanian Text Classification: Bag of Words Model and Word Analogies

Non-Negative Sparse Semantic Coding for Text Categorization

Aggressive Dimensionality Reduction With Reinforcement Local Feature Selection For Text Categorization

A preliminary study of Croatian Language Syllable Networks

Text Categorization Can Enhance Domain-Agnostic Stopword Extraction

Lexical Diversity As a Lens into the Classification of Slavic Languages: A Quantitative Typology Perspective.

About Methods for Classifying Hidden Language Concepts in Specialized Texts Involving Pseudoinverse, Clustering and Data Grouping

New Textual Corpora for Serbian Language Modeling

Initial Comparison of Linguistic Networks Measures for Parallel Texts

CLASSLA-web: Comparable Web Corpora of South Slavic Languages Enriched with Linguistic and Genre Annotation

Toward Selectivity Based Keyword Extraction for Croatian News

Semantic similarity-aware feature selection and redundancy removal for text classification using joint mutual information

Optimizing Text Clustering Efficiency through Flexible Latent Dirichlet Allocation Method: Exploring the Impact of Data Features and Threshold Modification

A Non-VSM kNN algorithm for text classification

An Effective Feature Selection Method For Text Categorization

Domain and Language Independent Feature Extraction for Statistical Text Categorization

A Comparative Analysis of Word Embeddings Techniques for Italian News Categorization

Normalization of Lithuanian Text Using Regular Expressions