Abstract: Text categorization remains a challenging task in information retrieval, particularly for low-resource languages such as Italian. This paper examines the categorization of Italian news articles and addresses the complexities that arise from the language's structure and writing style. The methodology consists of preprocessing the text, generating word embeddings, performing feature engineering to extract meaningful representations, and training a classifier on the resulting document vectors. Model performance is evaluated on a partitioned dataset, with a training set used to fit the model and a test set used to assess its efficacy on unseen data. We assessed fifteen classifiers for the categorization of Italian news articles, together with eight word embedding models and three approaches for combining word embeddings into document vectors. We compared established models such as Word2Vec and FastText with six novel Italian models pre-trained on native datasets. A significant highlight of our work is the introduction of an Italian GloVe model, which was previously unavailable for the Italian language. The datasets selected for testing the models are DICE, a dataset of 10,395 crime news articles extracted from an Italian newspaper, and RCV2-it, a collection of 28,405 Italian news stories released by the multinational media company Reuters Ltd. The best F-scores achieved in the tests were 84% and 93%. The results underscore the efficacy of the Support Vector Classification algorithm, while also revealing the poor performance of the Gaussian Naive Bayes, Bernoulli Naive Bayes, and Decision Tree models in text categorization. The comparison of the word embedding models showed that Word2Vec and GloVe perform better than FastText.
The broader impact of this paper lies not only in advancing text categorization methodologies for Italian documents but also in enriching the linguistic landscape by releasing six novel Italian word embedding models.
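As a rough illustration of the pipeline described above (preprocessing, word embeddings, document vectors, classifier), the following minimal sketch uses only the Python standard library. Everything specific in it is an assumption for illustration and is not taken from the paper: the toy embedding values, the `mean`/`sum`/`max` pooling strategies (the abstract does not name its three combination approaches), and the nearest-centroid classifier, which stands in for the Support Vector Classification and other classifiers actually evaluated.

```python
import math

# Toy 3-dimensional embeddings with made-up values. A real run would load
# pre-trained Italian Word2Vec, FastText, or GloVe vectors instead.
EMBEDDINGS = {
    "furto": [0.9, 0.1, 0.0],
    "rapina": [0.8, 0.2, 0.1],
    "polizia": [0.7, 0.3, 0.0],
    "borsa": [0.1, 0.9, 0.2],
    "mercato": [0.0, 0.8, 0.3],
    "azioni": [0.2, 0.9, 0.1],
}
DIM = 3


def tokenize(text):
    """Lowercase, strip basic punctuation, keep tokens that have an embedding."""
    tokens = (t.strip(".,;:!?\"'()") for t in text.lower().split())
    return [t for t in tokens if t in EMBEDDINGS]


def document_vector(text, pooling="mean"):
    """Combine word vectors into one document vector (hypothetical strategies)."""
    vectors = [EMBEDDINGS[t] for t in tokenize(text)]
    if not vectors:
        return [0.0] * DIM  # no known words: fall back to the zero vector
    columns = list(zip(*vectors))  # one tuple of values per dimension
    if pooling == "mean":
        return [sum(c) / len(c) for c in columns]
    if pooling == "sum":
        return [sum(c) for c in columns]
    if pooling == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown pooling strategy: {pooling}")


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def train_centroids(docs, labels, pooling="mean"):
    """Stand-in classifier: average the document vectors of each class."""
    grouped = {}
    for text, label in zip(docs, labels):
        grouped.setdefault(label, []).append(document_vector(text, pooling))
    return {
        label: [sum(c) / len(c) for c in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(text, centroids, pooling="mean"):
    """Return the label whose centroid is most similar to the document."""
    vec = document_vector(text, pooling)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


train_docs = [
    "Furto e rapina, indaga la polizia.",
    "Borsa: mercato in rialzo, azioni su.",
]
train_labels = ["crime", "finance"]
centroids = train_centroids(train_docs, train_labels)
prediction = predict("La polizia sventa una rapina", centroids)
print(prediction)
```

In practice the toy table would be replaced by vectors loaded from a pre-trained model, and the centroid step by a trained classifier evaluated on a held-out test set, as the abstract describes.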

A Comparative Analysis of Word Embeddings Techniques for Italian News Categorization
