Multi-class Sports News Categorization using Machine Learning Techniques: Resource Creation and Evaluation

Adrita Barua, Omar Sharif, Mohammed Moshiul Hoque
DOI: https://doi.org/10.1016/j.procs.2021.11.002
2021-01-01
Procedia Computer Science
Abstract: The proliferation of the Internet and social media creates enormous amounts of textual data (specifically, news content) on the web. Most of this content is unstructured. Extracting meaningful insights from unstructured content by human labor is extremely hard and time-consuming, if not nearly impossible. Thus, automatic text classification has gained much attention from NLP experts in recent years. Several techniques have been developed to classify news text in high-resource languages (e.g., English, Chinese, French). However, the automatic classification of Bengali news text is still at a primitive stage. This paper investigates six popular machine learning techniques (including Logistic Regression (LR), Support Vector Classifier (SVC), Decision Tree (DT), Multinomial Naive Bayes (MNB), and Random Forest (RF)) with Term Frequency-Inverse Document Frequency (TF-IDF) features for automatic sports news classification in Bengali. Due to the unavailability of a benchmark corpus, this work also developed a Bengali news corpus (called BNeC) consisting of 43306 news documents with 202830 unique words in multiple classes: Cricket, Football, Tennis, and Athletics. Experimental results on the test dataset show that the Support Vector Classifier (SVC) with the unigram+bigram+trigram feature space obtained the highest weighted f1-score, 97.60%, among all classifiers and feature combinations.
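As a rough illustration of the two ideas the abstract relies on, the sketch below builds a unigram+bigram+trigram TF-IDF feature space and computes a support-weighted f1-score in plain Python. It is not the authors' implementation: the function names (`ngrams`, `tfidf`, `weighted_f1`), the whitespace tokenization, and the smoothed IDF formula are all assumptions made for the example, and the paper may use different variants.

```python
import math
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams (space-joined) for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(docs, n_min=1, n_max=3):
    """Map each document to a {term: weight} dict over its n-grams.

    Assumptions (not taken from the paper): whitespace tokenization,
    raw counts as TF, and a smoothed IDF of ln((1 + N) / (1 + df)) + 1.
    """
    tokenized = [ngrams(doc.split(), n_min, n_max) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for grams in tokenized:
        df.update(set(grams))
    vectors = []
    for grams in tokenized:
        tf = Counter(grams)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its count in y_true."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_true * f1
    return total / len(y_true)


# Hypothetical toy documents, used only to show the shape of the output.
toy_docs = ["late goal wins match", "spinner takes five wickets"]
toy_vectors = tfidf(toy_docs)
```

Each entry of `toy_vectors` is a sparse feature dictionary keyed by unigrams, bigrams, and trigrams; a classifier such as an SVC would consume these features once they are mapped to a fixed vocabulary. Real Bengali text would need a proper tokenizer rather than `str.split`.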