Abstract:South Africa and the Democratic Republic of Congo (DRC) present a complex linguistic landscape with languages such as Zulu, Sepedi, Afrikaans, French, English, and Tshiluba (Ciluba), which creates unique challenges for AI-driven translation and sentiment analysis systems due to a lack of accurately labeled data. This study seeks to address these challenges by developing a multilingual lexicon designed for French and Tshiluba, now expanded to include translations in English, Afrikaans, Sepedi, and Zulu. The lexicon enhances cultural relevance in sentiment classification by integrating language-specific sentiment scores. A comprehensive testing corpus is created to support translation and sentiment analysis tasks, with machine learning models such as Random Forest, Support Vector Machine (SVM), Decision Trees, and Gaussian Naive Bayes (GNB) trained to predict sentiment across low resource languages (LRLs). Among them, the Random Forest model performed particularly well, capturing sentiment polarity and handling language-specific nuances effectively. Furthermore, Bidirectional Encoder Representations from Transformers (BERT), a Large Language Model (LLM), is applied to predict context-based sentiment with high accuracy, achieving 99% accuracy and 98% precision, outperforming other models. The BERT predictions were clarified using Explainable AI (XAI), improving transparency and fostering confidence in sentiment classification. Overall, findings demonstrate that the proposed lexicon and machine learning models significantly enhance translation and sentiment analysis for LRLs in South Africa and the DRC, laying a foundation for future AI models that support underrepresented languages, with applications across education, governance, and business in multilingual contexts.

Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages

DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning

University of Cape Town's WMT22 System: Multilingual Machine Translation for Southern African Languages

A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

The University of Helsinki submissions to the WMT19 news translation task

A Case Study on Filtering for End-to-End Speech Translation

Toucan: Many-to-Many Translation for 150 African Language Pairs

NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis

A Data Selection Approach for Enhancing Low Resource Machine Translation Using Cross-Lingual Sentence Representations

UBC-DLNLP at SemEval-2023 Task 12: Impact of Transfer Learning on African Sentiment Analysis

UIO at SemEval-2023 Task 12: Multilingual fine-tuning for sentiment classification in low-resource languages

Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Model

Low-Resourced Machine Translation for Senegalese Wolof Language

AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages

Low-Resource Neural Machine Translation for Southern African Languages

Microsoft's Submission to the WMT2018 News Translation Task: How I Learned to Stop Worrying and Love the Data

A Multilingual Sentiment Lexicon for Low-Resource Language Translation using Large Languages Models and Explainable AI

Masakhane-Afrisenti at SemEval-2023 Task 12: Sentiment Analysis using Afro-centric Language Models and Adapters for Low-resource African Languages

Multilingual Multimodal Learning with Machine Translated Text

Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya

Findings of the Covid-19 MLIA Machine Translation Task