Abstract: South Africa and the Democratic Republic of Congo (DRC) present a complex linguistic landscape with languages such as Zulu, Sepedi, Afrikaans, French, English, and Tshiluba (Ciluba). This diversity creates unique challenges for AI-driven translation and sentiment analysis systems because accurately labeled data is scarce. This study addresses these challenges by developing a multilingual lexicon originally designed for French and Tshiluba and now expanded to include translations in English, Afrikaans, Sepedi, and Zulu. The lexicon enhances cultural relevance in sentiment classification by integrating language-specific sentiment scores. A comprehensive testing corpus is created to support translation and sentiment analysis tasks, and machine learning models, namely Random Forest, Support Vector Machine (SVM), Decision Trees, and Gaussian Naive Bayes (GNB), are trained to predict sentiment across low-resource languages (LRLs). Among them, the Random Forest model performed particularly well, capturing sentiment polarity and handling language-specific nuances effectively. Furthermore, Bidirectional Encoder Representations from Transformers (BERT), a Large Language Model (LLM), is applied to predict context-based sentiment, achieving 99% accuracy and 98% precision and outperforming the other models. The BERT predictions are explained using Explainable AI (XAI), improving transparency and fostering confidence in sentiment classification. Overall, the findings demonstrate that the proposed lexicon and machine learning models significantly enhance translation and sentiment analysis for LRLs in South Africa and the DRC, laying a foundation for future AI models that support underrepresented languages, with applications across education, governance, and business in multilingual contexts.
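To make the idea of language-specific sentiment scores concrete, the following is a minimal sketch of a lexicon lookup, not the authors' implementation. The names `LEXICON`, `score_text`, and `classify`, the threshold value, and all entries and scores are hypothetical placeholders rather than data from the study's lexicon. Only French, English, and Afrikaans entries are shown; Tshiluba, Sepedi, and Zulu entries would be added in the same way.

```python
import string
from typing import Dict

# Hypothetical toy lexicon: language code -> {word: sentiment score in [-1, 1]}.
# Entries and scores are illustrative placeholders, not the study's data.
LEXICON: Dict[str, Dict[str, float]] = {
    "fr": {"bon": 1.0, "mauvais": -1.0, "heureux": 0.8, "triste": -0.8},
    "en": {"good": 1.0, "bad": -1.0, "happy": 0.8, "sad": -0.8},
    "af": {"goed": 1.0, "sleg": -1.0, "gelukkig": 0.8, "hartseer": -0.8},
}


def score_text(text: str, lang: str) -> float:
    """Average the lexicon scores of the known words in `text`.

    Returns 0.0 when no word of the text appears in the lexicon.
    """
    entries = LEXICON[lang]
    # Lowercase, split on whitespace, and strip surrounding punctuation.
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    scores = [entries[w] for w in words if w in entries]
    return sum(scores) / len(scores) if scores else 0.0


def classify(text: str, lang: str, threshold: float = 0.1) -> str:
    """Map the averaged score to a polarity label (threshold is an assumption)."""
    score = score_text(text, lang)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify("Le film est bon.", "fr"))
    print(classify("Die kos is sleg.", "af"))
```

A score of this kind could also serve as one input feature for the classifiers named in the abstract. The study's actual feature design is not described here, so that use remains an assumption.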

A Multilingual Sentiment Lexicon for Low-Resource Language Translation using Large Languages Models and Explainable AI
