Abstract: The Swahili corpus is a dataset built by collecting written Kiswahili sentences from different sectors that produce Kiswahili documents. A corpus of the target language is needed in Natural Language Processing (NLP) tasks so that an algorithm can be fitted to that language before a model is trained. The generated corpus contains 1,693,228 sentences with 39,639,824 words, of which 871,452 are unique. It is exported as a text file with a storage size of 168 MB. The sentences were collected from different sources in the following categories: Health (AFYA), Business and Industries (BIASHARA), Parliament (BUNGE), Religion (DINI), Education (ELIMU), News (HABARI), Agriculture (KILIMO), Social Media (MITANDAO), Non-Governmental Organizations (MASHIRIKA YA KIRAIA), Government (SERIKALI), Laws (SHERIA) and Politics (SIASA).
This abstract outlines the systematic data collection process used to create the Swahili corpus from multiple public websites and reports. The corpus was compiled to represent the diverse linguistic contexts and topics relevant to the Swahili language. Data collection began with the identification of suitable sources across various domains, including news articles, health publications, online forums, and governmental public reports. Websites and platforms with publicly available Swahili content were systematically crawled and archived to capture a broad spectrum of linguistic expressions. Special attention was given to reputable sources to maintain the authenticity and linguistic richness of the corpus. The inclusion of diverse sources ensures that the corpus reflects the linguistic nuances of different contexts and registers within the Swahili language. Efforts were also made to incorporate variations in domain dialects, acknowledging the linguistic diversity present in Swahili.
The corpus has broad reuse potential. Researchers, linguists, and language enthusiasts can use this diverse and extensive dataset for NLP tasks such as sentiment analysis, text clustering, classification, and machine translation. The corpus can serve as training data for developing and evaluating NLP algorithms, including part-of-speech tagging and named entity recognition. Text mining techniques can also be applied to the corpus, enabling researchers to extract insights, identify patterns, and discover knowledge from large textual datasets.
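The corpus statistics reported above (sentences, words, unique words) can be recomputed from a plain-text export with a few lines of Python. The sketch below is illustrative only: it assumes one sentence per line, whitespace tokenization, and lowercasing before counting, none of which is stated in the abstract, so the resulting figures may differ from the published ones. The function name `corpus_stats` and the sample sentences are hypothetical.

```python
from collections import Counter
import io


def corpus_stats(lines):
    """Count sentences, word tokens, and unique word types.

    Assumptions (not from the source): one sentence per line,
    whitespace tokenization, case-insensitive word matching.
    """
    n_sentences = 0
    vocab = Counter()
    for line in lines:
        tokens = line.lower().split()
        if not tokens:
            continue  # skip blank lines
        n_sentences += 1
        vocab.update(tokens)
    return {
        "sentences": n_sentences,
        "words": sum(vocab.values()),
        "unique_words": len(vocab),
    }


# Tiny in-memory stand-in for the corpus text file.
sample = io.StringIO(
    "Habari ya asubuhi\n"
    "Elimu ni ufunguo wa maisha\n"
    "\n"
    "Habari ya leo\n"
)
stats = corpus_stats(sample)
print(stats)
```

For the real corpus, the same function could be given a file object opened with `open(path, encoding="utf-8")`, where `path` points to the exported text file. Because the file is read line by line, the 168 MB export does not need to fit in memory at once, although the vocabulary counter does.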
