BharatBhasaNet - A Unified Framework to Identify Indian Code Mix Languages
Sayantan Dey, Shivam Thakur, Akhilesh Kandwal, Rohit Kumar, Sharmistha Dasgupta, Partha Pratim Roy
DOI: https://doi.org/10.1109/access.2024.3396290
IF: 3.9
2024-05-22
IEEE Access
Abstract: In the rapidly globalizing sphere of digital communication, the need for advanced multilingual text recognition and identification is increasingly evident. In contrast to previous work, which was predominantly limited to two or three languages, this paper explores the rich linguistic diversity of India and addresses challenges in automated language processing for 12 languages. BharatBhasaNet, our comprehensive Language Identification (LID) framework, integrates an extensive dataset covering these 12 Indian languages in both native-script and romanized forms, derived from the INDICCORP, Bhasha-Abhijnaanam, and Aksharantar datasets by AI4Bharat. The framework comprises two models, Roberta-native and Roberta-Romanized, both based on the attention mechanism and the transformer architecture. With an accuracy of 99.54% on native-script text and 60.90% on romanized text, BharatBhasaNet advances language identification and provides broader language coverage than existing LIDs. It also interprets code-mixed sentences, revealing accuracy patterns related to sentence length, word span, and complexity in multilingual contexts. The framework was further tested on a real-time dataset from the National Informatics Center (NIC), where it achieved an accuracy of 92.67%. By overcoming challenges such as limited training data and the difficulty of distinguishing similar languages, BharatBhasaNet marks a significant step forward in romanized text identification within diverse linguistic landscapes.
Computer Science, Information Systems; Telecommunications; Engineering, Electrical & Electronic
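
The abstract describes two separate models, one for native-script text and one for romanized text, but does not say how an input is routed between them. The following is a minimal sketch of one possible routing step, assuming a simple character-count heuristic over Unicode code points. The names `count_scripts` and `choose_model`, the labels it returns, and the code point range 0x0900 to 0x0D7F (Devanagari through Malayalam) are illustrative assumptions, not the authors' method, and the range may not cover every script among the 12 languages.

```python
from typing import Tuple

# Assumption: Unicode blocks from Devanagari (U+0900) through Malayalam (U+0D7F).
# Scripts outside this range (for example Arabic-based scripts) are not counted.
INDIC_START, INDIC_END = 0x0900, 0x0D7F


def count_scripts(text: str) -> Tuple[int, int]:
    """Return (indic_count, latin_count) for the characters in text."""
    indic = sum(1 for ch in text if INDIC_START <= ord(ch) <= INDIC_END)
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return indic, latin


def choose_model(text: str) -> str:
    """Pick a hypothetical model label based on the dominant script."""
    indic, latin = count_scripts(text)
    if indic == 0 and latin == 0:
        return "undetermined"  # digits, punctuation, or uncovered scripts only
    return "native" if indic >= latin else "romanized"


if __name__ == "__main__":
    for sentence in ["मैं घर जा रहा हूँ", "main ghar ja raha hoon", "123 !!"]:
        print(repr(sentence), "->", choose_model(sentence))
```

A heuristic like this only selects which model to call; the actual language identification, including the handling of code-mixed sentences, would still be done by the trained models.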