Abstract: Automatic language identification (LID), the task of identifying the language spoken in an audio source, is a difficult problem in speech processing. Short audio segments are particularly challenging because they contain less contextual information and fewer distinguishing features than longer samples, leaving the model with less data to analyse. This research addresses that challenge, aiming to develop more robust and versatile LID systems that operate effectively even with minimal input. A further objective is to identify Indian languages accurately and efficiently from short-duration audio segments using convolutional neural networks (CNNs) and spectrogram representations in Python. The methodology involves several key steps. First, the audio data is pre-processed to normalize the signals and reduce noise, ensuring consistency across the dataset. Next, the audio signals are converted into spectrograms, which depict the frequency spectrum visually over time and capture both the temporal and frequency characteristics essential for language discrimination. A CNN is then built and trained on these spectrograms, with an architecture designed to extract significant features from them. The system's performance is evaluated on a custom dataset of three Indian languages: Hindi, Tamil, and Malayalam. The experimental findings show that the CNN-based model attains an accuracy of 98.9%, surpassing the performance of existing models. The proposed method has potential applications in areas such as automatic speech recognition and speaker identification, where accurate and efficient language identification is crucial.
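The first two steps of the methodology (signal normalization and conversion to a spectrogram) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the frame length, hop size, window, or sampling rate, so the values below (400-sample Hann frames with a 160-sample hop at an assumed 16 kHz) are illustrative assumptions, as are the helper names `peak_normalize` and `log_spectrogram`. Noise reduction and the CNN itself are omitted, and a synthetic tone stands in for a short speech clip.

```python
import numpy as np


def peak_normalize(signal, eps=1e-9):
    """Remove the DC offset and scale the waveform to a peak amplitude of about 1."""
    signal = np.asarray(signal, dtype=np.float64)
    signal = signal - signal.mean()
    return signal / (np.max(np.abs(signal)) + eps)


def log_spectrogram(signal, frame_len=400, hop=160, eps=1e-10):
    """Return a log-magnitude spectrogram of shape (n_frames, frame_len // 2 + 1).

    frame_len and hop are assumed values (25 ms and 10 ms at 16 kHz),
    not parameters taken from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        # Zero-pad very short clips so at least one full frame exists.
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Magnitude of the one-sided FFT of each frame, compressed with a log.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + eps)


# Synthetic 1-second, 440 Hz tone with a small DC offset (assumed 16 kHz rate),
# used here only as a placeholder for a short-duration audio segment.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
clip = 0.3 * np.sin(2 * np.pi * 440 * t) + 0.05

spec = log_spectrogram(peak_normalize(clip))
print(spec.shape)  # (time frames, frequency bins)
```

The resulting two-dimensional array (time frames by frequency bins) is the kind of image-like input a CNN classifier would then be trained on, typically after adding a channel dimension.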
