Abstract: Representing words and phrases as dense vectors of real numbers that encode semantic and syntactic properties is a vital component of natural language processing (NLP). The success of neural network (NN) models in NLP relies largely on such dense word representations learned from large unlabeled corpora. Sindhi, a morphologically rich language spoken by a large population in Pakistan and India, lacks corpora, which serve as an essential test-bed for generating word embeddings and developing language-independent NLP systems. In this paper, a large corpus of more than 61 million words is developed for the low-resourced Sindhi language for training neural word embeddings. The corpus is acquired from multiple web resources using web-scrappy. Because no open-source preprocessing tools are available for Sindhi, preprocessing such a large corpus is a challenging problem, especially the cleaning of noisy data extracted from web resources. Therefore, a preprocessing pipeline is employed to filter noisy text. Afterwards, the cleaned vocabulary is used to train Sindhi word embeddings with the state-of-the-art GloVe, Skip-Gram (SG), and Continuous Bag of Words (CBoW) word2vec algorithms. The intrinsic evaluation approaches of a cosine similarity matrix and WordSim-353 are employed to evaluate the generated Sindhi word embeddings. Moreover, we compare the proposed word embeddings with the recently released Sindhi fastText (SdfastText) word representations. Our intrinsic evaluation results demonstrate the high quality of our generated Sindhi word embeddings using SG, CBoW, and GloVe compared to the SdfastText word representations.

Development of Word Embeddings for Uzbek Language

Co-occurrences using Fasttext embeddings for word similarity tasks in Urdu

Construction of an English-Uyghur WordNet Dataset

Creating a morphological and syntactic tagged corpus for the Uzbek language

Fast Extraction of Word Embedding from Q-contexts

UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language

Word representation or word embedding in Persian text

Design and Implementation of a Tool for Extracting Uzbek Syllables

UzMorphAnalyser: A Morphological Analysis Model for the Uzbek Language Using Inflectional Endings

HJ-Ky-0.1: an Evaluation Dataset for Kyrgyz Word Embeddings

Text classification dataset and analysis for Uzbek language

Uzbek text summarization based on TF-IDF

Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes

A Precisely Xtreme-Multi Channel Hybrid Approach For Roman Urdu Sentiment Analysis

Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language

Uzbek affix finite state machine for stemming

Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur

Word Embedding based New Corpus for Low-resourced Language: Sindhi

Parallel texts dataset for Uzbek-Kazakh machine translation

Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model

Field Embedding: A Unified Grain-Based Framework for Word Representation