Abstract: Automatic text summarization (ATS) uses natural language processing (NLP) to produce summaries of different categories of information. For low-resource languages such as Hindi, applications of these techniques remain limited. This study proposes an extractive method for automatically summarizing Hindi documents. The approach retrieves pertinent sentences from the source documents by combining multiple linguistic features with machine learning (ML), using maximum likelihood estimation (MLE) and maximum entropy (ME). The input documents are pre-processed by removing Hindi stop words and stemming. We compute 15 linguistic feature scores for each document to identify the high-scoring phrases used to generate the summary. We evaluate the proposed summarizer on BBC News articles, CNN News, DUC 2004, the Hindi Text Short Summarization Corpus, the Indian Language News Text Summarization Corpus, and Wikipedia articles. The Hindi Text Short Summarization Corpus and the Indian Language News Text Summarization Corpus are in Hindi, whereas the BBC News, CNN News, and DUC 2004 datasets were translated into Hindi using the Google, Microsoft Bing, and Systran translators. Results are reported for both Hindi and English to compare performance on a low-resource and a resource-rich language. Evaluation uses multiple ROUGE metrics, along with precision, recall, and F-measure, which show better performance for the proposed method across multiple ROUGE scores. We compare the proposed method with supervised and unsupervised machine learning methods, including support vector machine (SVM), Naive Bayes (NB), decision tree (DT), latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and K-means clustering, and find that it outperforms them.
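To make the evaluation terms concrete, the following is a minimal sketch of how ROUGE-N precision, recall, and F-measure can be computed for a single candidate summary against a single reference. It is not the evaluation code used in the study: the function names `ngrams` and `rouge_n`, the whitespace tokenization, and the two Hindi example sentences are assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N precision, recall, and F1 for one candidate/reference pair.

    Assumption: tokens are split on whitespace, which is a simplification
    for Hindi; a real evaluation would use a proper tokenizer and stemmer.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example sentences, not taken from any of the corpora above.
reference = "भारत ने मैच जीत लिया"
candidate = "भारत ने मैच जीता"
p, r, f = rouge_n(candidate, reference, n=1)
print(f"ROUGE-1 precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this sketch, precision is the share of candidate n-grams that also occur in the reference, recall is the share of reference n-grams recovered by the candidate, and the F-measure is their harmonic mean. Passing `n=2` gives the corresponding ROUGE-2 values.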

Automatic Extractive Text Summarization using Multiple Linguistic Features