Abstract: Punctuation marks play a vital role in text representation and interpretation, and are useful in enhancing the performance of modern Natural Language Processing (NLP) systems such as voice input typing aids, machine translation, and speech synthesis systems. Punctuation marks other than the period are not inherently available in Indian languages such as Tamil and Hindi. However, some modern forms of writing in these languages, such as news articles, blogs, and stories, incorporate user‐defined punctuation marks. The current work proposes an automatic punctuation prediction system for Tamil and Hindi texts using a classification approach, where punctuation prediction is treated as a multi‐class classification problem. Word‐level text features are chosen and analysed to validate their language‐dependency and their significance for punctuation prediction. A Feature‐weighted AdaBoost (FAda) classifier is proposed that defines a novel boosting factor to adjust the hypothesis weight of the weak classifiers, thereby reducing the number of false classifications. The proposed classifier is observed to outperform other classification techniques such as AdaBoost, SVM, CART, CRF, and Bi‐LSTM by a maximum difference of 50% and 16% in the macro F1‐scores for Tamil and Hindi texts, respectively. The proposed classifier performs on par with the attention‐based classifier for both Tamil and Hindi texts. Further, as a proof of concept, the proposed punctuation prediction system is applied to voice keyboard, machine translation, and speech synthesis systems to validate the effect of punctuation marks on the performance of these NLP systems.
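
To make the multi‐class framing and the macro F1 metric mentioned in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes each word is labelled with the punctuation mark that follows it (or `O` for none). The label set `PUNCT` and the helper names `to_labeled_words` and `macro_f1` are hypothetical and chosen only for illustration; the FAda boosting factor itself is not defined in the abstract and is not reproduced here.

```python
# Hypothetical label set: the abstract does not list the punctuation classes used.
# "।" is the danda, the Hindi sentence-final mark, mapped here to PERIOD.
PUNCT = {",": "COMMA", ".": "PERIOD", "।": "PERIOD", "?": "QUESTION"}


def to_labeled_words(text):
    """Turn punctuated text into (word, label) pairs.

    The label is the class of the mark attached to the end of the word,
    or "O" when the word is not followed by punctuation.
    """
    pairs = []
    for token in text.split():
        label = "O"
        if token[-1] in PUNCT:
            label = PUNCT[token[-1]]
            token = token[:-1]
        if token:  # skip tokens that were punctuation only
            pairs.append((token, label))
    return pairs


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    pairs = to_labeled_words("hello, how are you?")
    print(pairs)
    gold = [label for _, label in pairs]
    pred = ["O", "O", "O", "QUESTION"]  # a made-up prediction that misses the comma
    print(macro_f1(gold, pred))
```

Because macro F1 averages over classes without weighting by frequency, a classifier that misses a rare mark such as the comma is penalised as heavily as one that misses the far more common `O` class, which is why the metric suits the imbalanced label distribution typical of punctuation prediction.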

Self-Attention Based Model For Punctuation Prediction Using Word And Speech Embeddings

Incorporating External POS Tagger for Punctuation Restoration

Focal Loss for Punctuation Prediction

Improve Word Embedding Using Both Writing and Pronunciation

Distilling Knowledge from an Ensemble of Models for Punctuation Prediction

A Linguistically Inspired Statistical Model for Chinese Punctuation Generation

Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech

Multimodal Punctuation Prediction with Contextual Dropout

Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation

Adversarial Transfer Learning for Punctuation Restoration

Question Mark Prediction by BERT

Predicting Punctuation in Ancient Chinese Texts: A Multi-Layered LSTM and Attention-Based Approach

Automatic punctuation generation for speech

Efficient Ensemble for Multimodal Punctuation Restoration using Time-Delay Neural Network

Punctuation as implicit annotations for Chinese word segmentation

Transfer knowledge for punctuation prediction via adversarial training

Feature‐weighted AdaBoost classifier for punctuation prediction in Tamil and Hindi NLP systems

A Context-Aware Feature Fusion Framework for Punctuation Restoration

Unified Multimodal Punctuation Restoration Framework for Mixed-Modality Corpus

A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR

Punctuation Prediction for Polish Texts using Transformers