Abstract: Phonemicization, or grapheme-to-phoneme conversion (G2P), is the process of converting a word into its pronunciation. It is an essential component of speech synthesis, speech recognition, and natural language processing. State-of-the-art deep learning (DL)-based G2P models generally give a low phoneme error rate (PER) and word error rate (WER) for high-resource languages, such as English and other European languages, but not for low-resource languages. Conventional machine learning (ML)-based G2P models that incorporate specific linguistic knowledge are therefore preferable for low-resource languages. However, these models still perform poorly for several low-resource languages because of various issues. For instance, an Indonesian G2P model works well for roots but gives a high PER for derivatives. Most errors come from ambiguities in some roots and in derivative words containing one of four prefixes: 〈ber〉, 〈meng〉, 〈peng〉, and 〈ter〉. In this research, an Indonesian G2P model based on n-grams combined with a stemmer and phonotactic rules (NGTSP) is proposed to solve those problems. An evaluation based on 5-fold cross-validation, using 50 k Indonesian words, shows that the proposed NGTSP gives a PER of 0.78%, much lower than that of the state-of-the-art Transformer-based G2P model (1.14%). It also offers a much shorter processing time.
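To make the two metrics in the abstract concrete, the following is a minimal sketch of how PER and WER are commonly computed for a G2P model. It assumes that PER is the total Levenshtein distance between predicted and reference phoneme sequences divided by the total number of reference phonemes, and that WER is the fraction of words with at least one phoneme error. The abstract does not state the exact definitions, so these are assumptions. The function names and the toy phoneme sequences are hypothetical and are not verified Indonesian transcriptions.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def per_and_wer(pairs: List[Tuple[Sequence[str], Sequence[str]]]) -> Tuple[float, float]:
    """Return (PER, WER) for a list of (reference, hypothesis) pairs.

    Assumed definitions (not taken from the paper):
      PER = sum of edit distances / total number of reference phonemes
      WER = number of words with any phoneme error / number of words
    """
    total_phonemes = sum(len(ref) for ref, _ in pairs)
    phoneme_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    word_errors = sum(1 for ref, hyp in pairs if list(ref) != list(hyp))
    return phoneme_errors / total_phonemes, word_errors / len(pairs)


# Toy data: illustrative symbols only.
toy_pairs = [
    (["b", "ə", "r", "u", "a", "ŋ"], ["b", "ə", "r", "u", "a", "ŋ"]),            # exact match
    (["m", "ə", "ŋ", "a", "j", "a", "r"], ["m", "e", "ŋ", "a", "j", "a", "r"]),  # one substitution
]
per, wer = per_and_wer(toy_pairs)
print(f"PER = {per:.4f}, WER = {wer:.4f}")
```

Under these definitions a single wrong phoneme counts once toward PER but makes the whole word count as an error for WER, which is why WER is normally the larger of the two figures.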

Stemmer and phonotactic rules to improve n-gram tagger-based Indonesian phonemicization
