Abstract: Phonemicization, or grapheme-to-phoneme conversion (G2P), is the process of converting a word into its pronunciation. It is an essential component of speech synthesis, speech recognition, and natural language processing. State-of-the-art deep learning (DL)-based G2P models generally give a low phoneme error rate (PER) and word error rate (WER) for high-resource languages, such as English and other European languages, but not for low-resource languages. Therefore, conventional machine learning (ML)-based G2P models that incorporate specific linguistic knowledge are preferable for low-resource languages. However, these models still perform poorly for several low-resource languages because of various issues. For instance, an Indonesian G2P model works well for roots but gives a high PER for derivatives. Most errors come from ambiguities in some roots and in derivative words containing one of four prefixes: ⟨ber⟩, ⟨meng⟩, ⟨peng⟩, and ⟨ter⟩. In this research, an Indonesian G2P model based on n-grams combined with a stemmer and phonotactic rules (NGTSP) is proposed to solve these problems. An evaluation based on 5-fold cross-validation on 50k Indonesian words shows that the proposed NGTSP gives a much lower PER (0.78%) than the state-of-the-art Transformer-based G2P model (1.14%). It also provides a much faster processing time.
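The abstract reports PER and WER without defining them. As a minimal sketch, assuming the common definitions (PER as the total phoneme-level edit distance divided by the number of reference phonemes, WER as the fraction of words whose predicted phoneme sequence is not an exact match), the two metrics could be computed as follows. The function names and the toy phoneme sequences are hypothetical illustrations and are not taken from the paper or its dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference phoneme
                            curr[j - 1] + 1,      # insert a hypothesis phoneme
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def per_and_wer(references, hypotheses):
    """Return (PER, WER) under the assumed definitions described above."""
    pairs = list(zip(references, hypotheses))
    total_errors = sum(edit_distance(r, h) for r, h in pairs)
    total_phonemes = sum(len(r) for r in references)
    wrong_words = sum(1 for r, h in pairs if r != h)
    return total_errors / total_phonemes, wrong_words / len(references)


# Hypothetical toy data: two words, one substitution error in the first.
refs = [["b", "ə", "r", "a", "t"], ["m", "ə", "ŋ", "a", "p", "a"]]
hyps = [["b", "e", "r", "a", "t"], ["m", "ə", "ŋ", "a", "p", "a"]]
per, wer = per_and_wer(refs, hyps)
```

With this toy input there is one phoneme error among 11 reference phonemes and one wrong word out of two, so `per` is 1/11 and `wer` is 0.5. This shows why PER is normally much lower than WER on the same predictions: a single wrong phoneme makes the whole word count as an error.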

SUSTEM: An Improved Rule-Based Sundanese Stemmer

Stemming via distribution-based word segregation for classification and retrieval

UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language

An improved Urdu stemming algorithm for text mining based on multi-step hybrid approach

Rule Based Stemmer in Urdu

Morphological Disambiguation from Stemming Data

Building a Multilevel Inflection Handling Stemmer to Improve Search Effectiveness for Urdu Language

N-gram Statistical Stemmer for Bangla Corpus

A new hybrid stemming algorithm for Persian

Stemmer and phonotactic rules to improve n-gram tagger-based Indonesian phonemicization

XSTEM: An exemplar-based stemming algorithm

A WSD Model for Corpus Construction

Man-Machine Speech Communication

Looks can be Deceptive: Distinguishing Repetition Disfluency from Reduplication

Fixing the Infix: Unsupervised Discovery of Root-and-Pattern Morphology

Overview of Stemming Algorithms for Indian and Non-Indian Languages

Building a Word Segmenter for Sanskrit Overnight

CBAS: context based Arabic stemmer

Reduce Indonesian Vocabularies with an Indonesian Sub-word Separator

Normalized dataset for Sanskrit word segmentation and morphological parsing

Stemmers for Tamil Language: Performance Analysis