Abstract: We present a lexicon-based model for segmenting Chinese text into dictionary entries and for providing pronunciations for these words. The approach combines a matching algorithm with several heuristic rules to resolve ambiguities. It achieves an overall accuracy above 95%, which makes it an effective solution to grapheme-to-phoneme conversion for Mandarin Chinese.

Introduction

Written Chinese text consists of strings of characters with no blanks to delimit words. The first step towards word-based indexing is therefore to break a sequence of characters into words, a process called word segmentation. The word-segmentation problem cannot be bypassed, mainly because many Chinese characters are homographs whose pronunciation depends on the word they belong to.

The Problem of Word Segmentation

Word identification is difficult for several reasons. First, almost all characters are free morphemes: they can stand alone as one-character words, and they can also join other characters to form multi-character words. Second, compounding is the predominant word-formation device in modern Chinese, so it is difficult to tell whether a low-frequency compound is a word or a phrase. Third, the same pool of characters is also used to construct proper names, which makes personal names hard to identify [2].

Strategies in Word Segmentation

Existing methods for this problem can be classified into (1) purely statistical approaches [1]; (2) heuristic rule-based methods [2]; and (3) statistical approaches that incorporate lexical knowledge [3]. Many statistical methods rely on a large pre-segmented text corpus for their analysis. The easiest and most effective method is the lexicon-based algorithm with supplementary rules. Our TTS system DRESS also adopts it, with modifications to suit our system.

The paper first introduces our synthesis system.
It then presents the solution to word identification and phonetic conversion. Finally, it points out possibilities for future research.

Synthesis System

The Mandarin Chinese Text-to-Speech system developed at TU Dresden is a syllable-based waveform concatenation synthesizer. It consists of text analysis and acoustic synthesis. The acoustic synthesis is already complete and achieves high naturalness. A syllable-based inventory takes cross-syllable co-articulation into consideration [4]. A neural network is responsible for learning and modifying duration and intonation [5]. Because grapheme-to-phoneme conversion was still unsolved, word boundaries had to be inserted manually during synthesis. This paper presents a solution to word segmentation that allows the whole text-to-speech system to operate automatically.

Word Segmentation

The word-segmentation stage comprises a maximum-matching algorithm over a word lexicon, several ambiguity resolution rules, and procedures for handling time and numeral expressions and for identifying personal names.

Figure 1: Grapheme-phoneme conversion. Diagram labels: Input text (a string of Chinese characters); Word Identification; Word Lexicon; Ambiguity Resolution Rules; Time & Numerals Expressions; Name Identity; Grapheme-Phoneme Conversion; Prosodic Generation; Phonetic Sequences with Tones; Segmented Words in Characters; Word Lexicon with Phonetic Transcription.

Maximum Path-Matching

Lexicon-based word identification works by matching: the basic strategy is to match the input character string against a large set of entries stored in a pre-compiled lexicon in order to find all (or some) of the possible segmentations. A variant of maximal matching proposed in [2] takes the most plausible segmentation to be the three-word chunk with maximal length. This algorithm is adopted in our system.

CFA/DAGA'04, Strasbourg, 22-25/03/2004
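The three-word-chunk variant of maximal matching described under Maximum Path-Matching can be sketched as follows. This is a minimal illustration under stated assumptions, not the DRESS implementation: the toy lexicon, the pinyin notation with tone numbers, the function names `candidates`, `best_chunk`, `segment`, and `transcribe`, and the tie-breaking rule (prefer the chunk with fewer words when total lengths tie) are all hypothetical. The ambiguity resolution rules, the handling of time and numeral expressions, and name identification are omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon (assumption): word -> pinyin with tone numbers.
LEXICON: Dict[str, str] = {
    "银行": "yin2 hang2",
    "行": "xing2",
    "人": "ren2",
    "行人": "xing2 ren2",
    "很": "hen3",
    "多": "duo1",
    "很多": "hen3 duo1",
}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def candidates(text: str, start: int) -> List[str]:
    """Return all lexicon words beginning at `start`.

    If no entry matches, fall back to the single character so that
    segmentation can always advance.
    """
    found = [
        text[start:start + n]
        for n in range(1, MAX_WORD_LEN + 1)
        if start + n <= len(text) and text[start:start + n] in LEXICON
    ]
    return found or [text[start:start + 1]]


def best_chunk(text: str, start: int) -> Tuple[str, ...]:
    """Enumerate chunks of up to three words and return the longest one."""
    chunks: List[Tuple[str, ...]] = []

    def extend(pos: int, chunk: Tuple[str, ...]) -> None:
        # Stop at three words or at the end of the text.
        if len(chunk) == 3 or pos == len(text):
            chunks.append(chunk)
            return
        for word in candidates(text, pos):
            extend(pos + len(word), chunk + (word,))

    extend(start, ())
    # Maximal total length wins; on ties prefer fewer words (assumption).
    return max(chunks, key=lambda c: (sum(len(w) for w in c), -len(c)))


def segment(text: str) -> List[str]:
    """Commit the first word of the best chunk, then repeat from there."""
    words: List[str] = []
    pos = 0
    while pos < len(text):
        first = best_chunk(text, pos)[0]
        words.append(first)
        pos += len(first)
    return words


def transcribe(text: str) -> List[Tuple[str, str]]:
    """Look up each segmented word; "?" marks words missing from the lexicon."""
    return [(word, LEXICON.get(word, "?")) for word in segment(text)]


if __name__ == "__main__":
    for word, pinyin in transcribe("银行行人很多"):
        print(word, pinyin)
```

In this toy input the character 行 appears twice in a row. Once the text is segmented, the first occurrence is looked up as part of 银行 and the second as part of 行人, so the two receive different pronunciations from the lexicon. This is the kind of homograph case that makes segmentation a prerequisite for grapheme-to-phoneme conversion.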

Can Evolutionary Computation Help us to Crib the Voynich Manuscript?

The Voynich Manuscript was written in a single, natural language

Character Entropy in Modern and Historical Texts: Comparison Metrics for an Undeciphered Manuscript

Computer-Aided Modelling of the Bilingual Word Indices to the Ninth-Century Uchitel'noe evangelie

An Authentic Algorithm for Ciphering and Deciphering Called Latin Djokovic

Create distinctive databases of ancient languages and using a computer vision model to accurately recognize and classify them.

Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Phonetic and Visual Priors for Decipherment of Informal Romanization

Transcribing Medieval Manuscripts for Machine Learning

Dating ancient manuscripts using radiocarbon and AI-based writing style analysis

Toward a Period-specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts

Levenshtein OCR

Evolution of Efficient Symbolic Communication Codes

Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach

Decipherment of Historical Manuscript Images

A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration

Hebrew letters Detection and Cuneiform tablets Classification by using the yolov8 computer vision model

Quantifying Character Similarity with Vision Transformers

Grapheme-to-Phoneme Conversion in Mandarin Chinese Text-to-Speech System

Computation Model of Human Vowel System Evolution Based on Meme

Automatic transcription of 17th century English text in Contemporary English with NooJ: Method and Evaluation