Abstract: Tibetan is a low-resource language with few existing electronic reference materials. The goal of Tibetan sentence boundary disambiguation (SBD) is to segment long text into sentences, which is the foundation for building corpora for downstream tasks. This study performs Tibetan SBD at the syllable level so that word segmentation (WS) errors do not affect SBD accuracy. Specifically, an attention mechanism is added to a recurrent neural network (RNN). The primary objective is to determine, using a trained model, whether each shad in Tibetan text marks the end of a sentence; experiments with syllable embedding and component embedding measure the model's performance. The highest accuracy for Tibetan syllable embedding and component embedding is 96.23% and 95.40%, respectively, and the F1 score reaches 96.23% and 95.37%, respectively. The experimental results demonstrate that the proposed method achieves better results than the established rule-based and statistical methods without relying on syntactic or part-of-speech (POS) tagging rules. The model is also evaluated on German and English data from the Europarl corpus and Thai data from the IWSLT2015 corpus to assess its reliability and generalizability. The results demonstrate that the method is effective not only for low-resource languages but also for high-resource languages. More importantly, the experimental results of this study can be applied to research on downstream tasks, such as machine translation and automatic summarization.
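The syllable-level framing described in the abstract can be illustrated with a minimal sketch of the input preparation step only: splitting text into syllables on the tsheg (U+0F0B) and collecting a context window around each shad (U+0F0D) as a candidate sentence boundary. The function names, the window size, and the padding token are hypothetical choices for illustration; the abstract does not specify how the model's inputs are built, and the attention-based RNN classifier itself is not shown.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # mark that may or may not end a sentence


def tokenize_syllables(text):
    """Split text into syllables, keeping every shad as its own token."""
    tokens = []
    current = ""
    for ch in text:
        if ch == TSHEG or ch == SHAD or ch.isspace():
            # Any delimiter closes the syllable collected so far.
            if current:
                tokens.append(current)
            current = ""
            if ch == SHAD:
                tokens.append(SHAD)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


def shad_windows(tokens, size=3, pad="<pad>"):
    """Return (index, left_context, right_context) for each shad token.

    `size` and `pad` are illustrative assumptions, not values from the paper.
    """
    windows = []
    for i, tok in enumerate(tokens):
        if tok != SHAD:
            continue
        left = tokens[max(0, i - size):i]
        right = tokens[i + 1:i + 1 + size]
        # Pad so every candidate has fixed-length context on both sides.
        left = [pad] * (size - len(left)) + left
        right = right + [pad] * (size - len(right))
        windows.append((i, left, right))
    return windows


if __name__ == "__main__":
    sample = "བཀྲ་ཤིས་བདེ་ལེགས།"
    toks = tokenize_syllables(sample)
    for index, left, right in shad_windows(toks, size=2):
        print(index, left, right)
```

Each window would then be mapped to syllable (or component) embeddings and passed to a binary classifier that decides whether the shad ends a sentence. Working on syllables in this way requires no word segmenter, which is the motivation the abstract gives for the syllable-level design.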

Sentence Boundary Disambiguation for Tibetan Based on Attention Mechanism at the Syllable Level.
