Zipf's Law and Statistical Data on Modern Tibetan.

Huidan Liu, Minghua Nuo, Jian Wu
2014-01-01
Abstract: In this paper, a large-scale modern Tibetan text corpus is built, which includes about 190 thousand documents, 67.21 million words, and 93.66 million syllables in total. Based on the corpus, statistics are computed for several language units at different granularities. The statistical data show that a syllable has 3.26 letters or 2.20 super characters on average, while a sentence has 75.40 letters or 63.14 super characters. The top 10 super characters, syllables, and words account for 66.3156%, 16.5556%, and 24.6415% of the corpus, respectively. Curves for the n-gram frequency-rank lists of super characters, syllables, and words are plotted. They show that when all the n-gram phrases for n = 1, 2, ..., 5 are put together and sorted by frequency in descending order, the frequency-rank curves on log-log axes can be fitted well by a straight line for both the syllable and the word units. For the super character unit, however, we did not find a curve that can be fitted well enough by a straight line, even when all the n-grams for n = 1, 2, ..., 10 are combined.
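The combined n-gram procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`combined_ngram_counts`, `rank_frequency`, `fit_loglog_line`) are hypothetical, the input is assumed to be an already segmented list of units (super characters, syllables, or words), n-grams are allowed to cross sentence boundaries for simplicity, and the straight line is fitted by ordinary least squares on base-10 logarithms, which the abstract does not specify.

```python
import math
from collections import Counter


def combined_ngram_counts(tokens, max_n):
    """Count all n-grams for n = 1..max_n in one shared table (assumed pooling)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_frequency(counts):
    """Return frequencies sorted in descending order; index + 1 is the rank."""
    return sorted(counts.values(), reverse=True)


def fit_loglog_line(freqs):
    """Least-squares fit of log10(frequency) against log10(rank).

    Returns (slope, intercept, r_squared). Needs at least two frequencies.
    An r_squared close to 1 means the curve is close to a straight line.
    """
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # If all frequencies are equal the fit is a perfect horizontal line.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared
```

For a syllable-level check, one would pass the corpus as a list of syllables with `max_n=5`, then feed `rank_frequency(...)` into `fit_loglog_line` and inspect the slope and `r_squared`. A slope near -1 with a high `r_squared` corresponds to the classic Zipf pattern.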