A Study on the Appropriate Size of the Mongolian General Corpus

Choi Sun Soo, Ganbat Tsend
DOI: https://doi.org/10.5121/ijnlc.2023.12302
2023-06-27
International Journal on Natural Language Computing
Abstract: This study aims to determine the appropriate size of a Mongolian general corpus, using the Heaps’ function and the Type-Token Ratio (TTR). The sample corpus of 906,064 tokens comprises texts from 10 domains: newspaper politics, economy, society, culture, sports, and world articles; laws; middle and high school literature textbooks; interview articles; and podcast transcripts. First, we estimated the Heaps’ function from this sample corpus. Next, using the estimated function, we observed how the number of types and the TTR value changed as the number of tokens increased in steps of one million. We found that the TTR value hardly changed once the number of tokens exceeded 39-42 million. Thus, we conclude that an appropriate size for a Mongolian general corpus is 39-42 million tokens.
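
The procedure described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation. It assumes the common form of Heaps' law, V(N) = K * N^beta, where N is the number of tokens and V the number of types, and it assumes the parameters are estimated by least squares on log-transformed counts. The abstract does not state the estimation method or the fitted values, so the function names and the numbers in the usage example are hypothetical placeholders.

```python
import math


def fit_heaps(token_counts, type_counts):
    """Fit V = K * N**beta by ordinary least squares on log-log data.

    Assumption: log-log regression is one common way to estimate
    Heaps' law; the paper may have used a different procedure.
    """
    xs = [math.log(n) for n in token_counts]
    ys = [math.log(v) for v in type_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


def heaps_types(k, beta, n_tokens):
    """Predicted number of types for a corpus of n_tokens tokens."""
    return k * n_tokens ** beta


def ttr(k, beta, n_tokens):
    """Type-Token Ratio implied by the fitted Heaps' function."""
    return heaps_types(k, beta, n_tokens) / n_tokens


def ttr_table(k, beta, max_millions, step=1_000_000):
    """Rows of (tokens, predicted types, TTR, drop in TTR since previous row)."""
    rows = []
    prev = None
    for m in range(1, max_millions + 1):
        n = m * step
        value = ttr(k, beta, n)
        drop = None if prev is None else prev - value
        rows.append((n, heaps_types(k, beta, n), value, drop))
        prev = value
    return rows


if __name__ == "__main__":
    # Hypothetical (tokens, types) checkpoints; NOT the paper's data.
    sample_tokens = [100_000, 300_000, 600_000, 900_000]
    sample_types = [10.0 * n ** 0.6 for n in sample_tokens]
    k, beta = fit_heaps(sample_tokens, sample_types)
    for n, v, ratio, drop in ttr_table(k, beta, 45):
        print(n, round(v), ratio, drop)
```

With beta below 1, the TTR implied by the fitted function falls monotonically as tokens are added, and the per-million drop shrinks. The point at which that drop counts as negligible depends on a threshold that the abstract does not specify.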