Topic modeling, long texts and the best number of topics. Some Problems and solutions

Stefano Sbalchiero, Maciej Eder
DOI: https://doi.org/10.1007/s11135-020-00976-w
2020-02-17
Abstract: The main aim of this article is to present the results of different experiments focused on the model-fitting process in topic modeling and its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and technological advances that have facilitated the development of techniques to automate data coding and analysis. In the field of topic modeling, different procedures have been developed to analyze larger and larger collections of texts, namely corpora, but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. Therefore, through a series of different experiments, this article is based on the following consideration: taking into account Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), the problem of model fitting is crucial because the LDA algorithm demands that the number of topics be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter which affects the analysis results. Since there is a lack of experiments applied to long texts, our article tries to shed new light on the complex relationship between text length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it in the form of a mathematical model.
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to determine the optimal number of topics when applying topic modeling to long texts. Specifically, the paper explores the relationship between text length and the optimal number of topics and tests this relationship through a series of experiments. The authors point out that, although topic modeling performs well on short texts, systematic empirical research on its application to long texts is still lacking. The paper therefore tests the optimal number of topics for text fragments of different lengths and proposes an explicit power-law model to describe the relationship.

### Main problems

1. **Applicability of topic modeling in long texts**:
   - Existing topic modeling techniques are mainly applied to short texts. How effective are they for long texts (such as books)?
   - Will the complexity and diversity of long texts affect the effectiveness of topic modeling?
2. **Determination of the optimal number of topics**:
   - How can the optimal number of topics be determined for long texts?
   - Is there a relationship between text length and the optimal number of topics?

### Solutions

1. **Experimental design**:
   - Use 100 English novels as the benchmark corpus and divide them into text fragments of different lengths (such as 500 words, 1,000 words, 5,000 words, etc.).
   - Apply the Latent Dirichlet Allocation (LDA) model and calculate the log-likelihood of the text fragments of each length under different numbers of topics to determine the optimal number of topics.
2. **Result analysis**:
   - The experimental results show that as the length of the text fragment increases, the optimal number of topics gradually decreases and tends to stabilize at a certain point.
   - Propose the Sbalchiero-Eder rule: given a corpus, the optimal number of topics is inversely related to the length of the text fragment, that is, the longer the text fragment, the smaller the optimal number of topics.
3. **Mathematical model**:
   - Describe the relationship between the optimal number of topics and the length of the text fragment through a power-law model:
     \[ y = ax^{-b} \]
     where \( y \) is the optimal number of topics, \( x \) is the number of words in the text fragment, and \( a \) and \( b \) are parameters.

### Conclusions

- The paper examines the relationship between text length and the optimal number of topics through systematic experiments and proposes an explicit power-law model.
- This finding provides a theoretical basis for topic modeling of long texts and a direction for future research, especially for further verifying the relationship in different literary genres and in texts with higher content density.

### Future research directions

- Explore how topic modeling performs on different genres (such as scientific papers and news reports) and whether the length of the text fragments needs to be adjusted.
- Further optimize the topic modeling method to improve its applicability and accuracy on long texts.
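
As an illustration of the power-law model \( y = ax^{-b} \) described above, the parameters \( a \) and \( b \) can be estimated from pairs of (fragment length, optimal number of topics). The following is a minimal sketch, not the authors' procedure: it assumes an ordinary least-squares fit on log-transformed values, since \( \log y = \log a - b \log x \) is a straight line in log-log space. The function names `fit_power_law` and `predict_topics` are hypothetical, and the data points are synthetic values generated from a known curve, not results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Estimate a and b in y = a * x**(-b) via least squares in log-log space."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx                      # slope of the log-log line equals -b
    intercept = mean_y - slope * mean_x    # intercept of the log-log line equals log(a)
    return math.exp(intercept), -slope


def predict_topics(a, b, fragment_length):
    """Evaluate the fitted model y = a * x**(-b) for a given fragment length."""
    return a * fragment_length ** (-b)


# Synthetic illustration only: points generated from a = 500, b = 0.5.
# These are NOT the values reported in the paper.
fragment_lengths = [500, 1000, 5000, 10000]
optimal_topics = [500 * x ** -0.5 for x in fragment_lengths]

a, b = fit_power_law(fragment_lengths, optimal_topics)
print(f"a = {a:.3f}, b = {b:.3f}")
print(f"predicted topics at 2,000 words: {predict_topics(a, b, 2000):.2f}")
```

Fitting in log-log space keeps the sketch dependency-free, but it weights relative rather than absolute errors; a nonlinear least-squares fit on the raw values is an equally reasonable choice and may give slightly different parameters on noisy data.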