Abstract: This article presents the results of a series of experiments on the model fitting process in topic modeling and on its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and technological advances that have facilitated the development of techniques to automate data coding and analysis. In topic modeling, different procedures have been developed to analyze ever larger collections of texts, namely corpora, but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. The experiments start from the following consideration: in Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), model fitting is a crucial problem because the LDA algorithm requires the number of topics to be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter that affects the results of the analysis. Since experiments on long texts are lacking, this article tries to shed new light on the complex relationship between text length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the size of the analyzed sample, and we formulate it as a mathematical model.
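To make the notion of a power-law relation concrete, the following minimal sketch shows one common way such a relation can be estimated: ordinary least squares on log-transformed values. It assumes a relation of the generic form K = a * N^b, where K is the optimal number of topics and N the sample size. The function name `fit_power_law`, the symbols `a` and `b`, and the data points are all hypothetical; the points are synthetic values generated from a known curve, not results from the article, and the abstract does not state which fitting procedure or coefficients the authors use.

```python
import math


def fit_power_law(sizes, topics):
    """Estimate a and b in topics = a * sizes**b by least squares on log-log values."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(k) for k in topics]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic, noise-free illustration only: points generated from K = 2 * N**0.5.
# These are NOT the article's data or coefficients.
sizes = [100, 400, 1600, 6400]
topics = [2 * n ** 0.5 for n in sizes]

a, b = fit_power_law(sizes, topics)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real measurements the points would scatter around the line, and the quality of the fit would need to be checked before reading the exponent as evidence of a power law.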

Co-Word Maps and Topic Modeling: A Comparison Using Small and Medium-Sized Corpora (N < 1,000)

Mining Coherent Topics in Documents Using Word Embeddings and Large-Scale Text Data

Topic Modelling of Empirical Text Corpora: Validity, Reliability, and Reproducibility in Comparison to Semantic Maps

Parsimonious Topic Models with Salient Word Discovery

Topic Modeling Using Distributed Word Embeddings

Analyses of Multi-collection Corpora via Compound Topic Modeling

Can Topic Models Be Used in Research Evaluations? Reproducibility, Validity, and Reliability when Compared with Semantic Maps

A comparison of citation-based clustering and topic modeling for science mapping

Mitigating Data Sparsity for Short Text Topic Modeling by Topic-Semantic Contrastive Learning

Graph-based Multimodal Topic Modeling with Word Relations and Object Relations

The Semantic Mapping of Words and Co-Words in Contexts

Using Topic Modeling for Code Discovery in Large Scale Text Data

Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling

A Correlated Topic Model Using Word Embeddings

Topic modeling, long texts and the best number of topics. Some Problems and solutions

Computer-Assisted Text Analysis for Social Science: Topic Models and Beyond

Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings

No Longer Lost in Translation: Evidence that Google Translate Works for Comparative Bag-of-Words Text Applications

The Geometric Structure of Topic Models

Intensity of Relationship Between Words: Using Word Triangles in Topic Discovery for Short Texts