Abstract: One of the first steps in many text-based social science studies is to retrieve the documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science is to apply a set of keywords and to treat every document that contains at least one of the keywords as relevant. But applying incomplete keyword lists carries a high risk of drawing biased inferences. More complex and costly methods, such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning, have the potential to separate relevant from irrelevant documents more accurately and thereby reduce the potential size of the bias. Yet it is unclear whether these more expensive approaches increase retrieval performance compared to keyword lists at all, and if so by how much, because a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder in SSRN, 2017. 10.2139/ssrn.3026393), the Social Bias Inference Corpus (SBIC) (Sap et al. in Social bias frames: reasoning about social and power implications of language. In: Jurafsky et al. (eds) Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, p 5477-5490, 2020. 10.18653/v1/2020.acl-main.486), and the Reuters-21578 corpus (Lewis in Reuters-21578 (Distribution 1.0). [Data set], 1997. http://www.daviddlewis.com/resources/testcollections/reuters21578/). Results show that query expansion techniques and topic model-based classification rules tend to decrease rather than increase retrieval performance in most studied settings. Active supervised learning, however, if applied to a not too small set of labeled training instances (e.g. 1000 documents), reaches a substantially higher retrieval performance than keyword lists.
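
The keyword-list baseline described in the abstract can be illustrated with a minimal sketch. The toy documents, the single keyword, and the function names `keyword_retrieve` and `precision_recall_f1` are assumptions made for illustration and are not taken from the study. Measuring retrieval performance with precision, recall, and F1 is likewise an assumption here, since the abstract does not name its metrics.

```python
import re

def keyword_retrieve(documents, keywords):
    """Return indices of documents that contain at least one keyword."""
    # Whole-word, case-insensitive matching (an illustrative choice).
    patterns = [re.compile(r"\b" + re.escape(k.lower()) + r"\b") for k in keywords]
    return [
        i for i, doc in enumerate(documents)
        if any(p.search(doc.lower()) for p in patterns)
    ]

def precision_recall_f1(retrieved, relevant):
    """Compare retrieved document indices against hand-labeled relevant ones."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical toy corpus: documents 0, 1, and 3 are relevant to the topic.
docs = [
    "The refugee crisis dominated the debate.",
    "Asylum seekers arrived at the border.",      # relevant, but no keyword match
    "The football match ended in a draw.",
    "A refugee camp was opened near the city.",
]
relevant_ids = [0, 1, 3]

retrieved_ids = keyword_retrieve(docs, ["refugee"])
precision, recall, f1 = precision_recall_f1(retrieved_ids, relevant_ids)
print(retrieved_ids, precision, recall, f1)
```

In this toy case the incomplete keyword list misses the document about asylum seekers, so recall falls below 1 even though every retrieved document is relevant. This is the kind of gap that can bias downstream inferences.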

Topic Modelling of Empirical Text Corpora: Validity, Reliability, and Reproducibility in Comparison to Semantic Maps.

Can Topic Models Be Used in Research Evaluations? Reproducibility, Validity, and Reliability when Compared with Semantic Maps

Topic Discovery Based on LDA_col Model and Topic Significance Re-ranking.

Co-word Maps and Topic Modeling: A Comparison Using Small and Medium-Sized Corpora (n < 1000)

Parsimonious Topic Models with Salient Word Discovery

Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey

Agreeing to Disagree: Choosing Among Eight Topic-Modeling Methods

A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts

Topic and Keyword Re-Ranking for LDA-based Topic Modeling

LDAPrototype: a model selection algorithm to improve reliability of latent Dirichlet allocation

Expansive data, extensive model: Investigating discussion topics around LLM through unsupervised machine learning in academic papers and news

Comparison of Topic Modelling Approaches in the Banking Context

A Semantics-enhanced Topic Modelling Technique: Semantic-LDA

Topic modeling, long texts and the best number of topics. Some Problems and solutions

Topic Modelling: Going Beyond Token Outputs

A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text Spatializations

Statistical Word Sense Aware Topic Models

The early days of contemporary philosophy of science: novel insights from machine translation and topic-modeling of non-parallel multilingual corpora

A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis

LDAExplore: Visualizing Topic Models Generated Using Latent Dirichlet Allocation