Using Text Embeddings for Deductive Qualitative Research at Scale in Physics Education

Tor Ole B. Odden, Halvor Tyseng, Jonas Timmann Mjaaland, Markus Fleten Kreutzer, Anders Malthe-Sørenssen
2024-02-28
Abstract: We propose a technique for performing deductive qualitative data analysis at scale on text-based data. Using a natural language processing technique known as text embeddings, we create vector-based representations of texts in a high-dimensional meaning space within which it is possible to quantify differences as vector distances. To apply the technique, we build off prior work that used topic modeling via Latent Dirichlet Allocation to thematically analyze 18 years of the Physics Education Research Conference proceedings literature. We first extend this analysis through 2023. Next, we create embeddings of all texts and, using representative articles from the 10 topics found by the LDA analysis, define centroids in the meaning space. We calculate the distances between every article and centroid and use the inverted, scaled distances between these centroids and articles to create an alternate topic model. We benchmark this model against the LDA model results and show that this embeddings model recovers most of the trends from that analysis. Finally, to illustrate the versatility of the method we define 8 new topic centroids derived from a review of the physics education research literature by Docktor and Mestre (2014) and re-analyze the literature using these researcher-defined topics. Based on these analyses, we critically discuss the features, uses, and limitations of this method and argue that it holds promise for flexible deductive qualitative analysis of a wide variety of text-based data that avoids many of the drawbacks inherent to prior NLP methods.
Physics Education; Data Analysis, Statistics and Probability
What problem does this paper attempt to address?
This paper proposes a technique for conducting deductive qualitative analysis on large-scale text data, focusing on the physics education research literature. Using text embeddings, a natural language processing technique, each text is represented as a vector in a high-dimensional semantic space, so that differences between texts can be quantified as vector distances. The researchers build on prior work that used Latent Dirichlet Allocation (LDA) topic modeling to analyze 18 years of Physics Education Research Conference proceedings, and first extend that analysis through 2023. They then create embeddings of all texts and define centroids from representative articles for the 10 topics identified by the LDA analysis. From the inverted, scaled distances between each article and these centroids they construct an alternative topic model, benchmark it against the LDA results, and show that the embeddings model recovers most of the trends from that analysis. Finally, they reanalyze the literature using 8 new topic centroids derived from Docktor and Mestre's (2014) review of the physics education research literature, illustrating the flexibility of the approach.

The main problems addressed by this paper are:
1. How can natural language processing techniques, particularly text embeddings, be used to perform deductive qualitative analysis on large-scale text data, improving efficiency and overcoming limitations of earlier NLP methods?
2. How can this embedding-based approach be evaluated and validated, and what are its advantages and limitations compared to LDA topic modeling?
3. How can the generality of the approach be demonstrated, for example by reanalyzing the literature using researcher-defined topics?

The paper demonstrates the potential of the approach for deductive analysis by extending the previous LDA analysis, defining new topic centroids, and computing distances between embeddings. It also discusses the features, uses, and limitations of the approach, offering a possible tool for qualitative analysis of large collections of text data.
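
The centroid-and-distance step described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`topic_centroids`, `cosine_distances`, `topic_weights`), the choice of cosine distance, the per-article min-max scaling, and the toy 4-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from a text embedding model, and the paper's exact distance metric and scaling may differ.

```python
import numpy as np

def topic_centroids(embeddings, representative_idx):
    """Average the embeddings of each topic's representative articles."""
    return np.stack([embeddings[idx].mean(axis=0) for idx in representative_idx])

def cosine_distances(embeddings, centroids):
    """Cosine distance between every article and every centroid (assumed metric)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return 1.0 - e @ c.T

def topic_weights(distances):
    """Invert and rescale distances so each article's weights sum to 1.

    The min-max scaling per article is an assumed choice, not taken from the paper.
    """
    d_min = distances.min(axis=1, keepdims=True)
    d_max = distances.max(axis=1, keepdims=True)
    span = np.where(d_max > d_min, d_max - d_min, 1.0)  # avoid division by zero
    inverted = 1.0 - (distances - d_min) / span          # nearest centroid -> 1
    return inverted / inverted.sum(axis=1, keepdims=True)

# Toy stand-ins for article embeddings: two loose clusters in 4 dimensions.
embeddings = np.array([
    [1.0, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.1, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.0],
    [0.1, 0.0, 0.9, 0.0],
    [0.0, 0.0, 1.0, 0.1],
])
# Hypothetical researcher-chosen representative articles for two topics.
representative_idx = [[0, 1], [3, 4]]

centroids = topic_centroids(embeddings, representative_idx)
distances = cosine_distances(embeddings, centroids)
weights = topic_weights(distances)
print(weights.round(2))
```

Each row of `weights` plays the role of an article's topic distribution, so averaging rows by publication year would give topic prevalence over time, comparable in form to LDA output. Because the centroids are defined by researcher-selected articles rather than learned from the corpus, swapping in a different set of representative articles changes the topics without re-embedding the texts.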