Abstract: We propose a technique for performing deductive qualitative data analysis at scale on text-based data. Using a natural language processing technique known as text embeddings, we create vector-based representations of texts in a high-dimensional meaning space within which it is possible to quantify differences as vector distances. To apply the technique, we build off prior work that used topic modeling via Latent Dirichlet Allocation (LDA) to thematically analyze 18 years of the Physics Education Research Conference proceedings literature. We first extend this analysis through 2023. Next, we create embeddings of all texts and, using representative articles from the 10 topics found by the LDA analysis, define centroids in the meaning space. We calculate the distance between every article and every centroid and use the inverted, scaled distances to create an alternate topic model. We benchmark this model against the LDA model results and show that this embeddings model recovers most of the trends from that analysis. Finally, to illustrate the versatility of the method, we define 8 new topic centroids derived from a review of the physics education research literature by Docktor and Mestre (2014) and re-analyze the literature using these researcher-defined topics. Based on these analyses, we critically discuss the features, uses, and limitations of this method and argue that it holds promise for flexible deductive qualitative analysis of a wide variety of text-based data that avoids many of the drawbacks inherent to prior NLP methods.
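
The centroid-and-distance step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the embedding model, the distance metric, or the exact scaling, so the sketch uses random placeholder vectors in place of real embeddings, cosine distance, and a per-article min-max inversion. The function names and the `representative_ids` structure are hypothetical.

```python
import numpy as np

def topic_centroids(embeddings, representative_ids):
    """Average the embeddings of each topic's representative articles."""
    return np.stack([embeddings[ids].mean(axis=0) for ids in representative_ids])

def cosine_distances(embeddings, centroids):
    """Cosine distance between every article and every centroid (assumed metric)."""
    a = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    b = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def topic_weights(distances):
    """Invert and scale distances so each article's topic weights sum to 1.

    Assumed scheme: min-max scale per article, invert so the nearest
    centroid scores highest, then normalize across topics.
    """
    lo = distances.min(axis=1, keepdims=True)
    hi = distances.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against division by zero
    inverted = 1.0 - (distances - lo) / span
    return inverted / inverted.sum(axis=1, keepdims=True)

# Placeholder data: 6 "articles" with 8-dimensional embeddings.
# Real embeddings would come from a text embedding model.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))

# Hypothetical representative articles (row indices) for 3 topics.
representative_ids = [[0, 1], [2, 3], [4, 5]]

centroids = topic_centroids(embeddings, representative_ids)
distances = cosine_distances(embeddings, centroids)
weights = topic_weights(distances)  # shape: (articles, topics)
```

Each row of `weights` plays the role of an article's topic distribution, which could then be averaged by publication year to trace trends in the way an LDA document-topic matrix would be.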

What problem does this paper attempt to address?

This paper proposes a technique for conducting deductive qualitative analysis on large-scale text data, focusing on the physics education research literature. Using text embeddings, a natural language processing technique, the authors represent each text as a vector in a high-dimensional meaning space, so that differences between texts can be quantified as vector distances. They build on prior work that used Latent Dirichlet Allocation (LDA) topic modeling to analyze 18 years of Physics Education Research Conference (PERC) proceedings, first extending that analysis through 2023. They then create embeddings for all texts and define centroids from representative articles for the 10 topics identified by the LDA analysis. The inverted, scaled distances between each article and these centroids yield an alternate topic model, which is benchmarked against the LDA results and recovers most of the trends from that analysis. Finally, the authors re-analyze the literature using 8 new topic centroids derived from Docktor and Mestre's (2014) review of the physics education research literature, illustrating the versatility of the method.

The main problems addressed by this paper are:
1. How can natural language processing techniques, particularly text embeddings, support deductive qualitative analysis of large-scale text data while avoiding drawbacks of prior NLP methods?
2. How can this embedding-based approach be evaluated and validated, and what are its advantages and limitations compared with LDA topic modeling?
3. How general is the approach, for example when the literature is re-analyzed using researcher-defined topics?

The paper illustrates the potential of this approach for deductive analysis by extending the previous LDA analysis, defining new topic centroids, and computing distances between text embeddings. It also discusses the features, uses, and limitations of the approach, offering a possible tool for qualitative analysis of large bodies of text.

Using Text Embeddings for Deductive Qualitative Research at Scale in Physics Education

Thematic Analysis of 18 Years of PERC Proceedings using Natural Language Processing

Representing the Disciplinary Structure of Physics: A Comparative Evaluation of Graph and Text Embedding Methods

Beyond analytics: Using computer‐aided methods in educational research to extend qualitative data analysis

Topic Modelling using Latent Dirichlet Allocation (LDA) to Investigate the Latent Topics of Mathematical Creative Thinking Research in Indonesia

Quantitative analysis of large amounts of journalistic texts using topic modelling

Topic Modeling Using Distributed Word Embeddings

Exploring Technology- and Sensor-Driven Trends in Education: A Natural-Language-Processing-Enhanced Bibliometrics Study

Using Generative Text Models to Create Qualitative Codebooks for Student Evaluations of Teaching

PhysBERT: A Text Embedding Model for Physics Scientific Literature

Text Analysis of ETDs in ProQuest Dissertations and Theses (PQDT) Global (2016-2018)

Text as Data Methods for Education Research

Data Science and Machine Learning in Education

Text2PDE: Latent Diffusion Models for Accessible Physics Simulation

Quantitative approaches to content analysis: identifying conceptual drift across publication outlets

Analyzing social media data: A mixed-methods framework combining computational and qualitative text analysis

Review of Trends in Physics Education Research Using Topic Modeling

Topic Modeling over Short Texts by Incorporating Word Embeddings

A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text Spatializations

Discovering emergent connections in quantum physics research via dynamic word embeddings

On Quantifying Qualitative Geospatial Data: A Probabilistic Approach