Automated Evaluation of Classroom Instructional Support with LLMs and BoWs: Connecting Global Predictions to Specific Feedback

Jacob Whitehill, Jennifer LoCasale-Crouch
2024-04-17
Abstract: With the aim of providing teachers with more specific, frequent, and actionable feedback about their teaching, we explore how Large Language Models (LLMs) can be used to estimate "Instructional Support" domain scores of the CLassroom Assessment Scoring System (CLASS), a widely used observation protocol. We design a machine learning architecture that uses zero-shot prompting of Meta's Llama2 and/or a classic Bag of Words (BoW) model to classify individual utterances of teachers' speech (transcribed automatically using OpenAI's Whisper) for the presence of Instructional Support. These utterance-level judgments are then aggregated over a 15-min observation session to estimate a global CLASS score. Experiments on two CLASS-coded datasets of toddler and pre-kindergarten classrooms indicate that (1) automatic CLASS Instructional Support estimation accuracy using the proposed method (Pearson $R$ up to $0.48$) approaches human inter-rater reliability (up to $R=0.55$); (2) LLMs generally yield slightly greater accuracy than BoW for this task, though the best models often combine features extracted from both LLM and BoW; and (3) for classifying individual utterances, there is still room for improvement of automated methods compared to human-level judgments. Finally, (4) we illustrate how the model's outputs can be visualized at the utterance level to provide teachers with explainable feedback on which utterances were most positively or negatively correlated with specific CLASS dimensions.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper aims to give teachers more specific, frequent, and actionable feedback on their teaching. Specifically, the authors explore how large language models (LLMs) can estimate scores in the "Instructional Support" domain of the Classroom Assessment Scoring System (CLASS), so that teachers can receive automated feedback on their teaching. The feedback covers both the overall assessment of a session and individual utterances within it, helping teachers see which aspects need improvement.

### Main Problems

1. **Providing More Specific Feedback**: Traditional manual evaluations usually provide only general summaries and lack detailed analysis of specific teaching interactions. An automated method can identify individual teaching utterances and indicate which ones contribute to or detract from teaching quality.
2. **Increasing the Frequency of Feedback**: Currently, teachers receive feedback from principals or senior colleagues only a few times a year. An automated system can make feedback more frequent and can even generate it immediately after each lesson.
3. **Ensuring the Objectivity and Consistency of Feedback**: Manual evaluations can differ from rater to rater. An automated system applies a standardized approach, which can reduce such differences and improve the consistency of evaluation.

### Solutions

The authors designed a machine-learning architecture that uses zero-shot prompting of Meta's Llama2 and/or a classic Bag-of-Words (BoW) model to classify teachers' utterances for the presence of Instructional Support. The steps are as follows:

- **Automatic Transcription**: Use OpenAI's Whisper automatic speech recognition to transcribe classroom recordings into text.
- **Feature Extraction**: Use Llama2 or the BoW model to analyze each utterance and determine whether it exhibits the Instructional Support behavioral indicators defined by CLASS.
- **Aggregation and Regression**: Aggregate these utterance-level judgments and estimate the CLASS score for the entire 15-minute observation session with a linear regression model.

### Experimental Results

On two CLASS-coded datasets of toddler and pre-kindergarten classrooms, the correlation between the estimated CLASS "Instructional Support" domain scores and human raters' scores reaches up to 0.48 (Pearson correlation coefficient), approaching human inter-rater reliability (up to 0.55). LLMs are generally slightly more accurate than BoW models on this task, but the best models often combine features from both. For classifying individual utterances, automated methods still fall short of human-level judgments.

### Application Prospects

This research shows how artificial intelligence can provide teachers with more specific, frequent, and actionable teaching feedback, supporting their professional development and improving teaching quality. By visualizing the model output at the utterance level, teachers can see which utterances were most positively or negatively correlated with specific CLASS dimensions, which makes the feedback more explainable.
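
The aggregation-and-regression step described under Solutions can be illustrated with a minimal sketch. It is not the authors' implementation: the single feature used here (the fraction of utterances flagged as showing Instructional Support), the helper names `session_feature`, `fit_line`, and `pearson_r`, and all data values are assumptions made up for illustration. The summary does not specify the actual features or regression setup.

```python
from statistics import mean


def session_feature(flags):
    """Fraction of a session's utterances flagged (1) as showing Instructional Support."""
    return sum(flags) / len(flags) if flags else 0.0


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up toy data (hypothetical): one list of 0/1 utterance-level judgments per
# 15-minute session, paired with a made-up human-coded CLASS score.
sessions = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [1, 1, 1, 1]]
human_scores = [2.0, 3.5, 4.0, 6.0]

# Aggregate utterance-level judgments into one feature per session,
# then regress the human scores on that feature.
features = [session_feature(f) for f in sessions]
slope, intercept = fit_line(features, human_scores)
predicted = [slope * x + intercept for x in features]

# Agreement between predicted and human scores, measured as Pearson R.
r = pearson_r(predicted, human_scores)
```

For brevity, this sketch fits and scores on the same sessions. A real evaluation would measure Pearson R on sessions held out from fitting.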