Abstract: Automated measurement of student engagement equips educators with valuable insights, helping them achieve educational program objectives and tailor their approach to individual students. Engagement measurement requires a detailed analysis of students' behavioral and affective states over precise timescales. Many current techniques use sequential and spatiotemporal models, including recurrent neural networks, temporal convolutional networks, three-dimensional convolutional neural networks, and transformers, to measure engagement from video data. These models are trained to incorporate the sequential/temporal order of behavioral and affective states into the video analysis and to output the student's level of engagement. Drawing upon the definition of engagement in educational psychology, this paper questions the necessity of incorporating the order of behavioral and affective states into engagement measurement. Non-sequential bag-of-words-based models are developed to analyze behavioral and affective features extracted from videos and output engagement levels. The non-sequential models analyze only the occurrence of behavioral and affective states, not the order in which they occur. Experimental results indicate that the proposed non-sequential approach is superior to state-of-the-art sequential engagement measurement approaches. On the IIITB Online SE dataset, the proposed approach significantly improved engagement level classification accuracy, by 22% and 26% compared to the recurrent neural network and the temporal convolutional network, respectively. It also improved minority class recall and achieved a classification accuracy as high as 0.6658 on the DAiSEE dataset. In another experiment, models trained on shuffled versions of the datasets performed consistently with those trained on the original, unshuffled datasets. In the shuffled versions, behavioral and affective states within video samples were randomly permuted. These observations reinforce the notion that the order in which affective and behavioral states occur does not impact engagement measurement.
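
The permutation argument can be illustrated with a minimal sketch. It assumes each video has already been reduced to a sequence of discrete state labels; the state names, the helper `bag_of_states`, and the normalized-histogram representation are hypothetical choices made for illustration, not the paper's actual feature pipeline. The sketch only shows that an occurrence-based representation is unchanged when the states within a video are randomly permuted.

```python
import random
from collections import Counter

def bag_of_states(state_sequence, vocabulary):
    """Return the relative frequency of each state, ignoring the order of occurrence."""
    if not state_sequence:
        raise ValueError("state_sequence must not be empty")
    counts = Counter(state_sequence)
    total = len(state_sequence)
    return [counts[state] / total for state in vocabulary]

# Hypothetical state vocabulary; the real states would come from extracted features.
vocabulary = ["attentive", "looking_away", "yawning", "smiling"]

random.seed(0)
video = [random.choice(vocabulary) for _ in range(60)]  # one state label per time step

# Randomly permute the states within the video sample.
shuffled_video = video[:]
random.shuffle(shuffled_video)

original_hist = bag_of_states(video, vocabulary)
shuffled_hist = bag_of_states(shuffled_video, vocabulary)

# Both histograms are identical, so any classifier fed this
# representation sees the same input for either ordering.
print(original_hist == shuffled_hist)
```

A sequential model, by contrast, would generally receive a different input for `video` and `shuffled_video`, which is why consistent performance on shuffled data suggests that order carries little information for this task.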

Bag of states: a non-sequential approach to video-based engagement measurement
