Engagement Detection in Online Learning Based on Pre-trained Vision Transformer and Temporal Convolutional Network

Zhang Hang, Fu Youping, Meng Jun
DOI: https://doi.org/10.1109/ccdc62350.2024.10588350
2024-01-01
Abstract: This paper introduces a novel method for engagement detection that combines a pre-trained vision transformer (ViT) with a temporal convolutional network (TCN). The ViT is first pre-trained on the ImageNet-1k (ILSVRC2012) dataset to extract spatial features and is then fine-tuned on the ICCVW Frame Engagement Annotations. The objective is to determine student engagement from videos capturing their classroom interactions. For each frame, the fine-tuned ViT extracts spatial features, which are passed to the TCN for temporal feature extraction; the resulting model outputs an assessment of the student's engagement level for the video. The model is trained and evaluated on the DAiSEE dataset, a collection of student engagement recordings, and the class imbalance inherent in DAiSEE is addressed with a weighted cross-entropy loss function. Experimental results show that the proposed approach is effective at detecting engagement in online learning.
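The weighted cross-entropy loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the class weights are chosen, so the inverse-frequency rule below is an assumption, as are the function names and the toy label list; none of them come from the paper or from the real DAiSEE label distribution.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Assumed weighting rule: w[c] = N / (K * count[c]); absent classes get 0."""
    counts = Counter(labels)
    n = len(labels)
    return [
        n / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of -log softmax(logits)[y], normalised by the sum of w[y]."""
    total, weight_sum = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max logit for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += weights[y] * (log_z - row[y])  # w[y] * (-log p[y])
        weight_sum += weights[y]
    return total / weight_sum


# Hypothetical, deliberately imbalanced labels over four engagement levels.
toy_labels = [0] + [1] * 2 + [2] * 5 + [3] * 4
toy_weights = inverse_frequency_weights(toy_labels, num_classes=4)

# Placeholder logits: one row of per-class scores per video.
toy_logits = [[0.0, 0.0, 0.0, 0.0] for _ in toy_labels]
toy_loss = weighted_cross_entropy(toy_logits, toy_labels, toy_weights)
```

Under this rule, rare classes receive larger weights, so misclassifying an under-represented engagement level costs more than misclassifying a common one. Dividing by the sum of the applied weights keeps the loss on the same scale as an unweighted mean.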