Facial Expression Recognition in Video Using 3D-CNN Deep Features Discrimination

Sathisha G, C. K. Subbaraya, R. G K
DOI: https://doi.org/10.1109/INOCON60754.2024.10512101
2024-03-01
Abstract: The research presented in this paper focuses on improving performance in video facial expression recognition using a computationally efficient 3D convolutional neural network. An end-to-end 3D CNN method is proposed, which employs R(2+1)D ResNet18 as the backbone encoder block, followed by a spatio-temporal attention block divided into two heads for the evaluation of sparse center loss and cross-entropy loss. The spatio-temporal block adaptively refines features in both the spatial and temporal domains. Combining sparse center loss with cross-entropy loss extracts significant feature elements for enhanced discrimination in the embedding space. The proposed architecture is evaluated on three publicly available in-the-wild datasets: DFEW, AFEW, and DAiSEE. Our method required only 38.66 giga MACs for one batch prediction and achieved accuracies of 59.28%, 59.63%, and 58.62% on DFEW, AFEW, and DAiSEE, respectively. The results show that, compared to earlier works on these datasets, our method performs comparably well with a significantly lower computational load.
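The abstract names a joint objective of cross-entropy loss and sparse center loss but does not give its exact form. The following is a minimal sketch of one plausible reading, not the authors' implementation: the center term is a squared distance to the class center with each feature dimension scaled by a weight in [0, 1]. The function names, the `weights` argument (standing in for whatever the attention head would produce), and the balancing factor `lam` are all assumptions introduced for illustration.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one sample, via a numerically stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def weighted_center_loss(feature, center, weights):
    """Half the squared distance to the class center, per-dimension weighted.

    `weights` is a hypothetical stand-in for attention-derived weights;
    a weight of 0 removes that dimension from the penalty.
    """
    return 0.5 * sum(w * (f - c) ** 2
                     for f, c, w in zip(feature, center, weights))

def joint_loss(logits, feature, label, centers, weights, lam=0.01):
    """Cross-entropy plus `lam` times the weighted center term (assumed form)."""
    ce = cross_entropy(logits, label)
    center = weighted_center_loss(feature, centers[label], weights)
    return ce + lam * center
```

In this sketch, the cross-entropy term separates classes while the center term pulls each sample's embedding toward its class center, which is the usual motivation for pairing the two when more discriminative embeddings are wanted.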
Computer Science