Facial Instance Learning for Video-based ASD Diagnosis

Wenhao Duan, Jing Li, Gaoxiang Ouyang
DOI: https://doi.org/10.1109/m2vip62491.2024.10746003
2024-01-01
Abstract: Autism Spectrum Disorder (ASD) is a common neurodevelopmental disorder marked by social interaction difficulties and repetitive, stereotyped behaviors. Previous studies have successfully applied facial expression recognition to ASD classification. However, these studies typically design a dedicated framework to explore predefined facial patterns in video data, an approach that tends to ignore subtle but potentially vital information embedded in facial movements. To address this limitation and fully leverage the richness of facial data, we design the facial instance learning (FIL) network, a video classification network based on weakly supervised learning. First, the FIL network uses multi-instance learning to analyze video sequences efficiently without relying on frame-by-frame annotations. Second, we employ an R3D18 backbone network to capture both short-term facial dynamics and spatial details of video sequences. Finally, we design a spatiotemporal aggregation module to learn long-term temporal relationships and spatial features. Experiments on the DERW dataset show that the FIL network is competitive in dynamic facial expression recognition and performs surprisingly well on ASD diagnosis, achieving an accuracy of 98.72%.
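The multi-instance idea in the abstract can be illustrated with a minimal sketch: a video is treated as a bag of short clips (instances), each clip is mapped to a feature vector, and the clip features are aggregated into one video-level prediction that is trained against a single video-level label. The abstract does not specify how the FIL aggregation module works, so the sketch below assumes a generic attention-based pooling instead; it is not the authors' method. The function names (`split_into_instances`, `attention_pool`, `classify_bag`), the clip length and stride, and the toy feature vectors are all hypothetical, and the clip features stand in for what an R3D18 backbone would produce.

```python
import math
from typing import List, Sequence, Tuple


def split_into_instances(num_frames: int, clip_len: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, one per clip (instance) in the bag."""
    if num_frames < clip_len:
        # Video shorter than one clip: treat the whole video as a single instance.
        return [(0, num_frames)]
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, stride)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instance_feats: Sequence[Sequence[float]],
                   attn_vector: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Aggregate instance features into one bag feature.

    Assumption: a single linear scoring vector followed by softmax, which is a
    generic attention-based MIL pooling, not the paper's aggregation module.
    """
    scores = [sum(f * w for f, w in zip(feat, attn_vector)) for feat in instance_feats]
    weights = softmax(scores)
    dim = len(instance_feats[0])
    bag = [sum(w * feat[d] for w, feat in zip(weights, instance_feats)) for d in range(dim)]
    return bag, weights


def classify_bag(bag_feat: Sequence[float], clf_weights: Sequence[float], bias: float) -> float:
    """Video-level probability from a logistic classifier on the bag feature."""
    logit = sum(f * w for f, w in zip(bag_feat, clf_weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy usage: a 64-frame video cut into non-overlapping 16-frame clips.
clips = split_into_instances(num_frames=64, clip_len=16, stride=16)

# Hypothetical 3-dimensional clip features (a real backbone would produce these).
feats = [[0.1, 0.0, 0.2], [0.9, 0.8, 0.1], [0.2, 0.1, 0.0], [0.1, 0.2, 0.1]]
bag_feat, attn = attention_pool(feats, attn_vector=[2.0, 2.0, 0.0])
prob = classify_bag(bag_feat, clf_weights=[1.0, 1.0, -0.5], bias=-0.5)
```

In this toy run the second clip receives the largest attention weight because its features score highest under the assumed scoring vector. Only the video-level output `prob` would be compared with a label during training, which is why no frame-by-frame annotation is needed.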