Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition

Yang Wang, Haiyang Mei, Qirui Bao, Ziqi Wei, Mike Zheng Shou, Haizhou Li, Bo Dong, Xin Yang
2024-06-20
Abstract: We introduce a novel multimodality synergistic knowledge distillation scheme tailored for efficient single-eye emotion recognition tasks. This method allows a lightweight, unimodal student spiking neural network (SNN) to extract rich knowledge from an event-frame multimodal teacher network. The core strength of this approach is its ability to utilize the ample, coarser temporal cues found in conventional frames for effective emotion recognition. Consequently, our method adeptly interprets both temporal and spatial information from the conventional frame domain, eliminating the need for specialized sensing devices such as event-based cameras. The effectiveness of our approach is thoroughly demonstrated using both existing and our compiled single-eye emotion recognition datasets, achieving unparalleled performance in accuracy and efficiency over existing state-of-the-art methods.
Computer Vision and Pattern Recognition, Neural and Evolutionary Computing
### What problem does this paper attempt to solve?

This paper addresses how to perform single-eye emotion recognition efficiently and accurately in virtual reality (VR) and augmented reality (AR) applications. Specifically, it focuses on the following challenges:

1. **Limitations of traditional facial expression recognition**:
   - VR/AR devices are usually fixed on the user's face and block most of the facial area, so traditional emotion recognition methods based on facial action units perform poorly.
   - Existing eye-based emotion recognition methods often rely on personalized initialization or need to capture the peak stage of an emotion, which limits their practicality.
2. **Cost and technical limitations of event cameras**:
   - Event cameras perform well in low-light and high-dynamic-range environments and offer high temporal resolution, but their high cost and relatively immature technology make them difficult to deploy widely in VR/AR devices.
3. **Performance improvement of lightweight models**:
   - The open question is how to transfer the knowledge of a complex multimodal teacher network to a lightweight unimodal student network through knowledge distillation, so that emotion recognition stays efficient while accuracy and robustness remain high.

### Solutions

To solve these problems, the paper proposes a new Multimodality Synergistic Knowledge Distillation (MSKD) framework with the following components:

- **Multimodal teacher network**: Train a complex multimodal teacher network on event data and intensity frames so that it can extract rich spatio-temporal information.
- **Lightweight student network**: Design a lightweight unimodal student spiking neural network (SNN) that uses only conventional intensity frames for inference, eliminating the need for expensive event cameras.
- **Synergistic knowledge distillation loss**: Introduce two novel consistency losses, Hit Consistency and Temporal Consistency, to keep the student network's prediction distribution closely aligned with the teacher's, both at the timestamps where the prediction is correct and across all timestamps.

With these components, the paper achieves efficient and accurate single-eye emotion recognition that is suitable for resource-constrained devices, and it reports performance exceeding existing state-of-the-art methods on existing datasets.

### Summary

The main contribution of this paper is a new framework that lets a lightweight student network extract rich knowledge from a multimodal teacher network, achieving efficient and accurate single-eye emotion recognition with only a conventional camera. This could improve user experience and emotional interaction in VR/AR applications.
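As a rough illustration of the two consistency losses listed under Solutions, the sketch below shows one way such terms could be computed from per-timestep class logits. It is an interpretation under stated assumptions, not the paper's formulation: this summary does not give the actual loss definitions, so the use of KL divergence, the reading of a "hit" as a timestep where the teacher's top class equals the ground-truth label, and the names `temperature`, `alpha`, and `beta` are all hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one list of class logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def temporal_consistency(teacher_logits, student_logits, temperature=2.0):
    """Assumed form: mean KL(teacher || student) over ALL timesteps."""
    per_step = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_step) / len(per_step)


def hit_consistency(teacher_logits, student_logits, label, temperature=2.0):
    """Assumed form: the same KL term, restricted to timesteps where the
    teacher's top class equals the ground-truth label (a "hit")."""
    hits = [
        (t, s)
        for t, s in zip(teacher_logits, student_logits)
        if argmax(t) == label
    ]
    if not hits:
        return 0.0  # the teacher is never correct, so there is nothing to match
    return temporal_consistency(
        [t for t, _ in hits], [s for _, s in hits], temperature
    )


def distillation_loss(teacher_logits, student_logits, label, alpha=1.0, beta=1.0):
    """Cross-entropy on the student's time-averaged logits plus the two
    consistency terms. alpha and beta are hypothetical weights."""
    num_steps = len(student_logits)
    num_classes = len(student_logits[0])
    mean_logits = [
        sum(step[c] for step in student_logits) / num_steps
        for c in range(num_classes)
    ]
    cross_entropy = -math.log(softmax(mean_logits)[label] + 1e-12)
    hit = hit_consistency(teacher_logits, student_logits, label)
    temporal = temporal_consistency(teacher_logits, student_logits)
    return cross_entropy + alpha * hit + beta * temporal


# Toy example: 3 timesteps, 3 emotion classes, ground-truth class 0.
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [3.0, 0.1, 0.2]]
student = [[1.0, 0.8, 0.2], [0.1, 0.9, 0.5], [0.5, 2.0, 0.1]]
print(distillation_loss(teacher, student, label=0))
```

In this toy setup the teacher is correct at the first and third timesteps, so only those two contribute to the hit term, while all three contribute to the temporal term. In practice the logits would come from the trained multimodal teacher and the frame-only student SNN, and the loss would be computed on tensors in a deep learning framework rather than on Python lists.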