Abstract: We introduce a novel multimodality synergistic knowledge distillation scheme tailored for efficient single-eye emotion recognition tasks. This method allows a lightweight, unimodal student spiking neural network (SNN) to extract rich knowledge from an event-frame multimodal teacher network. The core strength of this approach is its ability to utilize the ample, coarser temporal cues found in conventional frames for effective emotion recognition. Consequently, our method interprets both temporal and spatial information from the conventional frame domain, eliminating the need for specialized sensing devices such as event-based cameras. We demonstrate the effectiveness of our approach on both existing and newly compiled single-eye emotion recognition datasets, where it outperforms existing state-of-the-art methods in both accuracy and efficiency.

What problem does this paper attempt to address?

This paper addresses how to perform single-eye emotion recognition efficiently and accurately in virtual reality (VR) and augmented reality (AR) applications. Specifically, it focuses on the following challenges:

1. **Limitations of traditional facial expression recognition**:
   - VR/AR devices are usually worn on the user's face and block most of the facial area, so traditional emotion recognition methods based on facial action units perform poorly.
   - Existing eye-based emotion recognition methods often rely on personalized initialization or need to capture the peak stage of an emotion, which limits their practicality.
2. **Cost and technical limitations of event cameras**:
   - Event cameras perform well in low-light and high-dynamic-range environments and provide higher temporal resolution and dynamic range, but their high cost and relatively immature technology make them difficult to deploy widely in VR/AR devices.
3. **Performance improvement of lightweight models**:
   - How to use knowledge distillation to transfer the knowledge of a complex multimodal teacher network to a lightweight unimodal student network, so that emotion recognition is efficient while remaining accurate and robust.

### Solutions

To solve the above problems, the paper proposes a new Multimodality Synergistic Knowledge Distillation (MSKD) framework, which includes the following components:

- **Multimodal teacher network**: A complex multimodal teacher network is trained on event data and intensity frames, allowing it to extract rich spatio-temporal information.
- **Lightweight student network**: A lightweight unimodal student network uses only conventional intensity frames for inference, eliminating the need for expensive event cameras.
- **Synergistic knowledge distillation loss**: Two novel consistency losses, Hit Consistency and Temporal Consistency, keep the student network's prediction distribution consistent with the teacher's. Hit Consistency targets the timestamps at which the prediction is correct, and Temporal Consistency targets all timestamps.

With these components, the paper reports efficient and accurate single-eye emotion recognition that is suitable for resource-constrained devices and exceeds existing state-of-the-art methods on existing datasets.

### Summary

The main contribution of this paper is a framework that enables a lightweight student network to extract rich knowledge from a multimodal teacher network, achieving efficient and accurate single-eye emotion recognition with only a conventional camera. This in turn supports better user experience and emotional interaction in VR/AR applications.
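The summary describes the two consistency losses only in words, so the following is a minimal sketch of one plausible reading and not the paper's actual formulation. It assumes the teacher and student each emit class logits of shape `(T, C)` over `T` timestamps, uses a temperature-softened KL divergence as the matching term (a standard knowledge distillation choice, assumed here), treats "temporal" consistency as the average KL over all timestamps, and treats "hit" consistency as the average KL over timestamps where the teacher predicts the true label. The function names, the temperature value, and the masking rule are all hypothetical.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def kl_per_timestep(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) at each timestamp; inputs have shape (T, C)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0)
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)


def distillation_terms(teacher_logits, student_logits, label, temperature=2.0):
    """Return (temporal_term, hit_term) under the assumptions stated above.

    temporal_term: mean KL over all timestamps.
    hit_term: mean KL over timestamps where the teacher's argmax equals
              the ground-truth label (0.0 if there are no such timestamps).
    """
    kl = kl_per_timestep(teacher_logits, student_logits, temperature)
    temporal_term = float(kl.mean())

    # Assumed definition of a "hit": the teacher is correct at that timestamp.
    hits = np.argmax(np.asarray(teacher_logits), axis=-1) == label
    hit_term = float(kl[hits].mean()) if hits.any() else 0.0
    return temporal_term, hit_term
```

In a training loop, the two terms would typically be weighted and added to the student's ordinary classification loss; the weights are a tuning choice that the summary does not specify.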

Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition
