Abstract: Face analysis has been studied from different angles to infer emotion, poses, shapes, and landmarks. RGB cameras are traditionally used, yet for fine-grained tasks standard sensors may fall short because of their latency, which makes it impossible to record and detect the micro-movements that carry the highly informative signal needed to infer a subject's true emotions. Event cameras have been gaining interest as a possible solution for this and similar high-frame-rate tasks. We propose a novel spatiotemporal Vision Transformer model that uses Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) to enhance the accuracy of Action Unit classification from event streams. We also address the lack of labeled event data in the literature, which can be considered one of the main causes of the existing gap between the maturity of RGB and neuromorphic vision models. Gathering data is harder in the event domain because it cannot be crawled from the web, and frame labeling must take into account event aggregation rates and the fact that static parts might not be visible in certain frames. To this end, we present FACEMORPHIC, a temporally synchronized multimodal face dataset composed of RGB videos and event streams. The dataset is annotated at the video level with facial Action Units and contains streams collected with a variety of applications in mind, ranging from 3D shape estimation to lip-reading. We then show how temporal synchronization can allow effective neuromorphic face analysis without the need to manually annotate videos: we instead leverage cross-modal supervision, bridging the domain gap by representing face shapes in a 3D space. Our proposed model outperforms baseline methods by effectively capturing spatial and temporal information, which is crucial for recognizing subtle facial micro-expressions.

Spatio-Temporal Aggregation Transformer for Object Detection with Neuromorphic Vision Sensors

A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

SpikingViT: a Multi-scale Spiking Vision Transformer Model for Event-based Object Detection

An Event-Driven Object Recognition Model Using Activated Connected Domain Detection

Scene Adaptive Sparse Transformer for Event-based Object Detection

An Event-based Categorization Model Using Spatio-temporal Features in a Spiking Neural Network

Spiking Transformers for Event-based Single Object Tracking

Adaptive sparse attention-based compact transformer for object tracking

An Event-Driven Computational System With Spiking Neurons For Object Recognition

Asynchronous Spatio-Temporal Memory Network for Continuous Event-Based Object Detection

HDI-Former: Hybrid Dynamic Interaction ANN-SNN Transformer for Object Detection Using Frames and Events

SODFormer: Streaming Object Detection with Transformer Using Events and Frames

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

GET: Group Event Transformer for Event-Based Vision

Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams

DS2TA: Denoising Spiking Transformer with Attenuated Spatiotemporal Attention

STAT: Spatial-Temporal Attention Mechanism for Video Captioning

Spatio-temporal Transformers for Action Unit Classification with Event Cameras

Real Time Visual Tracking using Spatial-Aware Temporal Aggregation Network