A Lightweight Spatiotemporal Network for Online Eye Tracking with Event Camera

Yan Ru Pei, Sasskia Brüers, Sébastien Crouzet, Douglas McLelland, Olivier Coenen
2024-04-13
Abstract: Event-based data are commonly encountered in edge computing environments where efficiency and low latency are critical. To interface with such data and leverage their rich temporal features, we propose a causal spatiotemporal convolutional network. This solution targets efficient implementation on edge-appropriate hardware with limited resources in three ways: 1) it deliberately uses a simple architecture and set of operations (convolutions, ReLU activations); 2) it can be configured to perform online inference efficiently via buffering of layer outputs; 3) it can achieve more than 90% activation sparsity through regularization during training, enabling very significant efficiency gains on event-based processors. In addition, we propose a general affine augmentation strategy acting directly on the events, which alleviates the problem of dataset scarcity for event-based systems. We apply our model to the AIS 2024 event-based eye tracking challenge, reaching a score of 0.9916 p10 accuracy on the Kaggle private test set.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses online eye tracking with event cameras in edge computing environments, particularly in applications that require high efficiency and low latency. To better handle the data generated by event cameras and leverage their rich temporal features, the authors propose a causal spatiotemporal convolutional network. The main contributions of the paper are:

1. **Design of a Lightweight Spatiotemporal Network**: A fully causal, lightweight spatiotemporal neural network that performs efficient online inference on streaming data through a FIFO buffer, without the need to store all time frames.
2. **Causal Event Volume Binning Strategy**: A causal event volume binning strategy that minimizes latency and avoids excessive buffering of the event stream during online inference.
3. **Increased Activation Sparsity**: L1 regularization during training raises the fraction of zero-valued activations in each layer's output to more than 90%, which enables efficient inference on processors that can exploit this sparsity.
4. **Normalization Strategy**: Alternating BatchNorm and GroupNorm layers are used while maintaining complete causality during inference.

The paper applies the proposed model to the AIS 2024 Event-based Eye-tracking Challenge and achieves a p10 accuracy of 0.9916 on the Kaggle private test set. It also reviews related work, including different event binning methods, spatiotemporal networks, and lightweight detection heads, and describes in detail how the event data are processed, how the network architecture is designed, and how it is configured for online inference. Finally, the paper reports a series of experiments that measure the contribution of each component to the final result, with a detailed analysis.
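The FIFO-buffered online inference mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single depthwise (per-channel) causal temporal convolution in NumPy, and the names `causal_conv_offline` and `StreamingCausalConv` are hypothetical. The point it shows is that keeping only the last K frames in a FIFO buffer reproduces, frame by frame, the output of a causal convolution applied to the whole sequence.

```python
from collections import deque

import numpy as np


def causal_conv_offline(x, kernel):
    """Causal depthwise temporal convolution over a full sequence.

    x has shape (T, C) and kernel has shape (K, C). The output at time t
    depends only on x[t-K+1 .. t]; missing past frames are treated as zeros.
    """
    K = kernel.shape[0]
    padded = np.concatenate([np.zeros((K - 1, x.shape[1])), x], axis=0)
    return np.stack(
        [(padded[t:t + K] * kernel).sum(axis=0) for t in range(x.shape[0])]
    )


class StreamingCausalConv:
    """Hypothetical streaming version: stores only the last K frames."""

    def __init__(self, kernel):
        self.kernel = kernel
        K, C = kernel.shape
        # FIFO buffer pre-filled with zeros; maxlen=K drops the oldest frame
        # automatically whenever a new one is appended.
        self.buffer = deque([np.zeros(C) for _ in range(K)], maxlen=K)

    def step(self, frame):
        """Consume one new frame of shape (C,) and return one output frame."""
        self.buffer.append(frame)
        window = np.stack(list(self.buffer))  # shape (K, C), oldest first
        return (window * self.kernel).sum(axis=0)


# Usage: feed frames one at a time and compare with the offline result.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))       # 10 time steps, 3 channels
kernel = rng.normal(size=(4, 3))   # temporal kernel of length 4

layer = StreamingCausalConv(kernel)
streamed = np.stack([layer.step(frame) for frame in x])
offline = causal_conv_offline(x, kernel)
```

In this sketch the memory cost per layer is K frames regardless of how long the stream runs, which is the property that makes buffered online inference attractive on resource-limited hardware.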