Using CSNNs to Perform Event-based Data Processing & Classification on ASL-DVS

Ria Patel, Sujit Tripathy, Zachary Sublett, Seoyoung An, Riya Patel
2024-08-01
Abstract: Recent advancements in bio-inspired visual sensing and neuromorphic computing have led to the development of various highly efficient bio-inspired solutions with real-world applications. One notable application integrates event-based cameras with spiking neural networks (SNNs) to process event-based sequences that are asynchronous and sparse, making them difficult to handle. In this project, we develop a convolutional spiking neural network (CSNN) architecture that leverages convolutional operations and recurrent properties of a spiking neuron to learn the spatial and temporal relations in the ASL-DVS gesture dataset. The ASL-DVS gesture dataset is a neuromorphic dataset containing hand gestures for 24 letters (A to Y, excluding J and Z due to the nature of their symbols) of American Sign Language (ASL). We performed classification on a pre-processed subset of the full ASL-DVS dataset to identify letter signs and achieved 100% training accuracy. Specifically, this was achieved by training on the Google Cloud compute platform with a learning rate of 0.0005, a batch size of 25 (20 batches in total), 200 iterations, and 10 epochs.
Neural and Evolutionary Computing, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is how to build a Convolutional Spiking Neural Network (CSNN) architecture that processes and classifies gesture event data from the ASL-DVS dataset. Specifically, the researchers use a CSNN to learn the spatio-temporal relationships in American Sign Language (ASL) gesture data. These data are captured by event-based cameras (dynamic vision sensors, DVS) and are asynchronous and sparse, which makes them difficult for traditional processing methods to handle.

### Decomposition of the main problems:

1. **Asynchronous and sparse data processing**:
   - The data captured by event-based cameras (DVS) is asynchronous and sparse. Unlike a traditional frame-based camera, a DVS records only per-pixel change events, so processing this type of data effectively is a challenge.
2. **Spatio-temporal feature extraction**:
   - The gestures in the ASL-DVS dataset have strong spatio-temporal correlations. Extracting these features with a CSNN and using them for classification is the second key problem.
3. **High-accuracy classification**:
   - The researchers aim for high-accuracy gesture classification on the ASL-DVS dataset by training the CSNN model. They ran their experiments on the Google Cloud platform and report 100% training accuracy on a pre-processed subset of the dataset.

### Overview of solutions:

- **CSNN architecture design**:
  - The architecture combines convolutional operations with the recurrent properties of spiking neurons to learn the spatio-temporal relationships in ASL-DVS gesture data.
- **Data pre-processing**:
  - The ASL-DVS dataset was pre-processed, including converting the original AEDAT format to CSV for easier data exploration and processing.
- **Training and optimization**:
  - The model was trained on the Google Cloud platform using the Adam optimizer and the Mean Squared Error (MSE) loss function.
By adjusting hyperparameters such as the learning rate and batch size, high training and validation accuracy were achieved.

### Formula presentation:

In the CSNN, the membrane potential of a Leaky Integrate-and-Fire (LIF) neuron evolves according to:

\[ \tau \frac{dU(t)}{dt} = -U(t) + R I_{in}(t) \]

where:

- \( U(t) \) is the membrane potential at time \( t \),
- \( R \) is the membrane resistance,
- \( I_{in}(t) \) is the input current,
- \( \tau \) is the time constant.

The discrete-time membrane potential update is:

\[ U(t) = \beta U(t-1) + (1-\beta) I_{in}(t) \]

where \( \beta \) is the decay rate of the membrane potential. For example, a forward-Euler discretization of the equation above with step \( \Delta t \) and \( R = 1 \) gives \( \beta = 1 - \Delta t / \tau \).

The convolutional operation can be written as:

\[ y[i, j] = \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} x[i+m, j+n] \cdot K[m, n] \]

where:

- \( y[i, j] \) is the convolved feature map,
- \( x \) is the input tensor,
- \( K \) is the convolution kernel, which is zero outside its finite support.

With these methods, the researchers report 100% training accuracy for gesture recognition on a pre-processed subset of ASL-DVS, which demonstrates the potential of CSNNs for processing event-driven data.
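
The convolution sum and the LIF update above can be combined into a minimal sketch of a single convolutional spiking layer: each input frame is convolved with a kernel, and the result drives a grid of LIF neurons through the discrete update. This is an illustration only, not the authors' implementation. The function names, the firing threshold, the reset-to-zero rule, and the toy 3×3 input are all assumptions; the summary specifies only the membrane update and the convolution formula.

```python
from typing import List, Tuple

Grid = List[List[float]]


def conv2d_valid(x: Grid, kernel: Grid) -> Grid:
    """y[i][j] = sum_m sum_n x[i+m][j+n] * K[m][n], restricted to the finite
    kernel support and to positions where the kernel fits inside x."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(x[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def lif_step(u_prev: Grid, i_in: Grid, beta: float,
             threshold: float) -> Tuple[Grid, List[List[int]]]:
    """One update per neuron: U(t) = beta * U(t-1) + (1 - beta) * I_in(t).

    Assumption: a neuron spikes when U(t) >= threshold and is then reset to
    zero. The summary does not state the threshold or the reset rule.
    """
    u_next, spikes = [], []
    for u_row, i_row in zip(u_prev, i_in):
        u_out, s_out = [], []
        for u, i in zip(u_row, i_row):
            u_new = beta * u + (1.0 - beta) * i
            fired = u_new >= threshold
            s_out.append(1 if fired else 0)
            u_out.append(0.0 if fired else u_new)
        u_next.append(u_out)
        spikes.append(s_out)
    return u_next, spikes


def run_conv_lif(frames: List[Grid], kernel: Grid, beta: float = 0.5,
                 threshold: float = 1.0) -> List[List[int]]:
    """Feed a sequence of frames through conv + LIF; return spike counts."""
    out_h = len(frames[0]) - len(kernel) + 1
    out_w = len(frames[0][0]) - len(kernel[0]) + 1
    u = [[0.0] * out_w for _ in range(out_h)]
    counts = [[0] * out_w for _ in range(out_h)]
    for frame in frames:
        current = conv2d_valid(frame, kernel)  # spatial features
        u, spikes = lif_step(u, current, beta, threshold)  # temporal state
        for r in range(out_h):
            for c in range(out_w):
                counts[r][c] += spikes[r][c]
    return counts


# Toy input: the same 3x3 binary event frame repeated for three time steps.
frame = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]
kernel = [[1, 1],
          [1, 1]]
print(run_conv_lif([frame] * 3, kernel, beta=0.5, threshold=1.5))
# prints [[3, 1], [1, 0]]
```

In this toy run, the neuron over the fully active patch fires at every step, the two neurons with half-active patches need two steps of integration before they cross the threshold, and the weakest neuron never fires. The membrane potential carried between frames is what lets the layer respond to how input accumulates over time, not only to a single frame.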