Abstract: Brain-machine interface (BMI) technology [1] offers an exciting means to study and communicate with the brain. Neural activity can be observed or recorded through a range of neurophysiological measuring techniques and apparatus, such as functional magnetic resonance imaging (fMRI), electroencephalography (EEG) and multi-electrode arrays. These neurophysiological measuring systems are becoming key components in many emerging neuro-prosthesis and neuro-rehabilitation applications. With the rapid advance of micro-electrode technology, the temporal and spatial resolution of electrode arrays has increased drastically [2]. This greatly enhances neural recording throughput and makes it possible to study large neural network ensembles. However, the dramatic increase in data bandwidth and data volume associated with multichannel recording requires significant computational effort. Because they involve statistical operations and iterative numerical procedures, most neural signal analysis algorithms [3] are highly computationally intensive. As a result, software-based approaches to multichannel neural signal analysis often require off-line processing. Reconfigurable systems, such as Field-Programmable Gate Arrays (FPGAs), embed massively parallel computational resources and provide an effective alternative for real-time neural signal processing and data mining of multichannel neural recordings. Because of the poor signal-to-noise ratio of the recorded action potentials and the complexity of spike encodings, neural signal processing and data mining usually comprise multiple steps of spike filtering, feature extraction and statistical computation. These complex signal processing routines are computationally expensive. As a result, power dissipation and hardware area pose a major challenge for reconfigurable system design.
In this poster, we present a reconfigurable kernel design methodology that exploits the self-similar nature of neural spikes and thus eliminates the need for temporary storage in signal processing. Three aspects are presented. First, we present a spike-streaming processing design principle that leads to efficient hardware implementation. This principle is exemplified by several commonly used neural signal analysis algorithms: spike feature extraction (principal component analysis (PCA)), covariance analysis (covariance matrix calculation), multichannel signal separation (independent component analysis (ICA)), and clustering (the k-means algorithm). Second, we present an FPGA-based hardware implementation methodology built on the streaming algorithms. The design of a streaming kernel for spike feature extraction serves as an example to illustrate memory reduction in streaming architecture design. Third, we evaluate the proposed streaming method against the traditional batch processing approach on the neural signal analysis algorithms listed above. Real clinical data, synthetic spike trains and synthetic spike times are used to verify the streaming method. The reductions in hardware resources and power consumption are also evaluated on Xilinx FPGA devices. The software evaluation results show that the proposed streaming method provides an approximation to the original batch processing algorithm. For spike train analysis, it achieves results similar to those of the original algorithm, owing to similarities within the spike train. The accuracy of the streaming method depends on the streaming window size, that is, the number of data samples used for streaming. The hardware evaluation results show that the memory and power saved by the streaming method depend on the algorithm and on how much data the batch processing method uses.
We use Xilinx System Generator as the design tool and perform power analysis with Xilinx XPower. Hardware resource utilization is reported by Xilinx ISE. The results show that the streaming method reduces power consumption by 16.6% to 54% when the algorithms are implemented on Virtex-6, and by 8.3% to 67% when they are implemented on Spartan-6. BRAM usage is also greatly reduced in all implementations.
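
The memory argument for streaming over batch processing can be illustrated with the covariance matrix calculation named in the abstract. The sketch below is not the authors' FPGA kernel: it uses a generic Welford-style online update, swapped in here purely for illustration, and every name in it (`StreamingCovariance`, `batch_covariance`) is hypothetical. Note also that this particular update is exact, whereas the abstract describes the proposed streaming method as an approximation whose accuracy depends on the window size. What the sketch does show is the storage difference: the streaming version keeps only a mean vector and a d x d accumulator, while the batch version needs every sample in memory.

```python
# Illustrative sketch only: a generic Welford-style online covariance update,
# not the authors' FPGA kernel. All names here are hypothetical.

class StreamingCovariance:
    """Running mean and covariance of d-dimensional samples in O(d^2) memory."""

    def __init__(self, dim):
        self.dim = dim
        self.count = 0
        self.mean = [0.0] * dim
        # Accumulated co-moments: running sum of deviation products.
        self.comoment = [[0.0] * dim for _ in range(dim)]

    def update(self, sample):
        """Fold one sample into the running statistics; the sample is not stored."""
        self.count += 1
        # Deviation from the mean before this sample is included.
        delta_old = [x - m for x, m in zip(sample, self.mean)]
        self.mean = [m + d / self.count for m, d in zip(self.mean, delta_old)]
        # Deviation from the mean after this sample is included.
        delta_new = [x - m for x, m in zip(sample, self.mean)]
        for i in range(self.dim):
            for j in range(self.dim):
                self.comoment[i][j] += delta_old[i] * delta_new[j]

    def covariance(self):
        """Unbiased sample covariance of everything seen so far."""
        if self.count < 2:
            raise ValueError("need at least two samples")
        return [[c / (self.count - 1) for c in row] for row in self.comoment]


def batch_covariance(samples):
    """Reference batch computation; requires all samples to be held in memory."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(dim)]
    return [
        [sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
         for j in range(dim)]
        for i in range(dim)
    ]


# Usage: feed samples one at a time, then compare against the batch result.
samples = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
stream = StreamingCovariance(dim=2)
for s in samples:
    stream.update(s)
streamed = stream.covariance()
reference = batch_covariance(samples)
```

In a hardware setting, the stored state corresponds to a fixed number of registers or a small memory block regardless of how many spikes are processed, which is the kind of saving the BRAM figures above refer to.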
