Abstract: Over the past decade, deep learning has developed rapidly and is now widely used in applications across many fields. Model training is generally deployed in the cloud; training on small embedded devices is not recommended because of their lower-end hardware configurations, so these devices are usually designed for inference. In this paper, a new image recognition framework based on a heterogeneous multi-core accelerator is established to carry out the deep learning prediction process and improve the image recognition performance of embedded devices. First, the fundamental principles of deep-learning-based image recognition are reviewed as the basis of the study. Second, several important design elements of the parallel image recognition framework, which is built on a CPU-accelerator heterogeneous architecture, are proposed to improve recognition speed and computational resource efficiency: the data splitting strategy, the framework architecture, the data structure design, and data parallelism. Third, the combined Xilinx Zynq and Adapteva Epiphany hardware platform and the Rockchip RK3288 hardware platform are described in detail. Finally, a handwritten digit recognition experiment is conducted to evaluate the accuracy and performance of the framework. The experimental results show that the proposed image recognition system achieves a speedup of nearly 8 times when recognizing 28x28 images of the ten handwritten digits, and nearly 60 times when classifying 32x32 images into ten object classes, relative to the RK3288 board used as the control, which has four cores from the newest series of high-performance Arm CPUs (Arm A17).
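The abstract names a data splitting strategy and data parallelism but does not specify them. As a rough illustration only, the sketch below shows one generic way to split a batch of images into near-equal chunks and classify the chunks concurrently. Everything here is an assumption for illustration: the function names `split_batch`, `classify_chunk`, and `parallel_classify` are hypothetical, Python threads stand in for accelerator cores, and the "classifier" is a placeholder rather than a neural network.

```python
from concurrent.futures import ThreadPoolExecutor


def split_batch(items, num_workers):
    """Split items into num_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_workers)
    chunks, start = [], 0
    for worker in range(num_workers):
        size = base + (1 if worker < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def classify_chunk(chunk):
    """Hypothetical stand-in for per-core inference.

    Returns, for each flattened image, the index of its largest value
    (the first such index on ties). A real system would run the network here.
    """
    return [max(range(len(img)), key=img.__getitem__) for img in chunk]


def parallel_classify(images, num_workers=4):
    """Classify images chunk by chunk on a pool of workers, keeping input order."""
    chunks = split_batch(images, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields results in the same order as the submitted chunks.
        results = list(pool.map(classify_chunk, chunks))
    return [label for part in results for label in part]
```

Splitting into contiguous chunks keeps the output order identical to the input order, so the per-chunk results can simply be concatenated. Whether the framework described in the paper splits data this way is not stated in the abstract.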

Performance Evaluation of Deep Learning Frameworks on Embedded GPU

Explore Training of Deep Convolutional Neural Networks on Battery-powered Mobile Devices: Design and Application

Deep Learning on Mobile and Embedded Devices: State-of-the-art, Challenges, and Future Directions

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

Deep Learning Frameworks Evaluation for Image Classification on Resource Constrained Device

A Deep Learning Frame on Embedded Multicore Processors Based on Caffe and Its Parallel Implementation

Enabling High Performance Deep Learning Networks on Embedded Systems

A deep learning image recognition framework accelerator based parallel computing

A Deep Learning Framework Performance Evaluation to Use YOLO in Nvidia Jetson Platform

Implementation and evaluation of deep neural networks (DNN) on mainstream heterogeneous systems

Cross Hardware-Software Boundary Exploration for Scalable and Optimized Deep Learning Platform Design

Benchmarking State-of-the-Art Deep Learning Software Tools

Performance of Convolution Neural Network based on Multiple GPUs with Different Data Communication Models

Performance Evaluation of Deep Learning Classification Network for Image Features

A Power Efficient Neural Network Implementation on Heterogeneous FPGA and GPU Devices

Benchmarking Deep Learning Frameworks and Investigating FPGA Deployment for Traffic Sign Classification and Detection

Performance and Power Evaluation of AI Accelerators for Training Deep Learning Models

Energy Efficiency of Machine Learning in Embedded Systems Using Neuromorphic Hardware

SingleCaffe: an Efficient Framework for Deep Learning on a Single Node

A Deep Residual Networks Accelerator on FPGA