Abstract: Deep neural networks (DNNs) have been established as the state-of-the-art method for advanced machine learning applications. Capsule networks (CapsNets), recently proposed by the Google Brain team, improve on the generalization ability of DNNs thanks to their multidimensional capsules and their preservation of the spatial relationships between different objects. However, they impose significantly high computational and memory requirements, which makes energy-efficient inference a challenging task. This article provides, for the first time, an in-depth analysis of the design and runtime challenges for the on-chip scratchpad memories deployed in hardware accelerators that execute fast CapsNet inference. To enable an efficient design, we propose an application-specific memory architecture, called DESCNet, which minimizes off-chip memory accesses while efficiently feeding data to the hardware accelerator executing CapsNet inference. We analyze the corresponding on-chip memory requirements and use them to propose a methodology for exploring different scratchpad memory (SPM) designs and their energy/area tradeoffs. Afterward, an application-specific power-gating technique is applied to the on-chip SPM to further reduce its energy consumption, depending on the mapped dataflow of the CapsNet and the SPM utilization across the different operations of its processing. We integrated our DESCNet memory design, as well as another state-of-the-art memory design (Marchisio et al. [2018]) for comparison studies, with an open-source DNN accelerator executing Google’s CapsNet model (Sabour et al. [2017]) for the MNIST dataset. We also enhanced the design to execute the recent deep CapsNet model (Rajasegaran et al. [2019]) for the CIFAR10 dataset. Note that we use the same benchmarks and test conditions for which these CapsNets were proposed and evaluated by their respective teams.
The complete hardware is synthesized for a 32-nm CMOS technology using the ASIC design flow with Synopsys tools and CACTI-P, and detailed area, performance, and power/energy estimations are performed for different configurations. For a selected Pareto-optimal solution, our results demonstrate no performance loss and an energy reduction of 79% for the complete accelerator, including computational units and memories, when compared to the state-of-the-art design.
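
To illustrate what selecting a Pareto-optimal solution from an energy/area design space can look like, here is a minimal Python sketch. It is not the authors' methodology: the `SpmConfig` type, the `dominates` and `pareto_front` helpers, and all numeric values are hypothetical and chosen only for illustration. A configuration is kept if no other configuration is at least as good in both energy and area and strictly better in one of them.

```python
from typing import List, NamedTuple


class SpmConfig(NamedTuple):
    """Hypothetical SPM design point; names, values, and units are illustrative only."""
    name: str
    energy_mj: float   # assumed energy per inference
    area_mm2: float    # assumed on-chip area


def dominates(a: SpmConfig, b: SpmConfig) -> bool:
    """True if `a` is no worse than `b` in both metrics and strictly better in one."""
    no_worse = a.energy_mj <= b.energy_mj and a.area_mm2 <= b.area_mm2
    strictly_better = a.energy_mj < b.energy_mj or a.area_mm2 < b.area_mm2
    return no_worse and strictly_better


def pareto_front(configs: List[SpmConfig]) -> List[SpmConfig]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up design points, not results from the paper.
candidates = [
    SpmConfig("A", 4.0, 1.0),
    SpmConfig("B", 3.0, 1.5),
    SpmConfig("C", 3.5, 2.0),  # worse than B in both energy and area
    SpmConfig("D", 2.0, 3.0),
]

front = pareto_front(candidates)
front_names = [c.name for c in front]
print(front_names)
```

In this toy example, configuration C is discarded because B beats it on both metrics; the remaining points each represent a different energy/area tradeoff from which a designer would pick one.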

Energy Efficiency of Scratch-Pad Memory in Deep Submicron Domains: an Empirical Study

Energy Efficiency of Scratch-Pad Memory at 65 nm and Below: An Empirical Study

Study on the Low Power Cache Design Technique

Adaptive Energy-Aware Design Of A Multi-Bank Flash-Memory Storage System

Uncovering Phase Change Memory Energy Limits by Sub-Nanosecond Probing of Power Dissipation Dynamics

A Cache Reconfiguration Approach for Saving Leakage and Refresh Energy in Embedded DRAM Caches

Analyzing the Performance of 6T SRAM Cell and 64×64 Memory Array at Lower Technology Nodes for Low Power Design

Low-Power Low-Latency Data Allocation for Hybrid Scratch-Pad Memory

Energy Saving Techniques for Phase Change Memory (PCM)

Delay-Hiding energy management mechanisms for DRAM

An Energy-Efficient Heterogeneous Memory Architecture for Future Dark Silicon Embedded Chip-Multiprocessors

Energy- and Endurance-Aware Design of Phase Change Memory Caches

Efficient utilization of scratch-pad memory for embedded systems

Well Utilization of Cache-Aware Scratchpad Concerning the Influence of the Whole Embedded System

Efficient Utilization of Scratch-Pad Memory Banks

DESCNet: Developing Efficient Scratchpad Memories for Capsule Network Hardware

Designing Scratchpad Memory Architecture with Emerging STT-RAM Memory Technologies

Temperature-dependent Optimization of Cache Leakage Power Dissipation

Significant Power Consumption Reduction and Speed Boosting in Phase Change Memory with Nanocurrent Channels

Energy versus Output Quality of Non-volatile Writes in Intermittent Computing

Leakage Power Reduction Of Adiabatic Circuits Based On Finfet Devices