Abstract: Lattice-Based Cryptography (LBC) schemes, such as CRYSTALS-Kyber and CRYSTALS-Dilithium, have been selected for standardization in the NIST Post-Quantum Cryptography standard. However, implementing these schemes on resource-constrained Internet-of-Things (IoT) devices is challenging because of the competing demands of efficiency, power consumption, area overhead, and the flexibility to support various operations and parameter settings. Some existing ASIC designs that prioritize low power and area cannot achieve optimal performance efficiency, which makes them impractical for battery-powered devices. Custom hardware accelerators in prior co-processor and processor designs have limited applicability and flexibility, and they incur significant area and power overheads for IoT devices. To address these challenges, this paper presents an efficient lattice-based cryptography processor with customized Single-Instruction-Multiple-Data (SIMD) instructions. First, our proposed SIMD architecture supports efficient parallel execution of various polynomial operations in 256-bit mode and acceleration of Keccak in 320-bit mode, with both modes reusing the same hardware resources. Additionally, we introduce data shuffling hardware units to resolve data dependencies within SIMD data. To further enhance performance, we design a dual-issue path for memory accesses, together with corresponding software design methodologies, to reduce the impact of data load/store blocking. Through a hardware/software co-design approach, our proposed processor achieves high efficiency in supporting all operations in lattice-based cryptography schemes. Evaluations of Kyber and Dilithium show that our proposed processor achieves over 10x speedup compared with the baseline RISC-V processor and over 5x speedup versus ARM Cortex M4 implementations, making it a promising solution for securing IoT communications and storage. Moreover, silicon synthesis results show that our design can run at 200 MHz while consuming 2.01 mW for Kyber KEM 512 and 2.13 mW for Dilithium 2, which outperforms state-of-the-art works in terms of PPAP (Performance x Power x Area).

Morphling: A Reconfigurable Architecture for Tensor Computation

Morph: Flexible Acceleration for 3D CNN-based Video Understanding

MorphoSys: an Integrated Re-configurable Architecture

Modeling, Implementation and Scalability of the Morphosys Dynamically Reconfigurable Computing Architecture

Accelerating Edge AI with Morpher: An Integrated Design, Compilation and Simulation Framework for CGRAs

AIM: Accelerating Arbitrary-precision Integer Multiplication on Heterogeneous Reconfigurable Computing Platform Versal ACAP

Enabling One-Size-Fits-All Compilation Optimization for Inference Across Machine Learning Computers

Enabling One-size-fits-all Compilation Optimization across Machine Learning Computers for Inference

Quartet: A 22nm 0.09mJ/Inference Digital Compute-in-Memory Versatile AI Accelerator with Heterogeneous Tensor Engines and Off-Chip-Less Dataflow

TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture

Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization

CHARM: Composing Heterogeneous Accelerators for Matrix Multiply on Versal ACAP Architecture

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR

Breaking Liebig's Law: An Advanced Multipurpose Neuromorphic Engine

A Multi-Layer Parallel Hardware Architecture for Homomorphic Computation in Machine Learning

Anole: A Highly Efficient Dynamically Reconfigurable Crypto-Processor for Symmetric-Key Algorithms

A Highly-efficient Lattice-based Post-Quantum Cryptography Processor for IoT Applications

MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition

FantastIC4: A Hardware-Software Co-Design Approach for Efficiently Running 4bit-Compact Multilayer Perceptrons

Improving Transformer Inference Through Optimized Non-Linear Operations with Quantization-Approximation-Based Strategy

MARCA: Mamba Accelerator with ReConfigurable Architecture