TCPM: A Reconfigurable and Efficient Toom-Cook-Based Polynomial Multiplier over Rings Using a Novel Compressed Postprocessing Algorithm
Jianfei Wang, Chen Yang, Fahong Zhang, Yishuo Meng, Yang Su
DOI: https://doi.org/10.1109/tvlsi.2023.3277865
2023-01-01
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Abstract: Polynomial multiplication over rings is a significant bottleneck of ring learning with errors (RLWE)-based encryption. Three algorithms are widely used to speed it up: the number theoretic transform (NTT), Schoolbook, and Toom-Cook. Compared with Schoolbook and NTT, Toom-Cook can achieve a better trade-off between performance and flexibility. However, Toom-Cook postprocessing contains many redundant steps and calculations that have not been eliminated. Therefore, we propose an efficient, compressed, and fused Toom-Cook postprocessing algorithm that reduces the number of steps and eliminates at least 33.33% of the arithmetic operations of postprocessing. A highly reconfigurable and efficient Toom-Cook-based polynomial multiplier (TCPM) is proposed to speed up polynomial multiplication over rings. In TCPM, a high-throughput and efficient heterogeneous processing element (PE) array is designed to exploit the parallelism of Toom-Cook, and, based on the compressed algorithm, the PE array for postprocessing is scaled down. In addition, because it is provided with a reconfigurable evaluation module, a flexible polynomial data storage module, and a universal PE array, TCPM can efficiently map and execute Toom-Cook-2, 3, and 4 on a unified hardware architecture. Implemented on the Xilinx VC709 field-programmable gate array (FPGA) platform, TCPM can perform a Toom-Cook-4-based $256\times256$ polynomial multiplication over rings, with a modulus that is either a power of two or a prime, every 3.28 $\mu \text{s}$ at a 360-MHz clock frequency. It achieves a $2.47\times $ to $50.11\times $ speedup compared with the previous designs.
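To make the Toom-Cook stages mentioned in the abstract concrete, the following is a minimal software sketch of one level of Toom-Cook-2 (equivalent to Karatsuba), showing the split, evaluation, pointwise multiplication, and postprocessing (interpolation and recomposition) steps. It is not the paper's compressed postprocessing algorithm or its hardware mapping. The function names `schoolbook`, `toom2`, and `reduce_ring` are hypothetical, and the ring $\mathbb{Z}_q[x]/(x^n+1)$ with the sample modulus `q = 8192` is an assumption chosen for illustration; the abstract does not specify the reduction polynomial.

```python
def schoolbook(a, b):
    """Plain product of two coefficient lists (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def toom2(a, b):
    """One level of Toom-Cook-2 on two equal, even-length coefficient lists."""
    n = len(a)
    h = n // 2
    # Splitting: a(x) = a0(x) + a1(x) * x^h, likewise for b.
    a0, a1 = a[:h], a[h:]
    b0, b1 = b[:h], b[h:]
    # Evaluation at the points 0, 1, and infinity.
    a_sum = [x + y for x, y in zip(a0, a1)]
    b_sum = [x + y for x, y in zip(b0, b1)]
    # Pointwise multiplication: three half-size products instead of four.
    p0 = schoolbook(a0, b0)
    p1 = schoolbook(a_sum, b_sum)
    pinf = schoolbook(a1, b1)
    # Postprocessing, part 1 (interpolation): recover the middle term.
    mid = [m - x - y for m, x, y in zip(p1, p0, pinf)]
    # Postprocessing, part 2 (recomposition): overlap-add the three parts.
    out = [0] * (2 * n - 1)
    for i, c in enumerate(p0):
        out[i] += c
    for i, c in enumerate(mid):
        out[i + h] += c
    for i, c in enumerate(pinf):
        out[i + 2 * h] += c
    return out


def reduce_ring(c, n, q):
    """Reduce a product modulo x^n + 1 and modulo q (assumed ring)."""
    r = [0] * n
    for i, v in enumerate(c):
        if i < n:
            r[i] += v
        else:
            r[i - n] -= v  # x^n is congruent to -1
    return [v % q for v in r]


a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
product = reduce_ring(toom2(a, b), n=4, q=8192)
```

Toom-Cook-2 needs no divisions during interpolation, which is why it works unchanged for a power-of-two modulus. Toom-Cook-3 and Toom-Cook-4 use more evaluation points, and their interpolation involves divisions by small constants, so they need extra care under such a modulus; this sketch does not cover those cases.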