Enabling Fast and Memory-Efficient Acceleration for Pattern Matching Workloads: The Lightweight Automata Processing Engine
Lei Gong, Chao Wang, Haojun Xia, Xianglan Chen, Xi Li, Xuehai Zhou
DOI: https://doi.org/10.1109/tc.2022.3187338
IF: 3.183
2023-03-15
IEEE Transactions on Computers
Abstract: A growing number of pattern matching applications employ finite automata as their basic processing model. These applications match tens to thousands of patterns against large amounts of data, which poses a great challenge to conventional processors. Hardware-based solutions have therefore emerged and achieved high-throughput automata processing. However, existing methods generally struggle to achieve both processing speed and storage efficiency. They are often too heavy to be integrated into a small chip and must rely on off-chip DRAM or other high-capacity memories even on some simple data sets, leading to potential area and power consumption issues. In this paper, we focus on building a more lightweight automata processing engine, aiming to store the whole automata model in on-chip memory and run effectively and independently. We propose LAP, a lightweight automata processing engine. Powered by a novel automata model (A-DFA) and efficient packing algorithms, LAP achieves extremely high storage efficiency compared with traditional DFA. Meanwhile, we identify the key parallelization factors in the A-DFA model and propose a specialized microarchitecture with novel instructions to further accelerate the state transition process. As a result, LAP obtains a more effective trade-off between processing speed and storage efficiency. Evaluation results show that LAP achieves extremely high storage efficiency on simple data sets, exceeding IBM's RegX by 8×, and achieves significant improvements in processing speed, ranging from 1.32× to 1.91×, compared with previous lightweight hardware implementations. Moreover, LAP's hardware architecture scales well: a system with higher throughput can be built by increasing the number of cores.
We prototype a 16-core system on a Xilinx ZC702 FPGA and a 64-core system on a Xilinx ZCU102 FPGA. The prototype system on the ZC702 achieves 3.5 GB/s throughput on average on simple data sets, and the prototype system on the ZCU102 obtains higher throughput and compute density on some of the large datasets in ANMLZoo compared with modern in-memory NFA-based solutions.
engineering, electrical & electronic; computer science, hardware & architecture