A High-Throughput Hardware Accelerator for Lempel-Ziv 4 Compression Algorithm

Tao Chen, Suwen Song, Zhongfeng Wang
2024-09-19
Abstract: This paper delves into recent hardware implementations of the Lempel-Ziv 4 (LZ4) algorithm, highlighting two key factors that limit the throughput of single-kernel compressors. Firstly, the actual parallelism exhibited in single-kernel designs falls short of the theoretical potential. Secondly, the clock frequency is constrained by the presence of feedback loops. To tackle these challenges, we propose a novel scheme that restricts each parallelization window to a single match, thus elevating the level of actual parallelism. Furthermore, by restricting the maximum match length, we eliminate the feedback loops within the architecture, enabling a significant boost in throughput. Finally, we present a high-speed hardware architecture. The implementation results demonstrate that the proposed architecture achieves a throughput of up to 16.10 Gb/s, exhibiting a 2.648x improvement over the state-of-the-art. The new design results in only an acceptable compression ratio reduction, ranging from 4.93% to 11.68% with various numbers of hash table entries, compared to the LZ4 compression ratio achieved by official software implementations disclosed on GitHub.
Hardware Architecture, Signal Processing
### What problems does this paper attempt to solve?

This paper targets two bottlenecks in hardware implementations of the Lempel-Ziv 4 (LZ4) compression algorithm that limit compression throughput:

1. **Insufficient actual parallelism**: The actual parallelism of single-kernel designs falls short of its theoretical potential. To optimize the compression ratio, some existing parallel architectures adopt a FIFO structure or adjust the starting address of the parallelization window. As a result, the effective parallelism is lower than the theoretical parallelism (the parallelization window size, PWS), which limits the overall throughput.
2. **Limited clock frequency**: Feedback loops, in particular the time dependence of the address signal in the match-extension stage, prevent the clock frequency from being raised by inserting pipeline stages, which caps the achievable throughput.

To solve these problems, the authors propose two schemes:

- **Enhancing parallelism**: Each parallelization window is restricted to a single match. This keeps the parallelism of every component consistent and avoids the loss of parallelism caused by multiple non-overlapping matches in one window.
- **Eliminating feedback loops**: The maximum match length is limited, which removes the feedback loops from the architecture. The circuit becomes fully feed-forward, so the clock frequency can be increased by inserting pipeline stages.

With these changes, the proposed architecture achieves a throughput of up to 16.10 Gb/s on an FPGA, a 2.648x improvement over the best existing architecture. The changes reduce the compression ratio by about 4.93% to 11.68%, but in many application scenarios throughput matters more than compression ratio.
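The two restrictions described above (at most one match per parallelization window, and a capped match length) can be illustrated with a toy software model. This is a minimal sketch under stated assumptions, not the authors' hardware architecture and not the LZ4 block format: the names `compress_windowed`, `decompress`, `pws`, and `max_match` are hypothetical, a Python dictionary stands in for the hash table, and the default values are chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

MIN_MATCH = 4  # LZ4 matches are at least 4 bytes long

# (literals, offset, match_length); offset == 0 means "literals only"
Token = Tuple[bytes, int, int]


def compress_windowed(data: bytes, pws: int = 8, max_match: int = 8) -> List[Token]:
    """Toy model: at most one match per window of `pws` bytes,
    and no match longer than `max_match` bytes (both hypothetical knobs)."""
    table: Dict[bytes, int] = {}  # 4-byte sequence -> most recent position
    tokens: List[Token] = []
    covered = 0  # first input position not yet covered by an emitted token

    for start in range(0, len(data), pws):
        end = min(start + pws, len(data))
        match: Optional[Tuple[int, int, int]] = None  # (position, offset, length)

        for pos in range(start, end):
            key = data[pos:pos + MIN_MATCH]
            if len(key) < MIN_MATCH:
                break  # too close to the end of the input to start a match
            cand = table.get(key)
            table[key] = pos
            # Keep only the first usable match found in this window.
            if match is None and cand is not None and pos >= covered:
                length = MIN_MATCH
                # Bounded extension: the loop can never run past max_match.
                while (length < max_match and pos + length < len(data)
                       and data[cand + length] == data[pos + length]):
                    length += 1
                match = (pos, pos - cand, length)

        if match is not None:
            pos, offset, length = match
            tokens.append((data[covered:pos], offset, length))
            covered = pos + length

    if covered < len(data):
        tokens.append((data[covered:], 0, 0))  # trailing literals
    return tokens


def decompress(tokens: List[Token]) -> bytes:
    """Inverse of compress_windowed; copies byte by byte so that
    overlapping matches are reproduced correctly."""
    out = bytearray()
    for literals, offset, length in tokens:
        out += literals
        for _ in range(length):
            out.append(out[-offset])
    return bytes(out)
```

In this model the cost of the restrictions is visible directly: a long repeat is split into several short matches, and any further match inside a window that already produced one is emitted as literals, which is consistent with the compression-ratio loss the summary reports. The software loop says nothing about timing; the pipelining benefit is a property of the hardware design only.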
### Formula summary

The main formula in this paper is the definition of the compression ratio:

\[ \text{Compression ratio}=\frac{\text{Average size of all original files}}{\text{Average size of all compressed files}} \]

### Conclusion

By limiting the number of matches per parallelization window and capping the maximum match length, this paper addresses the throughput bottlenecks in hardware implementations of the LZ4 compression algorithm. Compression speed improves significantly, at the cost of a somewhat lower compression ratio.
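The compression-ratio definition from the formula summary can be made concrete with a short calculation. The sketch below uses hypothetical helper names (`compression_ratio`, `relative_reduction`) and made-up file sizes. It also assumes that the reported 4.93% to 11.68% reduction is measured relative to the software baseline, which this summary does not state explicitly.

```python
from typing import Sequence


def compression_ratio(original_sizes: Sequence[int],
                      compressed_sizes: Sequence[int]) -> float:
    """Average original file size divided by average compressed file size."""
    if not original_sizes or len(original_sizes) != len(compressed_sizes):
        raise ValueError("need two non-empty size lists of equal length")
    avg_original = sum(original_sizes) / len(original_sizes)
    avg_compressed = sum(compressed_sizes) / len(compressed_sizes)
    return avg_original / avg_compressed


def relative_reduction(baseline_ratio: float, new_ratio: float) -> float:
    """Percentage drop of new_ratio relative to baseline_ratio (assumed metric)."""
    return (baseline_ratio - new_ratio) / baseline_ratio * 100.0


# Hypothetical sizes in bytes, for illustration only.
baseline = compression_ratio([1000, 3000], [500, 1500])
restricted = compression_ratio([1000, 3000], [550, 1650])
print(f"baseline ratio:   {baseline:.3f}")
print(f"restricted ratio: {restricted:.3f}")
print(f"reduction:        {relative_reduction(baseline, restricted):.2f}%")
```

Because both averages are taken over the same number of files, the ratio equals total original bytes divided by total compressed bytes, so large files weigh more than small ones in this metric.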