Abstract: This paper examines recent hardware implementations of the Lempel-Ziv 4 (LZ4) algorithm and highlights two key factors that limit the throughput of single-kernel compressors. First, the actual parallelism achieved in single-kernel designs falls short of the theoretical potential. Second, the clock frequency is constrained by the presence of feedback loops. To tackle these challenges, we propose a novel scheme that restricts each parallelization window to a single match, thereby raising the level of actual parallelism. Furthermore, by restricting the maximum match length, we eliminate the feedback loops within the architecture, enabling a significant boost in throughput. Finally, we present a high-speed hardware architecture. The implementation results demonstrate that the proposed architecture achieves a throughput of up to 16.10 Gb/s, a 2.648x improvement over the state of the art. The new design incurs an acceptable compression ratio reduction ranging from 4.93% to 11.68%, depending on the number of hash table entries, compared to the LZ4 compression ratio achieved by the official software implementation published on GitHub.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses two main bottlenecks in hardware implementations of the Lempel-Ziv 4 (LZ4) compression algorithm in order to improve compression throughput:

1. **Insufficient actual parallelism**: The actual parallelism of single-kernel designs falls short of its theoretical potential. In existing parallel architectures, some implementations use a FIFO structure or adjust the starting address of the parallelization window in order to optimize the compression ratio. As a result, the effective parallelism is lower than the theoretical parallelism (PWS), which limits the overall throughput.
2. **Limited clock frequency**: Feedback loops, in particular the time dependence of the address signal in the match extension stage, prevent the frequency from being raised further through pipeline insertion, which caps the achievable throughput.

To solve these problems, the authors propose two schemes:

- **Enhancing parallelism**: Each parallelization window is restricted to a single match. This keeps the parallelism of every component consistent and avoids the loss of parallelism caused by multiple non-overlapping matches.
- **Eliminating feedback loops**: The maximum match length is limited, which removes the feedback loops from the architecture. The circuit becomes fully feed-forward, so the frequency can be increased by inserting pipeline stages.

Together, these improvements enable the proposed architecture to achieve a throughput of up to 16.10 Gb/s on an FPGA, a 2.648x improvement over the best existing architecture. They also reduce the compression ratio by about 4.93% to 11.68%, but in many application scenarios throughput matters more than compression ratio.
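The two restrictions described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware architecture: the names `compress`, `decompress`, `pws`, and `max_match` are hypothetical, a Python dictionary stands in for the hash table, the token layout is simplified rather than the real LZ4 block format, and capping a match at the window boundary is an assumption made here so that every window starts at a fixed address.

```python
from typing import Dict, List, Tuple

MIN_MATCH = 4  # LZ4's minimum match length

# Simplified token: (literals, offset, match_length); offset 0 means "no match".
Token = Tuple[bytes, int, int]


def compress(data: bytes, pws: int = 8, max_match: int = 8) -> List[Token]:
    """Toy model: fixed windows of `pws` bytes, at most one match per window."""
    table: Dict[bytes, int] = {}  # 4-byte sequence -> most recent position
    tokens: List[Token] = []
    for start in range(0, len(data), pws):  # window starts never depend on matches
        end = min(start + pws, len(data))
        best = None  # the single match kept for this window
        for pos in range(start, end):
            key = data[pos:pos + MIN_MATCH]
            if len(key) < MIN_MATCH:
                break  # too close to the end of the input to form a match
            cand = table.get(key)
            table[key] = pos
            if cand is None or best is not None:
                continue  # no candidate, or this window already has its match
            # Assumption: the match is capped by max_match and by the window end.
            limit = min(max_match, end - pos)
            length = 0
            while length < limit and data[cand + length] == data[pos + length]:
                length += 1
            if length >= MIN_MATCH:
                best = (pos, pos - cand, length)
        if best is None:
            tokens.append((data[start:end], 0, 0))
        else:
            pos, offset, length = best
            tokens.append((data[start:pos], offset, length))
            tail = data[pos + length:end]
            if tail:
                tokens.append((tail, 0, 0))  # bytes after the match stay literal
    return tokens


def decompress(tokens: List[Token]) -> bytes:
    """Inverse of compress(); copies byte by byte so overlapping matches work."""
    out = bytearray()
    for literals, offset, length in tokens:
        out += literals
        for _ in range(length):
            out.append(out[-offset])
    return bytes(out)
```

Because the bound on each match is known in advance, the start of the next window never depends on how far the previous match extended, which is the software analogue of a feed-forward datapath. The price is that longer or additional matches inside a window are left as literals, which is consistent with the compression ratio reduction the summary reports.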
### Formula summary

The main formula in this paper is the definition of the compression ratio:

\[ \text{Compression ratio}=\frac{\text{Average size of all original files}}{\text{Average size of all compressed files}} \]

### Conclusion

By limiting the number of matches per parallelization window and the maximum match length, this paper removes the throughput bottlenecks in hardware implementations of the LZ4 compression algorithm and significantly improves compression speed, at the cost of some compression ratio.
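The compression ratio definition from the formula summary can be made concrete with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and `relative_reduction` assumes that the reported 4.93% to 11.68% figures are relative reductions against the software LZ4 ratio, which the summary does not state explicitly.

```python
from statistics import mean
from typing import Sequence


def compression_ratio(original_sizes: Sequence[int],
                      compressed_sizes: Sequence[int]) -> float:
    """Average original file size divided by average compressed file size."""
    return mean(original_sizes) / mean(compressed_sizes)


def relative_reduction(baseline_ratio: float, new_ratio: float) -> float:
    """Percentage by which new_ratio falls below baseline_ratio (assumed metric)."""
    return (baseline_ratio - new_ratio) / baseline_ratio * 100.0
```

Since both averages are taken over the same set of files, this ratio equals the total original size divided by the total compressed size, so large files weigh more heavily than they would in an average of per-file ratios.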

A High-Throughput Hardware Accelerator for Lempel-Ziv 4 Compression Algorithm

Hardware Implementation of Fast Huffman Coding Based on Different Sorting Methods

BeeZip: Towards an Organized and Scalable Architecture for Data Compression

Design and Optimization of Zstandard Algorithm Based on Concurrent Streaming of Multiple Hash Tables

Data Compression and Storage under High Speed Network

MetaZip: a high-throughput and efficient accelerator for DEFLATE

FPGA Acceleration of Zstd Compression Algorithm

HybriDC: A Resource-Efficient CPU-FPGA Heterogeneous Acceleration System for Lossless Data Compression

A Hardware Implementation of Real Time Lossless Data Compression and Decompression Circuits

GPULZ: Optimizing LZSS Lossless Compression for Multi-byte Data on Modern GPUs

UH-JLS: A Parallel Ultra-High Throughput JPEG-LS Encoding Architecture for Lossless Image Compression

An Efficient High-Throughput LZ77-Based Decompressor in Reconfigurable Logic

MetaZip

A High Compression Efficiency Hardware Encoder for Intra and Inter Coding with 4k@30fps Throughput

Streaming Sorting Network Based BWT Acceleration on FPGA for Lossless Compression.

Hardware implementation of transform and quantization for AVS encoder

A Versatile Compression Method for Floating-Point Data Stream

A High Throughput and Energy Efficient Lepton Hardware Encoder with Hash-based Memory Optimization

FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs

FPGA bitstream compression and decompression using LZ and golomb coding (abstract only).

Refine and Recycle: A Method to Increase Decompression Parallelism.