SortComp (Sort-and-compress) - Towards a Universal Lossless Compression Scheme for Matrix and Tabular Data

Xizhe Cheng, Sian-Jheng Lin, Jie Sun
DOI: https://doi.org/10.1109/dcc52660.2022.00046
2022-01-01
Abstract: A universal scheme is proposed for the lossless compression of two-dimensional tables and matrices. Instead of standard row- or column-based compression, we propose to sort each column first and record both the sorted table and the corresponding table of sorting permutations. These two tables are then compressed separately. In this new scheme, both intra- and inter-column correlations can be captured efficiently, giving rise to an improved compression ratio, in particular when column-wise and row-wise dependencies co-occur. The scheme reduces the compression of an arbitrary two-dimensional table to that of a ‘permutation table’ together with a ‘sorted table’, where the former depends only on the table dimension and the latter can be effectively compressed column by column using predictive methods. Based on this scheme, a new algorithm, SortComp (sort-and-compress), is proposed. For correlated columns, we give an estimate of the asymptotic bit rate of the algorithm and compare it to that of column-oriented compression schemes. Numerical experiments on real-life CSV datasets validate the advantages of SortComp over existing row- and column-oriented compression algorithms.
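The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sort_columns`, `restore`, and `delta_encode_column` are hypothetical, integer cells are assumed, and plain delta coding stands in for the unspecified "predictive methods". The actual entropy coding of the two tables is omitted.

```python
from typing import List, Tuple

Table = List[List[int]]  # row-major table of integers (an assumption of this sketch)


def sort_columns(table: Table) -> Tuple[Table, Table]:
    """Split a table into a column-sorted table and a permutation table."""
    n_rows = len(table)
    n_cols = len(table[0]) if table else 0
    sorted_tab = [[0] * n_cols for _ in range(n_rows)]
    perm_tab = [[0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        # Stable argsort of column c: order[k] is the original row index
        # of the k-th smallest value in that column.
        order = sorted(range(n_rows), key=lambda r: table[r][c])
        for k, r in enumerate(order):
            sorted_tab[k][c] = table[r][c]
            perm_tab[k][c] = r
    return sorted_tab, perm_tab


def restore(sorted_tab: Table, perm_tab: Table) -> Table:
    """Invert sort_columns: put every sorted value back in its original row."""
    n_rows = len(sorted_tab)
    n_cols = len(sorted_tab[0]) if sorted_tab else 0
    table = [[0] * n_cols for _ in range(n_rows)]
    for k in range(n_rows):
        for c in range(n_cols):
            table[perm_tab[k][c]][c] = sorted_tab[k][c]
    return table


def delta_encode_column(column: List[int]) -> List[int]:
    """Illustrative predictor for a sorted column: first value, then differences."""
    if not column:
        return []
    return [column[0]] + [b - a for a, b in zip(column, column[1:])]
```

For the table `[[3, 10], [1, 30], [2, 20]]`, `sort_columns` returns the sorted table `[[1, 10], [2, 20], [3, 30]]` and the permutation table `[[1, 0], [2, 2], [0, 1]]`, and `restore` recovers the original table from the pair, so the split is lossless. Because every sorted column is non-decreasing, its deltas are non-negative and often small, which is one reason a sorted table lends itself to predictive coding.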