HitIct: Chinese Corpus for the Evaluation of Lossless Compression Algorithms

常为领,云晓春,方滨兴,王树鹏
DOI: https://doi.org/10.3321/j.issn:1000-436x.2009.03.007
2009-01-01
Abstract: HitIct, a Chinese corpus, based on ANSI code, for the evaluation of lossless compression algorithms, is proposed. Following the principles of application representativeness, complementarity, and openness, a large number of candidate files were collected from the Internet. Average compression ratio, average correlation coefficient, compression ratio correlation coefficient, and standard deviation were then used to select the files that give the most accurate indication of the overall performance of compression algorithms. Experimental results show that the collection is representative and stable, and that it can serve as a supplementary test set to the main benchmark for comparing compression methods.
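The selection statistics named in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not give their formulas, compressors, or thresholds. The sketch assumes three standard-library compressors as stand-ins, defines compression ratio as compressed size divided by original size, and uses the Pearson correlation between one file's ratios and the average ratios over all candidates. The function names and the use of GBK as the "ANSI" encoding for Chinese text are likewise assumptions made for illustration.

```python
import bz2
import lzma
import zlib
from statistics import mean, pstdev

# Stand-in compressors (an assumption; the paper's algorithm set is not given here).
COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compression_ratios(data):
    """Return compressed size / original size for each stand-in compressor."""
    return [len(fn(data)) / len(data) for fn in COMPRESSORS.values()]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def score_candidates(files):
    """Score each candidate file (name -> bytes) against the whole candidate set.

    For every file this reports its mean ratio across compressors, the standard
    deviation of those ratios, and how well its per-compressor ratios track the
    average ratios over all candidates.
    """
    ratios = {name: compression_ratios(data) for name, data in files.items()}
    n = len(COMPRESSORS)
    avg_profile = [mean(r[i] for r in ratios.values()) for i in range(n)]
    return {
        name: {
            "mean_ratio": mean(r),
            "std": pstdev(r),
            "corr_with_average": pearson(r, avg_profile),
        }
        for name, r in ratios.items()
    }


if __name__ == "__main__":
    # Hypothetical candidates, encoded as GBK (assumed here as the ANSI code page).
    candidates = {
        "news.txt": ("无损压缩算法评测语料库。" * 300).encode("gbk"),
        "mixed.txt": ("HitIct 语料库 2009 test\n" * 300).encode("gbk"),
        "digits.txt": "".join(str(i * i) for i in range(3000)).encode("gbk"),
    }
    for name, stats in score_candidates(candidates).items():
        print(name, stats)
```

Under these assumptions, a file whose ratios correlate strongly with the average over all candidates would be a plausible representative, while the standard deviation gives a rough sense of how differently the compressors treat it.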