Enhancing Lossy Compression Through Cross-Field Information for Scientific Applications

Youyuan Liu, Wenqi Jia, Taolue Yang, Miao Yin, Sian Jin
2024-09-27
Abstract: Lossy compression is one of the most effective methods for reducing the size of scientific data containing multiple data fields. It reduces information density through prediction or transformation techniques to compress the data. Previous approaches use local information from a single target field when predicting target data points, limiting their potential to achieve higher compression ratios. In this paper, we identify significant cross-field correlations within scientific datasets. We propose a novel hybrid prediction model that uses a CNN to extract cross-field information and combines it with existing local field information. Our solution enhances the prediction accuracy of lossy compressors, leading to improved compression ratios without compromising data quality. We evaluate our solution on three scientific datasets, demonstrating its ability to improve compression ratios by up to 25% under specific error bounds. Additionally, our solution preserves more data details and reduces artifacts compared to baseline approaches.
Machine Learning; Artificial Intelligence; Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The problem this paper addresses is how to use cross-field information to improve the compression ratio of lossy compressors for scientific data without compromising data quality. Existing lossy compression methods usually predict each data point from the local information of a single target field, which limits the compression ratio they can reach. The authors observe that different fields in a scientific dataset are significantly correlated, and that traditional compression methods leave this cross-field information largely unused.

### Problem Background

Large-scale scientific simulations generate enormous amounts of data. For example, a Nyx cosmological simulation with a resolution of 4,096×4,096×4,096 cells can generate data snapshots of up to 2.8 TB each. Running 5 simulations with 200 snapshots each requires a total of 2.8 PB of storage. Data at this scale brings two main challenges:

1. **Storage Difficulties**: Even on supercomputers, it is very difficult to keep data of this scale on disk in full.
2. **Inefficient Transmission and I/O**: Moving this much data through file I/O or between devices takes a long time because I/O bandwidth is limited.

Lossy compression is an effective way to address these problems: it significantly reduces the data volume while keeping the introduced distortion within a controllable range. Compared with lossless compression, lossy compression provides a higher compression ratio, which makes it especially suitable for scientific data. However, most existing lossy compression methods rely on the local information of a single target field for prediction, ignoring the potential associations between different fields.
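The storage estimate in the Nyx example above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text and decimal units (1 PB = 1000 TB). The per-field size at the end rests on an assumption that is not stated in the summary: that each cell of a field is stored as a 4-byte floating-point value.

```python
# Figures quoted in the text above.
SNAPSHOT_TB = 2.8                # size of one Nyx snapshot, in terabytes
SIMULATIONS = 5
SNAPSHOTS_PER_SIMULATION = 200

total_snapshots = SIMULATIONS * SNAPSHOTS_PER_SIMULATION
total_tb = total_snapshots * SNAPSHOT_TB
total_pb = total_tb / 1000       # decimal units: 1 PB = 1000 TB

# Assumption (not from the text): one field stores a 4-byte float per cell.
cells = 4096 ** 3
field_tb = cells * 4 / 1e12

print(f"{total_snapshots} snapshots -> {total_tb:.0f} TB = {total_pb:.1f} PB")
print(f"one 4-byte field of {cells:,} cells -> {field_tb:.3f} TB")
```

The total agrees with the 2.8 PB quoted above. Under the 4-byte assumption, a single field already occupies roughly a quarter of a terabyte per snapshot.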
### Core Contributions of the Paper

To overcome the limitations of existing methods, the authors propose a new hybrid prediction model that uses a convolutional neural network (CNN) to extract cross-field information and combines it with existing local field information. Specific contributions include:

1. **Identifying Cross-field Information**: Discovering significant correlations between different fields in the same dataset, which traditional compression methods ignore.
2. **Proposing a CNN-based Model**: Designing a CNN model specifically for extracting and using cross-field information to improve prediction accuracy.
3. **Developing a Compact Model**: Effectively combining cross-field information with local information to improve overall prediction and compression performance.
4. **Evaluating Effectiveness**: Evaluating the method on multiple scientific datasets. The results show that it can increase the compression ratio by up to 25% under specific error bounds while retaining more data details and reducing artifacts.

### Method Overview

The method proposed in the paper consists of the following steps:

1. **Calculating the First-order Backward Difference of the Anchoring Field**: This difference is the input used to predict the first-order backward difference of the target field.
2. **Using the CNN Model for Cross-field Prediction**: The CNN learns from the first-order backward difference of the anchoring field to predict the first-order backward difference of the target field.
3. **Combining with Traditional Predictors**: The cross-field prediction is combined with traditional predictors based on local information (such as the Lorenzo predictor) to form a hybrid prediction model.
4. **Encoding and Lossless Compression**: The output of the prediction stage is encoded and then losslessly compressed to produce the final compressed data.
Through this method, the authors show that cross-field information can improve lossy compression, raising compression ratios by up to 25% under specific error bounds and thereby easing the storage and I/O challenges of scientific data.
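The four steps in the Method Overview can be illustrated with a deliberately simplified, one-dimensional sketch. It is not the authors' implementation and makes several assumptions:

- A least-squares linear fit stands in for the CNN.
- The hybrid model is a fixed weight `alpha` on the cross-field term, where `alpha = 0` reduces to a 1-D Lorenzo-style predictor that reuses the previous reconstructed value.
- The anchoring field is assumed to be available unchanged to both compressor and decompressor.
- The two fields are synthetic.
- The function names, the uniform quantizer, and the use of `zlib` are illustrative choices.

```python
import random
import zlib
from array import array


def backward_difference(field):
    """First-order backward difference: d[i] = field[i] - field[i-1], with d[0] = 0."""
    return [0.0] + [field[i] - field[i - 1] for i in range(1, len(field))]


def fit_cross_field(anchor_diff, target_diff):
    """Stand-in for the CNN: least-squares fit target_diff ~ w * anchor_diff + b."""
    n = len(anchor_diff)
    mean_a = sum(anchor_diff) / n
    mean_t = sum(target_diff) / n
    cov = sum((a - mean_a) * (t - mean_t) for a, t in zip(anchor_diff, target_diff))
    var = sum((a - mean_a) ** 2 for a in anchor_diff)
    w = cov / var
    return w, mean_t - w * mean_a


def compress(target, predicted_diff, error_bound, alpha):
    """Predict each value, quantize the residual within error_bound, pack the codes."""
    codes = array("i")
    recon = []
    previous = 0.0  # last reconstructed value, as the decompressor will see it
    for value, diff in zip(target, predicted_diff):
        prediction = previous + alpha * diff
        code = round((value - prediction) / (2 * error_bound))
        previous = prediction + 2 * error_bound * code
        codes.append(code)
        recon.append(previous)
    return zlib.compress(codes.tobytes(), 9), recon


def decompress(payload, predicted_diff, error_bound, alpha):
    """Rebuild the field from the quantization codes and the same predictions."""
    codes = array("i")
    codes.frombytes(zlib.decompress(payload))
    recon = []
    previous = 0.0
    for code, diff in zip(codes, predicted_diff):
        previous = previous + alpha * diff + 2 * error_bound * code
        recon.append(previous)
    return recon


# Synthetic data: a random-walk anchoring field and a strongly correlated target.
rng = random.Random(0)
anchor, level = [], 0.0
for _ in range(10_000):
    level += rng.gauss(0.0, 1.0)
    anchor.append(level)
target = [2.0 * a + rng.gauss(0.0, 0.01) for a in anchor]

# Steps 1-2: backward differences and the cross-field prediction of the target's.
anchor_diff = backward_difference(anchor)
w, b = fit_cross_field(anchor_diff, backward_difference(target))
predicted_diff = [w * d + b for d in anchor_diff]

# Steps 3-4: local-only versus hybrid prediction, then quantize and compress.
error_bound = 0.01
lorenzo_bytes, lorenzo_recon = compress(target, predicted_diff, error_bound, alpha=0.0)
hybrid_bytes, hybrid_recon = compress(target, predicted_diff, error_bound, alpha=1.0)
print("local only:", len(lorenzo_bytes), "bytes")
print("hybrid    :", len(hybrid_bytes), "bytes")
```

Both variants keep every reconstructed value within `error_bound` of the original. The only difference is how small the quantization codes are, and therefore how well they compress. On this synthetic pair the cross-field term removes most of the residual, but real fields are less cleanly correlated, so the gap should not be read as representative of the paper's results.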