Scaling Training Data with Lossy Image Compression

Katherine L. Mentzer, Andrea Montanari
2024-07-25
Abstract: Empirically-determined scaling laws have been broadly successful in predicting the evolution of large machine learning models with training data and number of parameters. As a consequence, they have been useful for optimizing the allocation of limited resources, most notably compute time. In certain applications, storage space is an important constraint, and data format needs to be chosen carefully as a consequence. Computer vision is a prominent example: images are inherently analog, but are always stored in a digital format using a finite number of bits. Given a dataset of digital images, the number of bits $L$ to store each of them can be further reduced using lossy data compression. This, however, can degrade the quality of the model trained on such images, since each example has lower resolution. In order to capture this trade-off and optimize storage of training data, we propose a 'storage scaling law' that describes the joint evolution of test error with sample size and number of bits per image. We prove that this law holds within a stylized model for image compression, and verify it empirically on two computer vision tasks, extracting the relevant parameters. We then show that this law can be used to optimize the lossy compression level. At given storage, models trained on optimally compressed images present a significantly smaller test error with respect to models trained on the original data. Finally, we investigate the potential benefits of randomizing the compression level.
Computer Vision and Pattern Recognition, Information Theory, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to store and use training data under a limited storage budget by applying lossy image compression, with the goal of improving the performance of machine learning models. Specifically, the authors study how to balance the image compression rate (i.e., the number of bits \(L\) required for each image) against the number of training samples (\(n\)) in computer vision tasks, so as to achieve the minimum test error within a given storage space.

### Main problems

1. **How does the performance of models trained on lossy-compressed images compare with that of models trained on uncompressed images?**
2. **What is the optimal compression level under given storage constraints?**
3. **How does lossy compression affect the data scaling law?**

### Research background

As the scale of machine learning training expands, data storage becomes an important issue. For example, the LAION-5B dataset occupies approximately 100 TB, and industrial datasets can reach petabytes (PB). Neural network scaling laws show that increasing the number of training samples predictably reduces the error rate, but adding samples is only possible within the available storage.

### Main contributions of the paper

The authors propose a "storage scaling law" that describes the joint evolution of the test error with the number of samples and the number of bits per image. They prove that this law holds in a simplified image compression model and verify it empirically on two computer vision tasks. They also show how to use this law to optimize the lossy compression level and thereby train a model with a smaller test error in a given storage space.

### Key formula

The paper assumes that, within an appropriate range of \(n\) and \(L\), the test error can be approximated by the following scaling law:

\[
\text{Err}_{\text{test}}(n, L) \approx \text{Err}^*_{\text{test}} + A \cdot n^{-\alpha} + B \cdot L^{-\beta}
\]

where \(\text{Err}^*_{\text{test}}\) is the minimum error that the model can reach, \(A\) and \(B\) are constants, and \(\alpha\) and \(\beta\) are scaling exponents.

### Experimental results

The authors verify this scaling law on computer vision tasks (image classification, semantic segmentation, and object detection). The experimental results show that the dependence of the test error on \(n\) and \(L\) conforms to this law, and that optimizing the compression level can significantly reduce the test error within a given storage space.

### Conclusion

By introducing the "storage scaling law", the authors provide a systematic method for selecting the optimal lossy compression level, thereby maximizing model performance under limited storage resources. At a fixed storage budget, compressing more heavily and storing more images can yield a lower test error than training on the original data.

### Related work

This research extends existing scaling laws by treating the number of bits per sample as a relevant resource. Previous studies mainly focused on the influence of model complexity and the number of training samples on the test error, while this paper explores data scaling in storage-constrained settings.
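
### Worked example: choosing \(L\) at a fixed storage budget

To illustrate how the scaling law from the Key formula section could be used to pick a compression level, the sketch below assumes that total storage is simply \(S = n \cdot L\) bits (an assumption made here for illustration, not a statement taken from the paper). Substituting \(n = S / L\) into the law and setting the derivative with respect to \(L\) to zero gives \(L^{\alpha+\beta} = \beta B S^{\alpha} / (\alpha A)\). The function names and all numeric parameter values are hypothetical; they are not the fitted values reported by the authors.

```python
def scaling_law_error(n, L, err_star, A, alpha, B, beta):
    """Storage scaling law: Err* + A * n^(-alpha) + B * L^(-beta)."""
    return err_star + A * n ** (-alpha) + B * L ** (-beta)


def error_at_budget(S, L, err_star, A, alpha, B, beta):
    """Predicted error when a budget of S bits is spent on n = S / L images.

    Assumption for this sketch: storage is exactly n * L bits.
    """
    return scaling_law_error(S / L, L, err_star, A, alpha, B, beta)


def optimal_bits_per_image(S, A, alpha, B, beta):
    """Closed-form minimizer of error_at_budget over L (continuous relaxation)."""
    return (beta * B * S ** alpha / (alpha * A)) ** (1.0 / (alpha + beta))


# Hypothetical parameters, chosen only to make the example concrete.
params = dict(err_star=0.05, A=2.0, alpha=0.5, B=30.0, beta=1.0)
S = 1e9  # storage budget in bits (hypothetical)

L_opt = optimal_bits_per_image(S, params["A"], params["alpha"],
                               params["B"], params["beta"])
n_opt = S / L_opt
err_opt = error_at_budget(S, L_opt, **params)

# Compare against spending twice or half as many bits per image.
err_more_bits = error_at_budget(S, 2.0 * L_opt, **params)
err_fewer_bits = error_at_budget(S, 0.5 * L_opt, **params)

print(f"L* = {L_opt:.0f} bits/image, n* = {n_opt:.0f} images")
print(f"error at L*: {err_opt:.4f}, at 2L*: {err_more_bits:.4f}, "
      f"at L*/2: {err_fewer_bits:.4f}")
```

In practice \(L\) is controlled indirectly through a codec setting and the parameters \(A, B, \alpha, \beta\) would have to be fitted to measured test errors, so the closed form is best read as a starting point for a search over the available compression levels.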