Abstract: Pre-training and transfer learning are important building blocks of current computer vision systems. While pre-training is usually performed on large real-world image datasets, in this paper we ask whether this is truly necessary. To this end, we search for a minimal, purely synthetic pre-training dataset that allows us to achieve performance similar to the 1 million images of ImageNet-1k. We construct such a dataset from a single fractal with perturbations. With this, we contribute three main findings. (i) We show that pre-training is effective even with minimal synthetic images, with performance on par with large-scale pre-training datasets like ImageNet-1k for full fine-tuning. (ii) We investigate the single parameter with which we construct artificial categories for our dataset. We find that while the shape differences can be indistinguishable to humans, they are crucial for obtaining strong performance. (iii) Finally, we investigate the minimal requirements for successful pre-training. Surprisingly, we find that a substantial reduction of synthetic images from 1k to 1 can even lead to an increase in pre-training performance, a motivation to further investigate "scaling backwards". Finally, we extend our method from synthetic images to real images to see if a single real image can show a similar pre-training effect through shape augmentation. We find that the use of grayscale images and affine transformations allows even real images to "scale backwards".

What problem does this paper attempt to address?

This paper explores a core question: **Is large-scale pre-training really necessary?** Specifically, the authors question whether computer vision systems need to be pre-trained on a large number of real-world images (such as the 1 million images in ImageNet-1k). They propose a minimal synthetic pre-training dataset and test whether it can match the performance of large-scale pre-training datasets.

#### Main research objectives:

1. **Finding a minimal synthetic pre-training dataset**: Construct a minimal synthetic dataset generated from a single fractal and evaluate its performance on downstream tasks.
2. **Exploring the essence of pre-training**: Challenge the existing view that pre-training helps downstream tasks by discovering general structure in large-scale datasets; it may instead simply provide a better weight initialization.
3. **Reducing ethical and licensing issues**: If effective pre-training is possible with a small amount of data, or with purely synthetic data, the privacy and ethical issues raised by using large numbers of real images can be avoided.

#### Research background:

- **Traditional pre-training methods**: Usually rely on large-scale real-world image datasets (such as ImageNet), which contain millions of labeled images.
- **Self-supervised learning (SSL)**: In recent years, SSL has become a way to pre-train without manual annotation, but it still requires a large amount of image data.
- **Synthetic data pre-training**: Some studies have shown that pre-training can be carried out on generated synthetic images, reducing the dependence on real images.

#### Research contributions:

1. **Introducing the 1p-frac dataset**: A minimal synthetic dataset generated from a single fractal and its perturbations.
2. **Proposing the local perturbation cross-entropy (LPCE) loss function**: Used for pre-training on a single fractal, so that the neural network learns to distinguish small perturbations.
3. **Experimental verification**: Multiple experiments show that 1p-frac can match large-scale pre-training datasets (such as ImageNet-1k) on some downstream tasks, and even outperform them in some cases.

#### Experimental results:

- **Performance on CIFAR-100 and ImageNet-1k**: Models pre-trained with 1p-frac perform better on these benchmarks than models trained from scratch, and come close to or exceed models pre-trained on large-scale datasets.
- **Shape augmentation of a single real image**: Further experiments show that a similar pre-training effect can also be obtained by applying geometric transformations (such as affine and elastic transformations) to a single real image.

In general, this paper challenges the traditional large-scale pre-training paradigm by proposing and validating the minimal synthetic dataset 1p-frac, and it suggests new directions for future research.
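To make the idea of "a single fractal and its perturbations" concrete, here is a minimal sketch in plain Python. It rests on assumptions that the summary above does not confirm: that the fractal is an iterated function system (IFS) rendered with the chaos game, and that each artificial category is obtained by adding small uniform noise to the linear part of the maps. `BASE_IFS`, `perturb_ifs`, `render`, and `make_dataset` are hypothetical names chosen for illustration, and the base fractal here is simply a Sierpinski-style triangle, not the one used in the paper.

```python
import random

# Hypothetical base fractal: an iterated function system (IFS) of three
# affine maps (a, b, c, d, e, f): x' = a*x + b*y + e, y' = c*x + d*y + f.
BASE_IFS = [
    (0.5, 0.0, 0.0, 0.5, 0.0, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.5, 0.0),
    (0.5, 0.0, 0.0, 0.5, 0.25, 0.5),
]


def perturb_ifs(ifs, delta, rng):
    """Add uniform noise in [-delta, delta] to the linear part of each map.

    Assumption: translations (e, f) are left untouched.
    """
    out = []
    for a, b, c, d, e, f in ifs:
        a, b, c, d = (p + rng.uniform(-delta, delta) for p in (a, b, c, d))
        out.append((a, b, c, d, e, f))
    return out


def render(ifs, size=32, n_points=4000, burn_in=20, rng=None):
    """Render an IFS attractor to a size x size binary image (chaos game)."""
    rng = rng or random.Random(0)
    x, y, pts = 0.0, 0.0, []
    for i in range(n_points + burn_in):
        a, b, c, d, e, f = rng.choice(ifs)
        x, y = a * x + b * y + e, c * x + d * y + f
        if i >= burn_in:  # skip the first few points before they settle
            pts.append((x, y))
    # Normalise to the bounding box so every point lands inside the grid.
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    x0, y0 = min(xs), min(ys)
    w = (max(xs) - x0) or 1.0
    h = (max(ys) - y0) or 1.0
    img = [[0] * size for _ in range(size)]
    for px, py in pts:
        col = min(int((px - x0) / w * size), size - 1)
        row = min(int((py - y0) / h * size), size - 1)
        img[row][col] = 1
    return img


def make_dataset(n_classes=4, delta=0.1, seed=0):
    """One image per class; the label is the index of the perturbation."""
    rng = random.Random(seed)
    data = []
    for label in range(n_classes):
        ifs = perturb_ifs(BASE_IFS, delta, rng)
        data.append((render(ifs, rng=rng), label))
    return data
```

Calling `make_dataset(n_classes=4, delta=0.1)` returns four `(image, label)` pairs whose shapes differ only slightly. Training a classifier with an ordinary cross-entropy loss over these labels is one plausible reading of "learning to distinguish small perturbations"; it should not be taken as the paper's definition of LPCE.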

Scaling Backwards: Minimal Synthetic Pre-training?

Improving Fractal Pre-training

Pre-training Vision Transformers with Very Limited Synthesized Images

Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves

Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

Scaling Laws of Synthetic Images for Model Training ... for Now

Efficient Neural Network Training via Subset Pretraining

Can Synthetic Faces Undo the Damage of Dataset Bias to Face Recognition and Facial Landmark Detection?

From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

Rethinking Pre-training and Self-training

Rethinking Training from Scratch for Object Detection

Synthetic Image Data for Deep Learning

Three-Dimensional Reconstruction Pre-Training as a Prior to Improve Robustness to Adversarial Attacks and Spurious Correlation

Scaling may be all you need for achieving human-level object recognition capacity with human-like visual experience

Pre-training of Lightweight Vision Transformers on Small Datasets with Minimally Scaled Images

Training Vision Transformers with only 2040 Images.

The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better

Exploring the Limits of Weakly Supervised Pretraining

Physically-Based Rendering for Indoor Scene Understanding Using Convolutional Neural Networks

Reverse Knowledge Distillation: Training a Large Model using a Small One for Retinal Image Matching on Limited Data

If It's Not Enough, Make It So: Reducing Authentic Data Demand in Face Recognition through Synthetic Faces