Abstract: Self-supervised learning (SSL) methods have emerged as strong visual representation learners by training an image encoder to maximize similarity between features of different views of the same image. To perform this view-invariance task, current SSL algorithms rely on hand-crafted augmentations such as random cropping and color jittering to create multiple views of an image. Recently, generative diffusion models have been shown to improve SSL by providing a wider range of data augmentations. However, these diffusion models require pre-training on large-scale image-text datasets, which might not be available for many specialized domains like histopathology. In this work, we introduce Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, eliminating any reliance on external sources of supervision such as text captions. We first train an initial SSL encoder on a dataset using only hand-crafted augmentations. We then train a diffusion model conditioned on embeddings from that SSL encoder. Following training, given an embedding of the source image, this diffusion model can synthesize its diverse views. We show that these 'self-augmentations', i.e. generative augmentations based on the vanilla SSL encoder embeddings, facilitate the training of a stronger SSL encoder. Furthermore, based on the ability to interpolate between images in the encoder latent space, we introduce the novel pretext task of disentangling the two source images of an interpolated synthetic image. We validate Gen-SIS's effectiveness by demonstrating performance improvements across various downstream tasks in both natural images, which are generally object-centric, as well as digital histopathology images, which are typically context-based.

### What problem does this paper attempt to solve?

This paper addresses the limitations of hand-crafted data augmentation in self-supervised learning (SSL). SSL methods train an image encoder to maximize the similarity between features of different views of the same image, which is a view-invariance task. Current SSL algorithms create those views with hand-crafted augmentations such as random cropping and color jittering. Generative diffusion models can supply a wider range of augmentations, but they require pre-training on large-scale image-text datasets, which may not be available in specialized domains such as histopathology.

To solve this problem, the authors introduce **Gen-SIS**, a self-augmentation technique based on a generative diffusion model that is trained only on unlabeled image data and requires no external source of supervision such as text captions. In this way, Gen-SIS can generate more diverse image augmentations and so improve what the SSL model learns.

### Main contributions of Gen-SIS

1. **Introduction of Gen-SIS**: Presented as the first generative diffusion-augmented SSL method that uses only unlabeled data.
2. **Proposing a new disentanglement task**: As an additional pretext task, it enhances SSL training by disentangling the two source images of an interpolated synthetic image.
3. **Extensive evaluation**: Experiments on ImageNet-1K apply the Gen-SIS pre-trained encoder to downstream tasks such as classification, retrieval, copy detection, and video segmentation, with reported performance improvements.
4. **Extension to histopathology images**: Self-augmented SSL is applied to a domain that lacks a pre-trained generative model, demonstrating the method's effectiveness and generality.
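To make the view-invariance objective from the problem statement concrete, the following is a minimal sketch of a similarity-based loss between the embeddings of two views. It is a generic stand-in, not the loss used by Gen-SIS or by DINO (which uses a distillation-style objective); the function names and the toy embedding values are assumptions chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def view_invariance_loss(z1, z2):
    """Illustrative loss: 0 when the two views have identical direction."""
    return 1.0 - cosine_similarity(z1, z2)


# Toy embeddings (hypothetical values): two views of one image, one other image.
z_crop = [0.9, 0.1, 0.4]
z_jitter = [0.8, 0.2, 0.5]
z_other = [-0.3, 0.9, -0.1]

loss_same = view_invariance_loss(z_crop, z_jitter)
loss_other = view_invariance_loss(z_crop, z_other)
print(loss_same < loss_other)
```

Minimizing such a loss over pairs of views pushes the encoder to map augmented versions of the same image to nearby embeddings, which is why the diversity of the views matters.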
### Specific methods of the solution

The core idea of Gen-SIS is to use a generative diffusion model to enhance SSL. The specific steps are as follows:

1. **Pre-train the SSL encoder**: First, pre-train an SSL encoder (such as DINO) on real images using traditional hand-crafted augmentations.
2. **Train the embedding-conditioned LDM**: Then, use the image embeddings extracted from the initial SSL encoder to train a latent diffusion model (LDM), referred to as E-LDM.
3. **Generate self-augmented images**: After training, E-LDM can generate diverse image augmentations from the embeddings of given source images. These augmentations include:
   - **Generative augmentation**: Generate augmented images from a single source image.
   - **Interpolation augmentation**: Generate interpolated images from two source images and use them for a new disentanglement pretext task.
4. **Integrate into SSL training**: Use the generated augmented images together with the original real images for SSL training to further improve the performance of the encoder.

Through these steps, Gen-SIS improves the robustness and generalization of the SSL model and also achieves better performance in specialized domains such as histopathology.
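The two augmentation types in step 3 can be sketched as a small data-flow example. This is not the authors' implementation: it assumes linear interpolation with a fixed weight `alpha` (the source does not state the interpolation scheme), and `ConditioningRequest`, `interpolate`, and `build_requests` are hypothetical names. The diffusion model itself is left out; the sketch only builds the conditioning embeddings it would receive and records which source images each synthetic view derives from.

```python
from dataclasses import dataclass
from typing import List, Tuple

Embedding = List[float]


def interpolate(e_a: Embedding, e_b: Embedding, alpha: float) -> Embedding:
    """Linear interpolation (an assumption): alpha=0 gives e_a, alpha=1 gives e_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(e_a, e_b)]


@dataclass
class ConditioningRequest:
    kind: str                     # "generative" or "interpolation"
    embedding: Embedding          # conditioning input for the diffusion model
    source_ids: Tuple[int, ...]   # real images the synthetic view derives from


def build_requests(embeddings, pairs, alpha=0.5):
    """One generative request per image, one interpolation request per pair."""
    requests = []
    for idx, emb in enumerate(embeddings):
        requests.append(ConditioningRequest("generative", list(emb), (idx,)))
    for i, j in pairs:
        mixed = interpolate(embeddings[i], embeddings[j], alpha)
        requests.append(ConditioningRequest("interpolation", mixed, (i, j)))
    return requests


# Toy encoder outputs (hypothetical values) for three source images.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
requests = build_requests(embeddings, pairs=[(0, 1)])
for request in requests:
    print(request.kind, request.source_ids, request.embedding)
```

Keeping `source_ids` alongside each interpolated embedding is what would let a disentanglement pretext task ask the encoder to relate a synthetic image back to both of its source images.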

Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning

GenSelfDiff-HIS: Generative Self-Supervision Using Diffusion for Histopathological Image Segmentation

SSL: A Self-similarity Loss for Improving Generative Image Super-resolution

Evolutionary Augmentation Policy Optimization for Self-supervised Learning

A Probabilistic Model Behind Self-Supervised Learning

Learning Where to Learn in Cross-View Self-Supervised Learning

Delineating the Effective Use of Self-Supervised Learning in Single-Cell Genomics

Weak Augmentation Guided Relational Self-Supervised Learning

Training Data Synthesis with Difficulty Controlled Diffusion Model

MixDiff: Mixing Natural and Synthetic Images for Robust Self-Supervised Representations

Giga-SSL: Self-Supervised Learning for Gigapixel Images

You Don't Need Data-Augmentation in Self-Supervised Learning

Adapting Self-Supervised Learning for Computational Pathology

Views Can Be Deceiving: Improved SSL Through Feature Space Augmentation

MSR: Making Self-supervised learning Robust to Aggressive Augmentations

On Learning Discriminative Features from Synthesized Data for Self-Supervised Fine-Grained Visual Recognition

SGCL: Spatial guided contrastive learning on whole-slide pathological images

Augmentations vs Algorithms: What Works in Self-Supervised Learning

DIAGen: Diverse Image Augmentation with Generative Models

Using Self-supervised Learning Can Improve Model Fairness