Abstract: Contrastive learning has been applied effectively in natural language processing, where large and complex networks are trained with the help of data augmentation. Extending it to computer vision, particularly to training Vision Transformers (ViT) on small-scale Whole Slide Imaging (WSI) datasets, faces unique challenges that stem from the intrinsic characteristics of medical images. Chief among them is the need for specialized data augmentation techniques informed by domain-specific knowledge of histopathology. Designing these augmentations manually can be labor-intensive and inefficient, and can introduce bias. To overcome these issues, we introduce WSI-Contrastive Fusion (WSI-CoFu), a novel method for fine-tuning ViT models in a supervised contrastive learning framework tailored to small-scale WSI datasets. The method leverages the auto-encoder capability of Transformers to learn diverse semantic projections of histopathological images. We also propose a loss function that integrates multi-level image semantics extracted from each attention block of the ViT, guiding the model to better capture the complex structures in WSI. We demonstrate the effectiveness of WSI-CoFu by analyzing feature distributions on a hypersphere, which show the rich and diverse representations the method generates. WSI-CoFu also extends to transfer learning, where it adapts to different medical imaging tasks, such as metastasis detection or segmentation, with significantly improved accuracy and without extensive retraining. Experimental results show that WSI-CoFu achieves an accuracy of \(91.3\%\) on ICIAR2018, \(90.5\%\) on Camelyon2016, and \(92.7\%\) on Camelyon2017, outperforming both traditional supervised learning frameworks and conventional data augmentation approaches on small-scale WSI datasets. Notably, it also achieves significant gains in transfer learning on downstream datasets.
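To make the idea of a supervised contrastive loss over multi-level features more concrete, the following is a minimal NumPy sketch and not the WSI-CoFu loss itself. It assumes the standard supervised contrastive formulation (embeddings normalized onto the unit hypersphere, positives defined by shared labels) and, purely as an illustration, combines per-block losses with a weighted sum. The function names `supcon_loss` and `multi_level_supcon`, the `weights` parameter, and the temperature value are hypothetical; the abstract does not specify how the per-block semantics are fused.

```python
import numpy as np


def supcon_loss(features, labels, temperature=0.1):
    """Standard supervised contrastive loss for one batch (assumed form).

    features: (N, D) array of embeddings, N >= 2.
    labels:   (N,) integer class labels.
    """
    # Project embeddings onto the unit hypersphere.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # Exclude each anchor's comparison with itself.
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over all other samples.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives: samples with the same label, excluding the anchor.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive in this batch

    sum_pos = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -sum_pos[valid] / pos_count[valid]
    return float(per_anchor.mean())


def multi_level_supcon(block_features, labels, weights=None, temperature=0.1):
    """Hypothetical fusion: weighted sum of per-block contrastive losses.

    block_features: list of (N, D_k) arrays, one per attention block.
    """
    if weights is None:
        # Assumption: equal weight for every block.
        weights = np.full(len(block_features), 1.0 / len(block_features))
    return sum(
        w * supcon_loss(f, labels, temperature)
        for w, f in zip(weights, block_features)
    )
```

In a ViT setting, `block_features` would plausibly hold one pooled embedding per attention block for each image in the batch; how those embeddings are obtained and weighted is left open here.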

Supervised Fine-tuning in turn Improves Visual Foundation Models

Improved Visual Fine-tuning with Natural Language Supervision

Partial Fine-Tuning: A Successor to Full Fine-Tuning for Vision Transformers

Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners

Towards Compatible Fine-tuning for Vision-Language Model Updates

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

Supervised Contrastive Learning Based Fine-tuning Framework with Small-Scale WSI Dataset on ViT

Semi-Supervised Fine-Tuning of Vision Foundation Models with Content-Style Decomposition

Robust Fine-Tuning of Vision-Language Models for Domain Generalization

VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness

Multifaceted Analysis of Fine-Tuning in Deep Model for Visual Recognition

Enhancing Vision-Language Pre-training with Rich Supervisions

Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation

Visual Fourier Prompt Tuning

CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

FedTune: A Deep Dive into Efficient Federated Fine-Tuning with Pre-trained Transformers

CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet

Towards Realistic Unsupervised Fine-tuning with CLIP

Improving Vision Transformers by Revisiting High-Frequency Components

Connecting the Dots: Collaborative Fine-tuning for Black-Box Vision-Language Models

FDViT: Improve the Hierarchical Architecture of Vision Transformer