Scaling Properties of Diffusion Models for Perceptual Tasks

Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik
2024-11-13
Abstract: In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and segmentation under image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perception tasks. Through a careful analysis of these scaling behaviors, we present various techniques to efficiently train diffusion models for visual perception tasks. Our models achieve improved or comparable performance to state-of-the-art methods using significantly less data and compute. To use our code and models, see https://scaling-diffusion-perception.github.io.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper explores the scaling properties of diffusion models in visual perception tasks. Specifically, the authors study how these models behave as training and inference compute are scaled, and use the findings to improve performance across several perception tasks, including monocular depth estimation, optical flow prediction, and amodal segmentation.

### Main Contributions

1. **In-depth study of the training and inference scaling laws of diffusion models**:
   - The authors conducted detailed compute scaling studies on tasks at different levels (from low-level optical flow to mid-level depth estimation to high-level segmentation).
   - By pre-training and fine-tuning dense and mixture-of-experts models of different sizes, they studied the impact of model size, data resolution, and pre-training compute.
2. **Demonstrated how to apply the scaling laws of depth estimation to other tasks**:
   - The authors showed how the scaling laws derived from the depth estimation task can be applied to optical flow prediction and amodal segmentation, improving performance at both training and inference time.
3. **Applied efficient training strategies**:
   - The authors applied several efficient training strategies, such as converting dense model checkpoints to mixture-of-experts models (upcycling), and used different techniques for scaling compute at test time (e.g., increasing diffusion steps, test-time ensembling, and activating more model experts).
4. **Trained a general mixture-of-experts model**:
   - The authors trained a general mixture-of-experts model capable of performing all three visual perception tasks, with results improved over or comparable to state-of-the-art methods on multiple benchmarks.
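The upcycling step mentioned in the contributions can be illustrated with a small sketch. It assumes the common sparse-upcycling recipe, in which every expert starts as a copy of the dense feed-forward block and only the router is initialized from scratch; the summary does not describe the authors' exact procedure, and all names here (`upcycle`, `moe_ffn`, the parameter layout) are hypothetical.

```python
import copy
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def dense_ffn(params, x):
    """Two-layer MLP with ReLU: w2 @ relu(w1 @ x)."""
    hidden = [max(0.0, h) for h in matvec(params["w1"], x)]
    return matvec(params["w2"], hidden)


def upcycle(dense_params, num_experts, dim, seed=0):
    """Build MoE parameters by copying the dense FFN into every expert."""
    rng = random.Random(seed)
    return {
        "experts": [copy.deepcopy(dense_params) for _ in range(num_experts)],
        # Assumption: the router has no dense counterpart, so it starts
        # from small random values.
        "router": [
            [rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(num_experts)
        ],
    }


def moe_ffn(moe_params, x, top_k=2):
    """Route x to the top_k experts and mix their outputs with softmax gates."""
    logits = matvec(moe_params["router"], x)
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)
    weights = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(weights)
    out = [0.0] * len(moe_params["experts"][0]["w2"])
    for i, w in zip(chosen, weights):
        y = dense_ffn(moe_params["experts"][i], x)
        out = [o + (w / total) * yi for o, yi in zip(out, y)]
    return out
```

Because the gate weights sum to one and all experts begin identical, the upcycled model reproduces the dense model's output immediately after conversion. Under this recipe, that is what lets fine-tuning continue from the dense checkpoint rather than from scratch.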
### Method Overview

- **Generative Pre-training**:
  - Pre-trained on class-conditional image generation using a Diffusion Transformer (DiT) backbone.
  - Studied the impact of model size, mixture-of-experts models, image resolution, and pre-training compute on performance.
- **Fine-tuning**:
  - Fine-tuned the pre-trained model on downstream perception tasks.
  - Unified the visual tasks as image-to-image translation using a conditional denoising diffusion approach.
  - Studied the impact of model size, pre-training compute, image resolution, and upcycling on fine-tuning performance.
- **Test-time Compute Scaling**:
  - Scaled compute at test time by increasing diffusion steps, using test-time ensembling, and adjusting the noise variance schedule.

### Experimental Results

- **Depth Estimation**:
  - On the Hypersim dataset, the authors' model achieved validation performance comparable to Marigold while using significantly less training data and compute.
- **Optical Flow Prediction**:
  - On the FlyingChairs dataset, the authors' model achieved endpoint error comparable to specialized optical flow prediction methods when using test-time ensembling.
- **Amodal Segmentation**:
  - Trained only on the Pix2Gestalt dataset, the authors' model demonstrated competitive performance across multiple datasets.
- **General Model**:
  - The authors trained a general model that performed well across the different tasks, demonstrating the generalization ability and transferability of their approach.

### Conclusion

By systematically studying the scaling properties of diffusion models in visual perception tasks, the authors proposed several effective methods for scaling training and test-time compute. These methods not only improved model performance across tasks but also reduced the required training data and compute.
The authors hope that these findings will inspire future related research.
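
The test-time ensembling described in the method overview can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the ensemble aggregates several stochastic sampling runs with a per-pixel median, and `toy_sampler` is a hypothetical stand-in for a real diffusion sampler (it only adds Gaussian noise to a fixed target).

```python
import random
import statistics


def toy_sampler(image, seed, noise_std=0.1):
    """Hypothetical stand-in for one stochastic diffusion sampling run."""
    rng = random.Random(seed)
    return [[pixel + rng.gauss(0.0, noise_std) for pixel in row] for row in image]


def ensemble_predict(image, sampler, num_samples=8):
    """Run the sampler with different seeds and take the per-pixel median."""
    runs = [sampler(image, seed) for seed in range(num_samples)]
    height, width = len(image), len(image[0])
    return [
        [statistics.median(run[r][c] for run in runs) for c in range(width)]
        for r in range(height)
    ]
```

Calling `ensemble_predict(image, toy_sampler, num_samples=8)` spends eight sampling runs on one input. Raising `num_samples` is one way to trade extra test-time compute for a more stable prediction.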