Strong but simple: A Baseline for Domain Generalized Dense Perception by CLIP-based Transfer Learning

Christoph Hümmer, Manuel Schwonberg, Liangwei Zhou, Hu Cao, Alois Knoll, Hanno Gottschalk
2024-10-31
Abstract: Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNNs), where domain shifts occur due to synthetic data, lighting, weather, or location changes. Vision-language models (VLMs) marked a large step forward in generalization capability and have already been applied to various tasks. Very recently, the first approaches used VLMs for domain generalized segmentation and object detection and obtained strong generalization. However, all of these approaches rely on complex modules, feature augmentation frameworks, or additional models. Surprisingly, and in contrast to that, we found that simple fine-tuning of vision-language pre-trained models yields competitive or even stronger generalization results while being extremely simple to apply. Moreover, we found that vision-language pre-training consistently provides better generalization than the previous standard of vision-only pre-training. This challenges the standard of using ImageNet-based transfer learning for domain generalization. Fully fine-tuning a vision-language pre-trained model is capable of reaching the domain generalization SOTA when training on the synthetic GTA5 dataset. Moreover, we confirm this observation for object detection on a novel synthetic-to-real benchmark. We further obtain superior generalization capabilities by reaching 77.9% mIoU on the popular Cityscapes-to-ACDC benchmark. We also found improved in-domain generalization, leading to an improved SOTA of 86.4% mIoU on the Cityscapes test set, which ranks first on the leaderboard.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to improve the generalization of computer vision models to unseen target domains, particularly in semantic segmentation and object detection. Specifically, it focuses on making perception systems based on deep neural networks more robust when there is a domain gap between training and test data, such as the gap between synthetic and real-world data or between different environmental conditions.

The paper points out that existing methods usually rely on complex modules, feature augmentation frameworks, or additional models to handle these domain gaps. The authors find, however, that simply fine-tuning vision-language pre-trained models matches or even exceeds the performance of existing methods without any of these techniques. This finding challenges the standard practice of using ImageNet-based transfer learning for domain generalization.

The main contributions of the paper include:
- Demonstrating a simple CLIP-based transfer learning method that generalizes as well as or better than existing methods without using any additional modules or methods.
- Comparing the generalization ability of current state-of-the-art pre-training methods on dense perception tasks and verifying the effectiveness of vision-language pre-training.
- Reaching the state of the art (SOTA) for semantic segmentation in synthetic-to-real domain generalization and achieving significant performance improvements on multiple benchmarks.

In conclusion, the paper aims to provide a simple and effective way to improve the generalization of deep learning models across domains, especially when no target-domain data is available.
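
The segmentation results quoted in the abstract (for example 77.9% on Cityscapes-to-ACDC) are reported as mIoU, the mean over classes of intersection-over-union between predicted and ground-truth label maps. As a rough illustration of the metric only, and not the authors' evaluation code, a minimal NumPy sketch might look like the following. The function name `mean_iou`, the `ignore_index=255` default, and the choice to average only over classes that appear in the prediction or the ground truth are assumptions made for this example; the sketch also assumes every prediction is a valid class id in `[0, num_classes)`.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in pred or target.

    Hypothetical helper for illustration; assumes predictions are
    integer class ids in [0, num_classes).
    """
    # Flatten and cast to int64 so the index arithmetic below cannot overflow.
    pred = np.asarray(pred).ravel().astype(np.int64)
    target = np.asarray(target).ravel().astype(np.int64)

    # Drop pixels whose ground truth is marked as "ignore".
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection

    # Average only over classes that actually appear, to avoid dividing by zero.
    present = union > 0
    return float((intersection[present] / union[present]).mean())


if __name__ == "__main__":
    pred = np.array([[0, 1], [1, 1]])
    target = np.array([[0, 1], [0, 255]])  # 255 marks an ignored pixel
    print(mean_iou(pred, target, num_classes=2))
```

Benchmark protocols typically accumulate the confusion matrix over the whole evaluation set before computing per-class IoU, rather than averaging per-image scores; the same function can be used that way by passing concatenated predictions and labels.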