Convolutional Networks as Extremely Small Foundation Models: Visual Prompting and Theoretical Perspective

Jianqiao Wangni
2024-09-03
Abstract: Compared to deep neural networks trained for specific tasks, foundational deep networks trained on generic datasets such as ImageNet classification benefit from larger-scale datasets, simpler network structures and easier training techniques. In this paper, we design a prompting module which performs few-shot adaptation of generic deep networks to new tasks. Driven by learning theory, we derive prompting modules that are as simple as possible, as they generalize better under the same training error. We use video object segmentation as a case study. We give a concrete prompting module, the Semi-parametric Deep Forest (SDForest), which combines several nonparametric methods, such as correlation filters, random forests and image-guided filters, with a deep network trained for the ImageNet classification task. From a learning-theoretical point of view, all these models have significantly smaller VC dimension or complexity and so tend to generalize better, as long as empirical studies show that the training error of this simple ensemble is comparable to that of an end-to-end trained deep network. We also propose a novel method of analyzing generalization under the setting of video object segmentation to make the bound tighter. In practice, SDForest has extremely low computation cost and runs in real time even on CPU. We test on video object segmentation tasks and achieve performance on DAVIS2016 and DAVIS2017 that is competitive with purely deep learning approaches, without any training or fine-tuning.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to solve the trade-off problem between model complexity and generalization error in the Video Object Segmentation (VOS) task. Specifically, the author attempts to design a lightweight prompting module so that a general-purpose deep network can perform few-shot adaptation on new tasks, thereby achieving good performance while maintaining low computational cost.

#### Main problems and solutions

1. **Model complexity and generalization error**:
   - The paper explores the advantages of simple models (such as decision trees and random forests) over complex deep neural networks. According to Occam's Razor, simple models usually have better generalization ability under the same training error.
   - The author proposes to use a simple Semi-parametric Deep Forest (SDForest) in combination with non-parametric methods (such as correlation filters, random forests, image-guided filters, etc.) to reduce model complexity and improve generalization ability.
2. **Real-time performance and accuracy in video object segmentation**:
   - The video object segmentation task requires classifying each pixel and assigning it to different objects or the background. Existing deep learning methods are accurate but computationally expensive, and it is difficult for them to achieve real-time processing.
   - The SDForest method proposed by the author not only achieves competitive performance on the DAVIS2016 and DAVIS2017 datasets but also can achieve real-time processing on the CPU, demonstrating its efficiency.
3. **Zero-shot and few-shot learning**:
   - Inspired by the success of zero-shot and few-shot prompting in pre-trained large models, the paper attempts to create a method with as little learning as possible, involving only test data.
   - By transferring the feature extractor from an unrelated task (such as ImageNet classification) and instantaneously learning a simple model on the first frame, the method aims to avoid over-fitting.

#### Formula representation

- **Prediction confidence formula**:
  \[ P(I)[i]=\mu(h(I)[i])+\gamma\sum_{q=1}^{Q}E\{\omega_q\,\mathbb{1}(h(I)[i]\in\pi_q)\} \]
  where \(h(I)\) is the general-purpose feature extracted from the input image \(I\), \(\mathbb{1}(\cdot)\) is the indicator function, \(\pi_q\) is the feature space corresponding to the \(q\)-th leaf node, \(\omega_q\) is the objectness value assigned to the leaf node, \(\mu\) is a linear estimator, and \(\gamma\) is the weight of the forest estimator.
- **Loss function**:
  \[ L(\psi,\mu,I,y)=\frac{1}{2wh}\sum_{i}\log(1+\exp(-y[i]\mu(I)[i])) \]
- **Image-guided filter (IGF)**:
  \[ Q[i]=a_kI[i]+b_k,\quad\forall i\in\omega_k \]
  where \(\omega_k\) is the window centered on pixel \(k\), and \(a_k\) and \(b_k\) are the filtering coefficients determined by minimizing the energy function.

In conclusion, this paper addresses the trade-off between model complexity and generalization error in the video object segmentation task by combining a simple semi-parametric model with non-parametric methods, and reports advantages in both real-time performance and accuracy.
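The prediction confidence and loss formulas above can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation: the toy tree layout (`dim`, `thr`, `left`, `right`, `leaf`), the helper names, and the reading of the expectation as an average over trees are all assumptions made for illustration, as is treating \(wh\) as the number of pixels and the labels as \(\pm 1\).

```python
import math

def leaf_value(tree, feature):
    """Walk a toy decision tree (hypothetical dict layout) to the leaf the feature falls into."""
    node = tree
    while "leaf" not in node:
        branch = "left" if feature[node["dim"]] <= node["thr"] else "right"
        node = node[branch]
    return node["leaf"]  # objectness value assigned to that leaf

def predict_confidence(feature, weights, bias, forest, gamma):
    """Linear estimator plus gamma times the tree-averaged leaf value (assumed reading of E{...})."""
    linear = sum(w * x for w, x in zip(weights, feature)) + bias
    forest_avg = sum(leaf_value(t, feature) for t in forest) / len(forest)
    return linear + gamma * forest_avg

def logistic_loss(scores, labels):
    """Sum of log(1 + exp(-y * s)) over pixels, divided by 2 * (number of pixels)."""
    total = sum(math.log1p(math.exp(-y * s)) for s, y in zip(scores, labels))
    return total / (2 * len(scores))

# Toy usage: a two-dimensional per-pixel feature and a forest of two stumps.
toy_forest = [
    {"dim": 0, "thr": 0.5, "left": {"leaf": 1.0}, "right": {"leaf": -1.0}},
    {"dim": 1, "thr": 0.5, "left": {"leaf": -1.0}, "right": {"leaf": 0.5}},
]
score = predict_confidence([0.2, 0.9], [1.0, -1.0], 0.5, toy_forest, gamma=2.0)
print(score)
print(logistic_loss([score, -score], [1, -1]))
```

In this sketch each pixel is scored independently from its feature vector, which is why the per-frame cost stays low once the features have been extracted.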
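The image-guided filter formula above can likewise be sketched on a one-dimensional signal. The closed-form coefficients used here are those of the standard guided filter (local covariance over regularized local variance); whether the paper uses exactly this variant is not stated in the summary, and the function names, `radius`, and `eps` are illustrative assumptions.

```python
def box_mean(values, radius):
    """Mean over the window centered on each position, clipped at the borders."""
    n = len(values)
    out = []
    for k in range(n):
        lo, hi = max(0, k - radius), min(n, k + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def guided_filter_1d(guide, target, radius=2, eps=1e-3):
    """Smooth `target` (e.g. a raw confidence map) using the structure of `guide`."""
    mean_i = box_mean(guide, radius)
    mean_p = box_mean(target, radius)
    mean_ip = box_mean([g * t for g, t in zip(guide, target)], radius)
    mean_ii = box_mean([g * g for g in guide], radius)
    # Per-window linear coefficients: a_k = cov(I, p) / (var(I) + eps), b_k = mean(p) - a_k * mean(I)
    a = [(ip - mi * mp) / (ii - mi * mi + eps)
         for ip, mi, mp, ii in zip(mean_ip, mean_i, mean_p, mean_ii)]
    b = [mp - ak * mi for mp, ak, mi in zip(mean_p, a, mean_i)]
    # Each position lies in several windows, so average their coefficients before applying them.
    mean_a = box_mean(a, radius)
    mean_b = box_mean(b, radius)
    return [ma * g + mb for ma, g, mb in zip(mean_a, guide, mean_b)]

# Toy usage: a step edge in the guide and a noisy confidence signal.
guide = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
noisy = [0.1, -0.1, 0.2, 0.8, 1.1, 0.9]
print(guided_filter_1d(guide, noisy, radius=1))
```

A larger `eps` pushes the output toward a plain box blur of the target, while a smaller one lets edges in the guide carry through more strongly.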