Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning

Zixuan Hu, Li Shen, Zhenyi Wang, Tongliang Liu, Chun Yuan, Dacheng Tao
2023-06-19
Abstract: The goal of data-free meta-learning is to learn useful prior knowledge from a collection of pre-trained models without accessing their training data. However, existing works solve the problem only in parameter space, which (i) ignores the fruitful data knowledge contained in the pre-trained models; (ii) cannot scale to large-scale pre-trained models; and (iii) can only meta-learn pre-trained models with the same network architecture. To address these issues, we propose a unified framework, dubbed PURER, which contains: (1) ePisode cUrriculum inveRsion (ECI) during data-free meta training; and (2) invErsion calibRation following inner loop (ICFIL) during meta testing. During meta training, we propose ECI to perform pseudo episode training for learning to adapt fast to new unseen tasks. Specifically, we progressively synthesize a sequence of pseudo episodes by distilling the training data from each pre-trained model. ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner. During meta testing, we further propose a simple plug-and-play supplement, ICFIL, used only during meta testing to narrow the gap between the meta-training and meta-testing task distributions. Extensive experiments in various real-world scenarios show the superior performance of our method.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key limitations of **Data-free Meta-Learning (DFML)**. Existing DFML methods work mainly in parameter space, which leads to three main issues:

1. **Ignoring the rich data knowledge in pre-trained models**: Existing methods only combine models in parameter space and do not exploit the data knowledge that can be extracted from the pre-trained models.
2. **Inability to scale to large-scale pre-trained models**: Because a neural network is used to predict model parameters, existing methods can only be applied to small-scale pre-trained models.
3. **Only applicable to pre-trained models with the same network architecture**: Existing methods require all pre-trained models to share one architecture, which limits their use in real-world settings.

To solve these problems, the authors propose a unified framework, **PURER**, which consists of two main components:

- **Episode Curriculum Inversion (ECI)**: Performs pseudo-episode training during data-free meta-training. It synthesizes a sequence of pseudo-episodes by distilling training data from each pre-trained model, and adaptively increases the difficulty of the pseudo-episodes according to the real-time feedback of the meta-model.
- **Inversion Calibration following Inner Loop (ICFIL)**: Used only during meta-testing to narrow the gap between the meta-training and meta-testing task distributions.

With these components, PURER can use the latent data knowledge in pre-trained models without accessing the original training data, which broadens the scenarios in which DFML applies; the authors report superior performance in various real-world scenarios.

### Formula Summary

1. **Inversion Loss**:
   \[ L_{\text{inv}}(D) = \sum_{(\hat{x}, y) \in D} l(\hat{x}, y; \psi) + R_{\text{prior}}(\hat{x}) + R_{\text{feature}}(\hat{x}) \]
   where:
   - \( l(\hat{x}, y; \psi) \) is the classification loss (e.g., cross-entropy) of the pre-trained model \( \psi \).
   - \( R_{\text{prior}}(\hat{x}) = \alpha_{\text{TV}} R_{\text{TV}}(\hat{x}) + \alpha_{l2} R_{l2}(\hat{x}) \) steers the synthesized images away from unrealistic ones.
   - \( R_{\text{feature}}(\hat{x}) = \sum_l \|\mu_l(\hat{x}) - \text{BN}_l(\text{running mean})\| + \sum_l \|\sigma^2_l(\hat{x}) - \text{BN}_l(\text{running variance})\| \) penalizes the distance between the feature-map statistics of the pseudo-images and the batch-normalization running statistics recorded on the original training images.
2. **Adversarial Optimization**:
   \[ \min_{\theta} \max_{D} \mathbb{E}_{T \in D}\left[ -L_{\text{inv}}(D) + I(\Omega) \cdot L_{\text{outer}}(T; \theta) \right] \]
   where:
   - \( I(\Omega) = \begin{cases} 1, & \text{if } \Omega \text{ is positive} \\ 0, & \text{if } \Omega \text{ is negative} \end{cases} \)
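To make the terms of the two formulas concrete, here is a minimal pure-Python sketch, not the authors' implementation. It makes several assumptions that the summary does not state: images are single-channel nested lists, each layer's statistics are reduced to one scalar (mean, variance) pair, the norms in \( R_{\text{feature}} \) are absolute differences, \( R_{\text{TV}} \) is the anisotropic total variation, both regularizers are applied per pseudo-image inside the sum, and \( \Omega = 0 \) is treated as negative. All function names and dictionary keys (`image`, `label`, `logits`, `stats`) are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Classification loss l(x_hat, y; psi) for one example, from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def total_variation(img):
    """R_TV (assumed anisotropic): absolute differences between adjacent pixels."""
    h, w = len(img), len(img[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(img[i + 1][j] - img[i][j])
            if j + 1 < w:
                tv += abs(img[i][j + 1] - img[i][j])
    return tv


def l2_norm(img):
    """R_l2: Euclidean norm of the flattened image."""
    return math.sqrt(sum(p * p for row in img for p in row))


def feature_regularizer(layer_stats, bn_stats):
    """R_feature with one scalar (mean, variance) pair per layer (a simplification)."""
    return sum(
        abs(mu - run_mu) + abs(var - run_var)
        for (mu, var), (run_mu, run_var) in zip(layer_stats, bn_stats)
    )


def inversion_loss(examples, bn_stats, alpha_tv, alpha_l2):
    """L_inv(D): classification loss plus both regularizers, summed over D."""
    total = 0.0
    for ex in examples:
        prior = (alpha_tv * total_variation(ex["image"])
                 + alpha_l2 * l2_norm(ex["image"]))
        total += (cross_entropy(ex["logits"], ex["label"])
                  + prior
                  + feature_regularizer(ex["stats"], bn_stats))
    return total


def gated_objective(l_inv, l_outer, omega):
    """Bracketed term of the min-max objective: -L_inv + I(Omega) * L_outer."""
    indicator = 1.0 if omega > 0 else 0.0  # assumption: Omega == 0 counts as negative
    return -l_inv + indicator * l_outer


# Toy usage: one 2x2 pseudo-image, two classes, one layer with BN statistics.
example = {
    "image": [[0.0, 1.0], [1.0, 0.0]],
    "label": 0,
    "logits": [0.0, 0.0],   # hypothetical output of the pre-trained model
    "stats": [(0.5, 1.0)],  # hypothetical feature mean/variance at that layer
}
loss = inversion_loss([example], bn_stats=[(0.0, 1.0)], alpha_tv=0.1, alpha_l2=0.01)
```

In a real setting the logits and per-layer statistics would come from a forward pass of the pre-trained model on the pseudo-images, and the loss would be differentiated with respect to the pixels; the sketch only shows how the terms combine and how the indicator switches the outer loss on or off.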