MixtureGrowth: Growing Neural Networks by Recombining Learned Parameters

Chau Pham, Piotr Teterwak, Soren Nelson, Bryan A. Plummer

2023-11-07

Abstract: Most deep neural networks are trained under fixed network architectures and require retraining when the architecture changes. If expanding the network's size is needed, it is necessary to retrain from scratch, which is expensive. To avoid this, one can grow from a small network by adding random weights over time to gradually achieve the target network size. However, this naive approach falls short in practice as it brings too much noise to the growing process. Prior work tackled this issue by leveraging the already learned weights and training data for generating new weights through conducting a computationally expensive analysis step. In this paper, we introduce MixtureGrowth, a new approach to growing networks that circumvents the initialization overhead in prior work. Before growing, each layer in our model is generated with a linear combination of parameter templates. Newly grown layer weights are generated by using a new linear combination of existing templates for a layer. On one hand, these templates are already trained for the task, providing a strong initialization. On the other, the new coefficients provide flexibility for the added layer weights to learn something new. We show that our approach boosts top-1 accuracy over the state-of-the-art by 2-2.5% on CIFAR-100 and ImageNet datasets, while achieving comparable performance with fewer FLOPs to a larger network trained from scratch. Code is available at https://github.com/chaudatascience/mixturegrowth.
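To make the template idea in the abstract concrete, here is a minimal sketch in plain Python. It assumes a layer's weight matrix is a coefficient-weighted sum of shared templates, and that growing appends blocks built from new coefficients over the same templates. The function names `mix` and `grow`, the tiny shapes, and the 2x2 block layout are illustrative assumptions, not details confirmed by the abstract or taken from the authors' code.

```python
import random


def mix(templates, coeffs):
    """Return sum_k coeffs[k] * templates[k] as a nested-list matrix."""
    rows, cols = len(templates[0]), len(templates[0][0])
    return [
        [sum(c * t[i][j] for c, t in zip(coeffs, templates)) for j in range(cols)]
        for i in range(rows)
    ]


def grow(templates, old_coeffs, new_coeffs):
    """Double a layer in both dimensions (hypothetical layout).

    The top-left block reuses the already trained coefficients, so the
    original weights are preserved. The other three blocks are new linear
    combinations of the same trained templates.
    """
    top_left = mix(templates, old_coeffs)
    top_right, bottom_left, bottom_right = (mix(templates, c) for c in new_coeffs)
    top = [a + b for a, b in zip(top_left, top_right)]
    bottom = [a + b for a, b in zip(bottom_left, bottom_right)]
    return top + bottom


rng = random.Random(0)
# Four shared templates, each a 2x3 matrix (stand-ins for trained parameters).
templates = [
    [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2)] for _ in range(4)
]
old_coeffs = [0.5, -0.2, 0.1, 0.7]  # stand-in for learned coefficients
new_coeffs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]

small = mix(templates, old_coeffs)               # 2x3 layer before growing
large = grow(templates, old_coeffs, new_coeffs)  # 4x6 layer after growing
```

In this sketch only the coefficient vectors are new after growing. A practical implementation would use a tensor library and train both templates and coefficients; the nested lists here just keep the example dependency-free.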

Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses a practical problem in training neural networks: how to initialize newly added weights when scaling up a network, so that the larger network does not have to be retrained from scratch. The paper proposes a method called **MixtureGrowth**, which rests on the following ideas:

1. **Utilizing Template Mixing**: Each layer's weights are generated as a linear combination of shared parameter templates. New weights are generated from new linear combinations of the same, already trained templates, which avoids retraining from scratch.
2. **Optimizing Initialization Strategies**: The paper investigates several ways to initialize the new coefficients, including random coefficients, copies of existing coefficients, and orthogonal coefficients, to find the most effective one.
3. **Merging Two Small Models**: The paper explores merging two independently trained small models to improve the performance of the final model.

With these methods, MixtureGrowth achieves performance comparable to that of a larger network trained from scratch while using fewer floating-point operations (FLOPs). On the CIFAR-100 and ImageNet datasets, it improves top-1 accuracy by 2-2.5% over existing state-of-the-art growing methods.
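The three coefficient initialization strategies named above can be sketched as follows. This is one plausible reading rather than the paper's implementation: "orthogonal" is interpreted here as a random vector made orthogonal to the existing coefficient vector by a single Gram-Schmidt step, and the helper names `init_random`, `init_copy`, and `init_orthogonal` are hypothetical.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def init_random(old, rng):
    """Fresh Gaussian coefficients, ignoring the existing ones."""
    return [rng.gauss(0, 1) for _ in old]


def init_copy(old):
    """Start the new block from an exact copy of the existing coefficients."""
    return list(old)


def init_orthogonal(old, rng):
    """Random coefficients with their component along `old` removed.

    Assumes `old` is not the zero vector, since the projection divides
    by dot(old, old).
    """
    v = [rng.gauss(0, 1) for _ in old]
    scale = dot(v, old) / dot(old, old)
    return [vi - scale * oi for vi, oi in zip(v, old)]


rng = random.Random(0)
old_coeffs = [0.5, -0.2, 0.1, 0.7]  # stand-in for learned coefficients

random_coeffs = init_random(old_coeffs, rng)
copied_coeffs = init_copy(old_coeffs)
orthogonal_coeffs = init_orthogonal(old_coeffs, rng)
```

Copying reproduces the existing weights exactly in the new block, random coefficients ignore what was learned, and the orthogonal variant still draws on the trained templates while pointing in a different direction from the existing combination.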

