Abstract: We introduce a new regularization method for Artificial Neural Networks (ANNs) based on Kernel Flows (KFs). KFs were introduced as a method for kernel selection in regression/kriging based on minimizing the loss of accuracy incurred by halving the number of interpolation points in random batches of the dataset. Writing $f_\theta(x) = \big(f^{(n)}_{\theta_n}\circ f^{(n-1)}_{\theta_{n-1}} \circ \dots \circ f^{(1)}_{\theta_1}\big)(x)$ for the functional representation of the compositional structure of the ANN, the inner-layer outputs $h^{(i)}(x) = \big(f^{(i)}_{\theta_i}\circ f^{(i-1)}_{\theta_{i-1}} \circ \dots \circ f^{(1)}_{\theta_1}\big)(x)$ define a hierarchy of feature maps and kernels $k^{(i)}(x,x')=\exp(- \gamma_i \|h^{(i)}(x)-h^{(i)}(x')\|_2^2)$. When combined with a batch of the dataset, these kernels produce KF losses $e_2^{(i)}$ (the $L^2$ regression error incurred by using a random half of the batch to predict the other half) that depend on the parameters of the inner layers $\theta_1,\ldots,\theta_i$ (and on $\gamma_i$). The proposed method simply consists of aggregating a subset of these KF losses with a classical output loss. We test the proposed method on CNNs and WRNs without altering the network structure or the output classifier, and report reduced test errors, decreased generalization gaps, and increased robustness to distribution shift without a significant increase in computational complexity. We suspect that these results might be explained as follows: conventional training only employs a linear functional (a generalized moment) of the empirical distribution defined by the dataset and can be prone to trapping in the Neural Tangent Kernel regime (under over-parameterization), whereas the proposed loss function (defined as a nonlinear functional of the empirical distribution) effectively trains the underlying kernel defined by the CNN beyond regressing the data with that kernel.
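To make the KF loss described in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It builds the kernel $k^{(i)}$ from a batch of inner-layer features, uses a random half of the batch to predict the other half by kernel regression, and returns the squared prediction error as $e_2^{(i)}$. The function names, the `nugget` stabilization term, the normalization by the number of held-out points, and the weights in `aggregate_loss` are all assumptions introduced for illustration; the abstract does not specify them.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """k(x, x') = exp(-gamma * ||h(x) - h(x')||_2^2), rows of a and b are features."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kf_l2_loss(features, targets, gamma, rng, nugget=1e-6):
    """Sketch of e_2: error of predicting half a batch from the other half.

    features: (n, d) inner-layer outputs h^{(i)}(x) for one batch.
    targets:  (n, c) regression targets, e.g. one-hot labels.
    nugget:   assumed diagonal regularizer for numerical stability.
    """
    n = features.shape[0]
    perm = rng.permutation(n)
    half = n // 2
    train, held = perm[:half], perm[half:]

    k_tt = rbf_kernel(features[train], features[train], gamma)
    k_ht = rbf_kernel(features[held], features[train], gamma)

    # Kernel interpolant fitted on the random half, evaluated on the rest.
    coeffs = np.linalg.solve(k_tt + nugget * np.eye(half), targets[train])
    pred = k_ht @ coeffs

    # Mean squared L2 error over held-out points (normalization is an assumption).
    return float(np.mean(np.sum((targets[held] - pred) ** 2, axis=1)))


def aggregate_loss(output_loss, kf_losses, weights):
    """Combine a classical output loss with weighted KF losses (weights assumed)."""
    return output_loss + sum(w * e for w, e in zip(weights, kf_losses))


# Usage on synthetic data standing in for a batch of layer outputs.
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))                # hypothetical h^{(i)}(x) for 8 samples
y = np.eye(2)[rng.integers(0, 2, size=8)]  # one-hot targets for 2 classes
e2 = kf_l2_loss(h, y, gamma=0.5, rng=rng)
total = aggregate_loss(output_loss=1.0, kf_losses=[e2], weights=[0.1])
```

In actual training the same computation would be written in an automatic-differentiation framework so that gradients of $e_2^{(i)}$ flow back to $\theta_1,\ldots,\theta_i$ and $\gamma_i$; the NumPy version only illustrates the forward computation.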

Module-wise Training of Neural Networks via the Minimizing Movement Scheme

Block-wise Training of Residual Networks via the Minimizing Movement Scheme

A Framework for Provably Stable and Consistent Training of Deep Feedforward Networks

Accelerated Training via Incrementally Growing Neural Networks using Variance Transfer and Learning Rate Adaptation

Explicit Regularization via Regularizer Mirror Descent

Train Faster, Perform Better: Modular Adaptive Training in Over-Parameterized Models

Scalable Optimization in the Modular Norm

Deep regularization and direct training of the inner layers of Neural Networks with Kernel Flows

Greedy Layer-Wise Training of Long Short Term Memory Networks

Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks

Regularization Neural Networks Via Constrained Virtual Movement Field

Go beyond End-to-End Training: Boosting Greedy Local Learning with Context Supply

Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization

Layerwise Optimization by Gradient Decomposition for Continual Learning

Multitask Learning With Enhanced Modules

Accelerated Gradient-free Neural Network Training by Multi-convex Alternating Optimization

Speeding up the Training of Neural Networks with the One-Step Procedure

Efficient Neural Network Training Via Forward and Backward Propagation Sparsification

Budgeted Training: Rethinking Deep Neural Network Training Under Resource Constraints

Two steps at a time -- taking GAN training in stride with Tseng's method