Meta-Learning with Warped Gradient Descent.

Sebastian Flennerhag, Andrei A. Rusu, Razvan Pascanu, Francesco Visin, Hujun Yin, Raia Hadsell
DOI: https://doi.org/10.48550/arxiv.1909.00025
2020-01-01
Abstract: Learning an efficient update rule from data that promotes rapid learning of new tasks from the same distribution remains an open problem in meta-learning. Typically, previous works have approached this issue either by attempting to train a neural network that directly produces updates or by attempting to learn better initialisations or scaling factors for a gradient-based update rule. Both of these approaches pose challenges. On one hand, directly producing an update forgoes a useful inductive bias and can easily lead to non-converging behaviour. On the other hand, approaches that try to control a gradient-based update rule typically resort to computing gradients through the learning process to obtain their meta-gradients, leading to methods that can not scale beyond few-shot task adaptation. In this work, we propose Warped Gradient Descent (WarpGrad), a method that intersects these approaches to mitigate their limitations. WarpGrad meta-learns an efficiently parameterised preconditioning matrix that facilitates gradient descent across the task distribution. Preconditioning arises by interleaving non-linear layers, referred to as warp-layers, between the layers of a task-learner. Warp-layers are meta-learned without backpropagating through the task training process in a manner similar to methods that learn to directly produce updates. WarpGrad is computationally efficient, easy to implement, and can scale to arbitrarily large meta-learning problems. We provide a geometrical interpretation of the approach and evaluate its effectiveness in a variety of settings, including few-shot, standard supervised, continual and reinforcement learning.
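As a rough illustration of what a preconditioning matrix does to gradient descent, the sketch below runs the update theta <- theta - lr * P * grad on a toy ill-conditioned quadratic. Everything in it is an assumption made for illustration: the loss, the diagonal preconditioner, the step sizes and the function names are hypothetical, and the preconditioner is fixed by hand, whereas WarpGrad meta-learns it through warp-layers.

```python
# Toy sketch (not the paper's method): gradient descent with a fixed,
# hand-chosen diagonal preconditioner on an ill-conditioned quadratic.

def loss(theta, curvature):
    # Quadratic loss 0.5 * sum_i c_i * theta_i^2.
    return 0.5 * sum(c * t * t for c, t in zip(curvature, theta))

def grad(theta, curvature):
    # Gradient of the quadratic loss above: c_i * theta_i per coordinate.
    return [c * t for c, t in zip(curvature, theta)]

def descend(theta, curvature, precond, lr, steps):
    # Preconditioned update: theta <- theta - lr * P * grad,
    # where P is diagonal and stored as a list of per-coordinate scales.
    for _ in range(steps):
        g = grad(theta, curvature)
        theta = [t - lr * p * gi for t, p, gi in zip(theta, precond, g)]
    return theta

curvature = [100.0, 1.0]   # one steep direction, one shallow direction
theta0 = [1.0, 1.0]

# Plain gradient descent: the step size is limited by the steep direction
# (it must stay below 2 / 100 for the iteration to remain stable).
plain = descend(theta0, curvature, [1.0, 1.0], lr=0.015, steps=50)

# Preconditioned descent: scaling each coordinate by 1 / c_i equalises the
# curvature, so a much larger step size remains stable.
warped = descend(theta0, curvature, [1.0 / c for c in curvature], lr=0.5, steps=50)

print("plain loss: ", loss(plain, curvature))
print("warped loss:", loss(warped, curvature))
```

In this toy setting the preconditioned run makes progress in both directions at the same rate, while the plain run is held back by the shallow direction. The sketch only shows the effect of a given preconditioner; how WarpGrad obtains one across a task distribution is described in the paper itself.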