Learning to learn by gradient descent by gradient descent

NIPS 2016 (notes dated 8-24-2020)


Motivation

Current optimization algorithms are still largely designed by hand. This paper shows how the design of an optimization algorithm can itself be cast as a learning problem, allowing the algorithm to automatically learn to exploit structure in the problems of interest.

Idea:

Use meta-learning to learn the optimization method itself (e.g., gradient descent).

This paper tries to replace the optimizers normally used for neural networks (e.g., Adam, RMSprop, SGD) with a recurrent neural network (RNN). Gradient-based optimization is fundamentally a sequence of updates over timesteps, in between which a state (e.g., momentum buffers) must be stored. We can therefore think of an optimizer as a mini-RNN. The idea in this paper is to actually train that RNN instead of using a generic algorithm like Adam or SGD.
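The "optimizer as a mini-RNN" view can be made concrete: any standard optimizer is just a function mapping (gradient, state) to (update, new state). A minimal sketch with SGD-with-momentum as the hand-designed instance that the paper proposes to replace with a learned function (the toy quadratic and all hyperparameter values are illustrative assumptions):

```python
def sgd_momentum_step(grad, state, lr=0.1, beta=0.9):
    """Hand-designed update rule: maps (gradient, state h_t) -> (update g_t, state h_{t+1}).
    The paper replaces this fixed rule with a learned RNN m with parameters phi."""
    velocity = beta * state + grad     # h_{t+1}: accumulated momentum
    return -lr * velocity, velocity    # g_t and the new state

# Unroll the rule on a toy quadratic f(theta) = theta^2, exactly like an RNN over timesteps.
theta, h = 5.0, 0.0
for _ in range(50):
    grad = 2 * theta                   # gradient of f at theta
    g, h = sgd_momentum_step(grad, h)
    theta = theta + g                  # theta_{t+1} = theta_t + g_t
```

The point is only the interface: once the optimizer is written as a stateful step function, swapping in a parameterized, trainable step function is a small conceptual change.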

There are 2 distinct neural nets, or parameterized functions. The first one is the task-specific neural net, or the optimizee. This is the neural network that performs the original task at hand, which can be anything ranging from regression to image classification. The weights of this neural network are updated by another neural network, called the optimizer.

The loss of the optimizer is the sum of the losses of the optimizee as it learns. The paper includes some notion of weighting but gives a weight of 1 to everything, so that it is indeed just the sum.

$$\mathcal{L}(\phi)=\mathbb{E}_{f}\left[\sum_{t=1}^{T} w_{t}\, f\left(\theta_{t}\right)\right]$$

where

$$\begin{aligned} \theta_{t+1} &= \theta_{t}+g_{t} \\ \begin{bmatrix} g_{t} \\ h_{t+1} \end{bmatrix} &= m\left(\nabla_{t}, h_{t}, \phi\right) \\ \nabla_{t} &= \nabla_{\theta} f\left(\theta_{t}\right) \end{aligned}$$
  • wtw_twt​ is arbitrary weight for each timestep. We will use wt=1w_t=1wt​=1 for all timesteps.

  • f(β‹…)f(\cdot)f(β‹…) is the optimizee function, ΞΈt\theta_tΞΈt​ is its parameters at time ttt .

  • m(β‹…)m(\cdot)m(β‹…) is the optimizer function, Ο•\phiΟ• is its parameters, hth_tht​ is its state at time ttt .

  • gtg_tgt​ is the update it outputs at time ttt .

The loss of the optimizer neural net is simply the summed training loss of the optimizee as it is trained by the optimizer. The optimizer is applied coordinatewise with shared parameters: it takes in the gradient of the current coordinate of the optimizee as well as its previous state, and outputs a suggested update that we hope will reduce the optimizee's loss as fast as possible.
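A hedged PyTorch sketch of this meta-training loop, using a tiny coordinatewise LSTM as $m$ and a toy quadratic as the optimizee $f$ (the task, network sizes, and learning rates are all illustrative assumptions; also, unlike the paper, which drops second-order terms through $\nabla_t$, this sketch backpropagates through the gradients via `create_graph=True`):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy optimizee task: f(theta) = ||W theta - y||^2 (an illustrative stand-in).
W = torch.randn(10, 10)
y = torch.randn(10)

def f(theta):
    return ((W @ theta - y) ** 2).sum()

class LSTMOptimizer(nn.Module):
    """The optimizer m(grad, h, phi): a coordinatewise LSTM with shared parameters phi."""
    def __init__(self, hidden=20):
        super().__init__()
        self.lstm = nn.LSTMCell(1, hidden)   # input: one coordinate's gradient
        self.out = nn.Linear(hidden, 1)      # output: that coordinate's update g_t

    def forward(self, grad, state):
        h, c = self.lstm(grad, state)        # grad has shape (n_coords, 1)
        return self.out(h), (h, c)

opt_net = LSTMOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)

def meta_loss(T=20):
    """L(phi) = sum_t f(theta_t) with w_t = 1, unrolling T optimizee steps."""
    theta = torch.zeros(10, requires_grad=True)
    state, total = None, 0.0
    for _ in range(T):
        loss = f(theta)
        total = total + loss
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        g, state = opt_net(grad.unsqueeze(1), state)
        theta = theta + g.squeeze(1)         # theta_{t+1} = theta_t + g_t
    return total

for step in range(5):                        # a few meta-training steps on phi
    meta_opt.zero_grad()
    L = meta_loss()
    L.backward()                             # gradient of L(phi) w.r.t. phi
    meta_opt.step()
```

The key design point is that the optimizee's parameter updates stay inside the autograd graph, so `L.backward()` differentiates the entire unrolled optimization trajectory with respect to the optimizer's parameters $\phi$.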

NOTE:

Use meta-learning to learn algorithms themselves (as in learned anomaly-detection algorithms).

Connection to other papers: related to the steepest gradient descent material previously presented by Xiaofan as well.

The idea is to run gradient descent on $\phi$ to minimize $\mathcal{L}(\phi)$, which could give us an optimizer that is capable of optimizing $f$ efficiently.

Reference

http://papers.nips.cc/paper/6461-learning-to-learn-by-gradient-descent-by-gradient-descent.pdf
https://medium.com/dataseries/learning-to-learn-gradient-descent-by-gradient-descent-a-paper-review-44292f2fb1ff
https://becominghuman.ai/paper-repro-learning-to-learn-by-gradient-descent-by-gradient-descent-6e504cc1c0de
https://www.slideshare.net/KatyLee4/learning-to-learn-by-gradient-descent-by-gradient-descent-78412835
https://blog.acolyer.org/2017/01/04/learning-to-learn-by-gradient-descent-by-gradient-descent/