Learning to learn by gradient descent by gradient descent

NIPS 2016 (notes dated 8-24-2020)


Motivation

Current optimization algorithms are still largely designed by hand. This paper shows how the design of an optimization algorithm can itself be cast as a learning problem, allowing the algorithm to automatically learn to exploit structure in the problems of interest.

Idea:

Use meta-learning to learn the optimization method itself (e.g., gradient descent).

This paper tries to replace the optimizers normally used for neural networks (e.g., Adam, RMSprop, SGD) with a recurrent neural network (RNN). Gradient-based optimization is fundamentally a sequence of updates over timesteps, in between which a state (e.g., momentum buffers) must be stored. We can therefore think of an optimizer as a mini-RNN. The idea in this paper is to actually train that RNN instead of using a generic algorithm like Adam or SGD.
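The "optimizer as a mini-RNN" view can be made concrete: any standard optimizer is just a function mapping (gradient, state) to (update, new state). A minimal sketch with SGD-with-momentum as the hand-designed instance that the paper proposes to replace with a learned function (the toy quadratic and all hyperparameter values are illustrative assumptions):

```python
def sgd_momentum_step(grad, state, lr=0.1, beta=0.9):
    """Hand-designed update rule: maps (gradient, state h_t) -> (update g_t, state h_{t+1}).
    The paper replaces this fixed rule with a learned RNN m with parameters phi."""
    velocity = beta * state + grad     # h_{t+1}: accumulated momentum
    return -lr * velocity, velocity    # g_t and the new state

# Unroll the rule on a toy quadratic f(theta) = theta^2, exactly like an RNN over timesteps.
theta, h = 5.0, 0.0
for _ in range(50):
    grad = 2 * theta                   # gradient of f at theta
    g, h = sgd_momentum_step(grad, h)
    theta = theta + g                  # theta_{t+1} = theta_t + g_t
```

The point is only the interface: once the optimizer is written as a stateful step function, swapping in a parameterized, trainable step function is a small conceptual change.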

There are 2 distinct neural nets, or parameterized functions. The first one is the task-specific neural net, or the optimizee. This is the neural network that performs the original task at hand, which can be anything ranging from regression to image classification. The weights of this neural network are updated by another neural network, called the optimizer.

The loss of the optimizer is the sum of the losses of the optimizee as it learns. The paper includes some notion of weighting but gives a weight of 1 to everything, so that it is indeed just the sum.

$$\mathcal{L}(\phi)=\mathbb{E}_{f}\left[\sum_{t=1}^{T} w_{t}\, f\left(\theta_{t}\right)\right]$$

where

$$\begin{aligned} \theta_{t+1} &= \theta_{t}+g_{t} \\ \begin{bmatrix} g_{t} \\ h_{t+1} \end{bmatrix} &= m\left(\nabla_{t}, h_{t}, \phi\right) \\ \nabla_{t} &= \nabla_{\theta} f\left(\theta_{t}\right) \end{aligned}$$
  • wtw_twt​ is arbitrary weight for each timestep. We will use wt=1w_t=1wt​=1 for all timesteps.

  • f(β‹…)f(\cdot)f(β‹…) is the optimizee function, ΞΈt\theta_tΞΈt​ is its parameters at time ttt .

  • m(β‹…)m(\cdot)m(β‹…) is the optimizer function, Ο•\phiΟ• is its parameters, hth_tht​ is its state at time ttt .

  • gtg_tgt​ is the update it outputs at time ttt .

The loss of the optimizer neural net is simply the summed training loss of the optimizee as it is trained by the optimizer. The optimizer is applied coordinatewise with shared parameters: it takes in the gradient of the current coordinate of the optimizee as well as its previous state, and outputs a suggested update that we hope will reduce the optimizee's loss as fast as possible.
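A hedged PyTorch sketch of this meta-training loop, using a tiny coordinatewise LSTM as $m$ and a toy quadratic as the optimizee $f$ (the task, network sizes, and learning rates are all illustrative assumptions; also, unlike the paper, which drops second-order terms through $\nabla_t$, this sketch backpropagates through the gradients via `create_graph=True`):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy optimizee task: f(theta) = ||W theta - y||^2 (an illustrative stand-in).
W = torch.randn(10, 10)
y = torch.randn(10)

def f(theta):
    return ((W @ theta - y) ** 2).sum()

class LSTMOptimizer(nn.Module):
    """The optimizer m(grad, h, phi): a coordinatewise LSTM with shared parameters phi."""
    def __init__(self, hidden=20):
        super().__init__()
        self.lstm = nn.LSTMCell(1, hidden)   # input: one coordinate's gradient
        self.out = nn.Linear(hidden, 1)      # output: that coordinate's update g_t

    def forward(self, grad, state):
        h, c = self.lstm(grad, state)        # grad has shape (n_coords, 1)
        return self.out(h), (h, c)

opt_net = LSTMOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)

def meta_loss(T=20):
    """L(phi) = sum_t f(theta_t) with w_t = 1, unrolling T optimizee steps."""
    theta = torch.zeros(10, requires_grad=True)
    state, total = None, 0.0
    for _ in range(T):
        loss = f(theta)
        total = total + loss
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        g, state = opt_net(grad.unsqueeze(1), state)
        theta = theta + g.squeeze(1)         # theta_{t+1} = theta_t + g_t
    return total

for step in range(5):                        # a few meta-training steps on phi
    meta_opt.zero_grad()
    L = meta_loss()
    L.backward()                             # gradient of L(phi) w.r.t. phi
    meta_opt.step()
```

The key design point is that the optimizee's parameter updates stay inside the autograd graph, so `L.backward()` differentiates the entire unrolled optimization trajectory with respect to the optimizer's parameters $\phi$.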

NOTE:

Use meta-learning to learn algorithms themselves (as in learned anomaly-detection algorithms).

Connection to other papers: related to the steepest gradient descent material previously presented by Xiaofan as well.

The idea is to run gradient descent on $\phi$ to minimize $\mathcal{L}(\phi)$, which could give us an optimizer that is capable of optimizing $f$ efficiently.

Reference

http://papers.nips.cc/paper/6461-learning-to-learn-by-gradient-descent-by-gradient-descent.pdf
https://medium.com/dataseries/learning-to-learn-gradient-descent-by-gradient-descent-a-paper-review-44292f2fb1ff
https://becominghuman.ai/paper-repro-learning-to-learn-by-gradient-descent-by-gradient-descent-6e504cc1c0de
https://www.slideshare.net/KatyLee4/learning-to-learn-by-gradient-descent-by-gradient-descent-78412835
https://blog.acolyer.org/2017/01/04/learning-to-learn-by-gradient-descent-by-gradient-descent/