Meta-Learning with Implicit Gradient

NeurIPS 2019

8/18/2020

https://papers.nips.cc/paper/8306-meta-learning-with-implicit-gradients

Motivation

MAML has several limitations: its meta-learning process requires higher-order derivatives, imposes a non-trivial computational and memory burden, and can suffer from vanishing gradients. These limitations make it hard to scale optimization-based meta-learning methods to tasks involving medium or large datasets, or tasks that require many inner-loop optimization steps.

Contributions

The development of the implicit MAML (iMAML) algorithm, an approach for optimization-based meta-learning with deep neural networks that removes the need for differentiating through the optimization path.

The algorithm aims to learn a set of parameters such that an optimization algorithm that is initialized at and regularized to this parameter vector leads to good generalization for a variety of learning tasks.

Few-shot supervised learning and MAML (Bi-level optimization)

The goal of meta-learning is to learn meta-parameters that produce good task-specific parameters after adaptation.

Notations:

  • $\theta^*_{ML}$: the optimal meta-learned parameters

  • $M$: the number of tasks in meta-train; $i$ is the index of task $i$

  • $\mathcal{D}^{tr}_i$: support set and $\mathcal{D}^{test}_i$: query set of task $i$

  • $\mathcal{L}(\phi, \mathcal{D})$: loss function for parameter vector $\phi$ on dataset $\mathcal{D}$

  • $\phi_i = \mathcal{A}lg(\theta, \mathcal{D}^{tr}_{i}) = \theta - \alpha\nabla_{\theta}\mathcal{L}(\theta,\mathcal{D}^{tr}_i)$: one (or multiple) steps of gradient descent initialized at $\theta$. [inner level of MAML]
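As a toy illustration of this inner step (the function names and the quadratic task loss below are illustrative, not from the paper):

```python
import numpy as np

def inner_update(theta, grad_fn, alpha):
    """One gradient-descent step from the meta-initialization theta.

    grad_fn(phi) returns the gradient of the support-set loss
    L(phi, D_i^tr) at phi (a hypothetical interface for illustration).
    """
    return theta - alpha * grad_fn(theta)

# Toy quadratic task: L(phi) = 0.5 * ||phi - c||^2, so grad L = phi - c.
c = np.array([1.0, -2.0])
theta = np.zeros(2)
phi_i = inner_update(theta, lambda phi: phi - c, alpha=0.5)
# With alpha = 0.5, phi_i moves halfway from theta toward the task optimum c.
```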

Proximal regularization in the inner level

In settings such as ill-conditioned optimization landscapes and medium-shot learning, we may want to take many gradient steps, which exposes two challenges in MAML:

  • We need to store and differentiate through the long optimization path of $\mathcal{A}lg$, imposing a considerable computation and memory burden.

  • The dependence of the model parameters $\{\phi_i\}$ on the meta-parameters $\theta$ shrinks and vanishes as the number of gradient steps in $\mathcal{A}lg$ grows, making meta-learning difficult.

A more explicitly regularized algorithm is considered, which solves the inner problem with a proximal term that keeps $\phi$ close to the meta-parameters $\theta$:

$\mathcal{A}lg^{\star}(\theta) = \underset{\phi'}{\arg\min}\ \mathcal{L}(\phi', \mathcal{D}^{tr}) + \frac{\lambda}{2}\|\phi' - \theta\|^2$
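Concretely, the regularized inner problem is $\underset{\phi'}{\arg\min}\ \hat{\mathcal{L}}_i(\phi') + \frac{\lambda}{2}\|\phi'-\theta\|^2$. A minimal sketch of solving it with plain gradient descent (the solver choice and the toy quadratic loss are illustrative, not the paper's exact setup):

```python
import numpy as np

def proximal_inner_solve(theta, grad_fn, lam, alpha=0.1, steps=100):
    """Approximately solve argmin_phi L_hat(phi) + lam/2 * ||phi - theta||^2
    by gradient descent on the regularized objective."""
    phi = theta.copy()
    for _ in range(steps):
        # Gradient of the regularized objective: grad L_hat + lam * (phi - theta)
        phi = phi - alpha * (grad_fn(phi) + lam * (phi - theta))
    return phi

# Toy quadratic: L_hat(phi) = 0.5 * ||phi - c||^2, so grad L_hat = phi - c.
# The regularized minimizer is (c + lam * theta) / (1 + lam) in closed form.
c = np.array([2.0, 0.0])
theta = np.zeros(2)
phi = proximal_inner_solve(theta, lambda p: p - c, lam=1.0)
# phi converges to c / 2 = [1.0, 0.0], pulled toward theta by the proximal term.
```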

New Bi-level optimization problem after proximal regularization

Simplify the notation:

$\mathcal{L}_i(\phi) := \mathcal{L}(\phi,\mathcal{D}_i^{test}), \quad \hat{\mathcal{L}}_i(\phi) := \mathcal{L}(\phi,\mathcal{D}_i^{tr}), \quad \mathcal{A}lg_i(\theta) := \mathcal{A}lg(\theta,\mathcal{D}^{tr}_i)$

With this notation, the bi-level meta-learning problem (more general) is

$\theta^*_{ML} = \underset{\theta}{\arg\min}\ F(\theta), \quad F(\theta) = \frac{1}{M}\sum_{i=1}^{M}\mathcal{L}_i\left(\mathcal{A}lg^{\star}_i(\theta)\right) \quad (4)$

where the inner level is $\mathcal{A}lg^{\star}_i(\theta) = \underset{\phi'}{\arg\min}\ \hat{\mathcal{L}}_i(\phi') + \frac{\lambda}{2}\|\phi'-\theta\|^2$.

Total and Partial Derivatives:

($\boldsymbol{d}$ denotes the total derivative; $\nabla$ denotes the partial derivative.)

$\boldsymbol{d}_{\theta}\mathcal{L}_i(\mathcal{A}lg_i(\theta)) = \frac{d\mathcal{A}lg_i(\theta)}{d\theta}\nabla_{\phi}\mathcal{L}_i(\phi)\big|_{\phi=\mathcal{A}lg_i(\theta)} = \frac{d\mathcal{A}lg_i(\theta)}{d\theta}\nabla_{\phi}\mathcal{L}_i(\mathcal{A}lg_i(\theta))$
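A quick numerical sanity check of this chain rule on a toy problem (the linear adaptation map and quadratic loss below are made up for illustration; in the paper's convention the factor $\frac{d\mathcal{A}lg}{d\theta}$ acts as the transposed Jacobian, which is $W^T$ here):

```python
import numpy as np

# Toy setup: Alg(theta) = W @ theta, and L(phi) = 0.5 * ||phi||^2,
# so grad_phi L = phi and the total gradient is W.T @ phi.
W = np.array([[1.0, 2.0], [0.0, 1.0]])
theta = np.array([1.0, -1.0])

phi = W @ theta
total_grad = W.T @ phi  # chain rule: (dAlg/dtheta) applied to grad_phi L

# Central finite differences on F(theta) = 0.5 * ||W theta||^2 as a check.
eps = 1e-6
fd = np.zeros(2)
for j in range(2):
    tp, tm = theta.copy(), theta.copy()
    tp[j] += eps
    tm[j] -= eps
    fd[j] = (0.5 * np.sum((W @ tp) ** 2) - 0.5 * np.sum((W @ tm) ** 2)) / (2 * eps)
# total_grad and fd agree, confirming the chain-rule expression.
```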

Implicit MAML

Goal: to solve the bi-level meta-learning problem in Eq. (4) using an iterative gradient-based algorithm of the form:

$\theta \leftarrow \theta - \eta\, d_{\theta}F(\theta)$

Specifically:

$\theta \leftarrow \theta - \eta \frac{1}{M} \sum_{i=1}^{M} \frac{d \mathcal{A}lg_{i}^{\star}(\theta)}{d \theta} \nabla_{\phi} \mathcal{L}_{i}\left(\mathcal{A}lg_{i}^{\star}(\theta)\right)$

Note: many iterative gradient-based algorithms and other optimization methods could be used here.

Meta-Gradient Computation

In theory

Theoretically, we can compute the meta-gradient term $\frac{d \mathcal{A}lg_{i}^{\star}(\theta)}{d \theta}$ exactly using the following lemma.

Lemma 1 (Implicit Jacobian): Consider $\mathcal{A}lg_i^{\star}(\theta)$ as defined in Eq. 4 for task $\mathcal{T}_i$. Let $\phi_i = \mathcal{A}lg_i^{\star}(\theta)$ be the result of the inner optimization. If $\left(I + \frac{1}{\lambda}\nabla^2_{\phi}\hat{\mathcal{L}}_i(\phi_i)\right)$ is invertible, then the Jacobian is

$\frac{d \mathcal{A}lg_{i}^{\star}(\theta)}{d \theta} = \left(I + \frac{1}{\lambda}\nabla^2_{\phi}\hat{\mathcal{L}}_i(\phi_i)\right)^{-1} \quad (6)$

Implicit Jacobian
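For a quadratic inner loss, Lemma 1 can be checked in closed form; below is a small numerical verification (the quadratic and its numbers are illustrative):

```python
import numpy as np

# Quadratic inner loss L_hat(phi) = 0.5 * phi^T A phi has Hessian A,
# so Lemma 1 predicts dAlg*/dtheta = (I + A / lam)^{-1}.
lam = 2.0
A = np.array([[2.0, 0.0], [0.0, 4.0]])
I = np.eye(2)

jacobian = np.linalg.inv(I + A / lam)  # Lemma 1, Eq. 6

# Closed-form inner solution for this quadratic:
#   Alg*(theta) = argmin 0.5 phi^T A phi + lam/2 ||phi - theta||^2
#   => (A + lam I) phi = lam theta  =>  phi = lam (A + lam I)^{-1} theta,
# so differentiating directly gives dphi/dtheta = lam (A + lam I)^{-1}.
jac_direct = lam * np.linalg.inv(A + lam * I)
# jacobian and jac_direct agree, confirming the lemma on this example.
```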

In practice

The theoretical solution faces two issues in practice:

  • The meta-gradient requires computing $\mathcal{A}lg_{i}^{\star}(\theta)$, the exact solution to the inner optimization problem; in practice, only an approximation can be obtained.

  • Explicitly forming and inverting the matrix in Eq. 6 to compute the Jacobian may be intractable for large deep networks.

For 1), we consider an approximate solution to the inner optimization problem, which can be obtained with iterative optimization algorithms like gradient descent.

For 2), we perform a partial or approximate matrix inversion.
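For the approximate inversion, the paper solves $\left(I + \frac{1}{\lambda}\nabla^2_{\phi}\hat{\mathcal{L}}_i\right) g_i = \nabla_{\phi}\mathcal{L}_i$ with the conjugate gradient (CG) method, which only needs Hessian-vector products and never forms the matrix. A generic CG sketch (the toy diagonal Hessian at the bottom is illustrative):

```python
import numpy as np

def cg_solve(matvec, b, iters=20, tol=1e-10):
    """Conjugate gradient for M x = b, where matvec(v) computes M v
    (here M = I + H / lam via Hessian-vector products, never formed)."""
    x = np.zeros_like(b)
    r = b - matvec(x)       # initial residual
    p = r.copy()            # initial search direction
    rs = r @ r
    for _ in range(iters):
        Mp = matvec(p)
        step = rs / (p @ Mp)
        x += step * p
        r -= step * Mp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy check: H = diag(2, 4), lam = 2, so M = I + H/lam = diag(2, 3).
lam = 2.0
H = np.diag([2.0, 4.0])
matvec = lambda v: v + (H @ v) / lam
b = np.array([2.0, 3.0])
g = cg_solve(matvec, b)  # solves diag(2, 3) g = [2, 3], so g = [1, 1]
```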

Some Questions:

Use Figure 1 to explain the differences between MAML, first-order MAML, and implicit MAML. Appendix A might be helpful for this.

I will write a separate answer to this question later, including intuition and mathematical details.

