Meta-Learning with Implicit Gradient

NIPS 2019


8/18/2020

Motivation

MAML has some important limitations: its meta-learning process requires higher-order derivatives, imposes a non-trivial computational and memory burden, and can suffer from vanishing gradients. These limitations make it harder to scale optimization-based meta-learning methods to tasks involving medium or large datasets, or those that require many inner-loop optimization steps.

Contributions

The development of the implicit MAML (iMAML) algorithm, an approach for optimization-based meta-learning with deep neural networks that removes the need for differentiating through the optimization path.

The algorithm aims to learn a set of parameters such that an optimization algorithm that is initialized at and regularized to this parameter vector leads to good generalization for a variety of learning tasks.

Few-shot supervised learning and MAML (Bi-level optimization)

Notations:

Proximal regularization in the inner level

In cases like ill-conditioned optimization landscapes and medium-shot learning, we may want to take many inner gradient steps, which exposes two challenges in MAML: the full optimization path must be stored and differentiated through, and the dependence of the adapted parameters on the initialization shrinks as the number of steps grows.

A more explicitly regularized algorithm is therefore considered: instead of a fixed number of gradient steps, the inner level solves for the minimizer of the task training loss plus a proximal term $\frac{\lambda}{2}\|\phi' - \theta\|^2$ that keeps the adapted parameters close to the meta-parameters.
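A minimal sketch of this proximally regularized inner loop in PyTorch, assuming a generic `support_loss` callable that evaluates $\hat{\mathcal{L}}_i$ on the task's support set (all names here are illustrative, not from the authors' code):

```python
import torch

def proximal_inner_loop(meta_params, support_loss, lam=1.0, inner_lr=1e-2, num_steps=50):
    """Approximately solve  min_phi  L_hat(phi) + (lam / 2) * ||phi - theta||^2
    by plain gradient descent, starting from the meta-parameters theta."""
    # phi starts at the meta-initialization but is optimized as a separate set of tensors
    phi = [p.detach().clone().requires_grad_(True) for p in meta_params]
    for _ in range(num_steps):
        task_loss = support_loss(phi)  # L_hat_i(phi) on the support set D_i^tr
        prox = sum(((p - t.detach()) ** 2).sum() for p, t in zip(phi, meta_params))
        obj = task_loss + 0.5 * lam * prox  # proximally regularized inner objective
        grads = torch.autograd.grad(obj, phi)
        with torch.no_grad():
            for p, g in zip(phi, grads):
                p -= inner_lr * g
    return phi  # approximate Alg*(theta) for this task
```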

New Bi-level optimization problem after proximal regularization

Simplify the notation:

Total and Partial Derivatives:

Implicit MAML

Goal: to solve the bi-level meta-learning problem in Eq. (4) using an iterative gradient-based algorithm of the form:

Specifically:

Note: many iterative gradient-based algorithms and other optimization methods could be used here.
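A sketch of the corresponding outer loop, continuing the inner-loop sketch above. Here `meta_gradient_fn` stands in for whatever per-task meta-gradient approximation is used (e.g., the conjugate-gradient procedure discussed below); all names are illustrative:

```python
import torch

def outer_step(meta_params, task_batch, meta_gradient_fn, meta_lr=1e-3):
    """One iMAML-style meta-update:
    theta <- theta - eta * (1/M) * sum_i g_i, where g_i approximates the task meta-gradient."""
    avg_grads = [torch.zeros_like(p) for p in meta_params]
    for task in task_batch:
        g_i = meta_gradient_fn(meta_params, task)  # approximate meta-gradient for task i
        for a, g in zip(avg_grads, g_i):
            a += g / len(task_batch)
    with torch.no_grad():
        for p, g in zip(meta_params, avg_grads):
            p -= meta_lr * g
    return meta_params
```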

Meta-Gradient Computation

In theory

According to the stationary point conditions, we have:

which is an implicit equation.

When the derivative exists:

Implicit Jacobian

In practice

Two issues with the theoretical solution in practice:

  • The meta-gradient requires $\mathcal{A}lg_i^{\star}(\theta)$, the exact solution of the inner optimization problem, which can only be approximated.

  • Explicitly forming and inverting the matrix in Eq. (6) to compute the Jacobian may be intractable for large deep networks.

1) For the first issue, we consider an approximate solution to the inner optimization problem, obtained with iterative optimization algorithms like gradient descent (red).

2) For the second issue, we perform a partial or approximate matrix inversion (green).
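A minimal sketch of how such an approximate inversion can be done with the conjugate gradient method and Hessian-vector products, so that the matrix $\big(I + \frac{1}{\lambda}\nabla^2_{\phi}\hat{\mathcal{L}}_i(\phi_i)\big)$ is never formed explicitly (a generic CG routine on flattened parameters under illustrative names, not the authors' implementation):

```python
import torch

def hessian_vector_product(loss, params, vec):
    """Compute (d^2 loss / d params^2) @ vec via double backprop, never forming the Hessian."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat_grad @ vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])

def conjugate_gradient(matvec, b, num_iters=20, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A, given only the map x -> A x."""
    x = torch.zeros_like(b)
    r = b.clone()                      # residual b - A x (x starts at zero)
    p = r.clone()
    rs_old = r @ r
    for _ in range(num_iters):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def approx_meta_gradient(support_loss_value, test_grad, phi, lam):
    """g_i ~= (I + (1/lam) H)^{-1} grad_phi L_i(phi_i), with H the support-loss Hessian at phi_i."""
    matvec = lambda v: v + hessian_vector_product(support_loss_value, phi, v) / lam
    return conjugate_gradient(matvec, test_grad)
```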

Some Questions:

Use Figure 1 to explain the differences between MAML, first-order MAML, and implicit MAML. Appendix A might be helpful for this.

I will write a separate answer to this question later, including the intuition and the math details.

In Section 3.1, it talks about the high memory cost when the number of gradient steps is large for MAML. What is the memory cost with respect to the number of gradient steps?

Using an iterative algorithm (gradient descent) for the inner-loop optimization has a drawback: the meta-gradient depends explicitly on the optimization path, which has to be fully stored in memory, quickly becoming intractable when the number of gradient steps needed is large.
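A toy illustration of that memory behaviour, assuming a simple quadratic inner loss (illustrative only): MAML-style unrolling keeps every inner iterate in the autograd graph via `create_graph=True`, so memory grows with the number of steps, while the implicit approach can run the inner loop detached and only needs the final iterate (consistent with the $2\cdot\mathrm{Mem}(\nabla\hat{\mathcal{L}}_i)$ figure from Theorem 1 stated later in these notes).

```python
import torch

def unrolled_adapt(theta, num_steps, lr=0.1):
    """MAML-style: every intermediate phi stays in the graph, so memory grows with num_steps."""
    phi = theta
    for _ in range(num_steps):
        inner_loss = (phi ** 2).sum()                    # stand-in for L_hat(phi)
        g, = torch.autograd.grad(inner_loss, phi, create_graph=True)
        phi = phi - lr * g                               # chained to the whole optimization path
    return phi

def implicit_adapt(theta, num_steps, lr=0.1):
    """iMAML-style: the inner loop runs detached; only the final phi is kept."""
    phi = theta.detach().clone()
    for _ in range(num_steps):
        phi = phi - lr * 2 * phi                         # gradient of the same quadratic loss
    return phi                                           # meta-gradient later comes from Eq. (6), not the path

theta = torch.tensor([1.0, -2.0], requires_grad=True)
phi_unrolled = unrolled_adapt(theta, num_steps=100)      # graph holds all 100 steps
phi_implicit = implicit_adapt(theta, num_steps=100)      # no graph at all
```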

According to Theorem 1, Algorithm 2 can be implemented using at most $\tilde{O}\left(\sqrt{\kappa}\,\log\left(\frac{\mathrm{poly}(\kappa, D, B, L, \rho, \mu, \lambda)}{\epsilon}\right)\right)$ gradient computations of $\hat{\mathcal{L}}_i(\cdot)$ and $2\cdot\mathrm{Mem}(\nabla\hat{\mathcal{L}}_i)$ memory.

A detailed proof will be given later.

Please provide a review of the conjugate gradient algorithm that is used in Section 3.1.

A separate tutorial on the conjugate gradient method will be given.

According to definition 2:

In order to solve the following problem:

we have:

Possible iterative optimization solvers here include (see the sketch after this list):

  • gradient descent

  • Nesterov's accelerated gradient descent
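A minimal sketch of the second option, Nesterov's accelerated gradient descent, applied to the same regularized inner problem as the plain gradient-descent sketch earlier (a generic look-ahead formulation with illustrative names, not the paper's code):

```python
import torch

def nesterov_inner_loop(meta_params, support_loss, lam=1.0, lr=1e-2, momentum=0.9, num_steps=50):
    """Approximately solve  min_phi  L_hat(phi) + (lam / 2) * ||phi - theta||^2
    with Nesterov momentum: evaluate the gradient at a look-ahead point, then update."""
    phi = [p.detach().clone() for p in meta_params]
    velocity = [torch.zeros_like(p) for p in phi]
    for _ in range(num_steps):
        # Look-ahead point phi + momentum * velocity, where the gradient is evaluated.
        lookahead = [(p + momentum * v).requires_grad_(True) for p, v in zip(phi, velocity)]
        obj = support_loss(lookahead) + 0.5 * lam * sum(
            ((q - t.detach()) ** 2).sum() for q, t in zip(lookahead, meta_params))
        grads = torch.autograd.grad(obj, lookahead)
        velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
        phi = [p + v for p, v in zip(phi, velocity)]
    return phi  # approximate Alg*(theta), as in Definition 2
```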

According to definition 2:

Reference:

$\theta^*_{ML}$: optimal meta-learned parameters.

$M$: the number of tasks in meta-training; $i$: the index of task $i$.

$\mathcal{D}^{tr}_i$: support set and $\mathcal{D}^{test}_i$: query set of task $i$.

$\mathcal{L}(\phi, \mathcal{D})$: loss function with parameter vector $\phi$ and dataset $\mathcal{D}$.

$\phi_i = \mathcal{A}lg(\theta, \mathcal{D}^{tr}_{i}) = \theta - \alpha\nabla_{\theta}\mathcal{L}(\theta, \mathcal{D}^{tr}_i)$: one (or multiple) steps of gradient descent initialized at $\theta$ [the inner level of MAML].

We need to store and differentiate through the long optimization path of $\mathcal{A}lg$, imposing a considerable computation and memory burden.

The dependence of the model parameters $\{\phi_i\}$ on the meta-parameters $\theta$ shrinks and vanishes as the number of gradient steps in $\mathcal{A}lg$ grows, making meta-learning difficult.

$\mathcal{L}_i(\phi) := \mathcal{L}(\phi, \mathcal{D}_i^{test}), \quad \hat{\mathcal{L}}_i(\phi) := \mathcal{L}(\phi, \mathcal{D}_i^{tr}), \quad \mathcal{A}lg_i(\theta) := \mathcal{A}lg(\theta, \mathcal{D}^{tr}_i)$

($d$ denotes the total derivative, $\nabla$ denotes the partial derivative.)

$$d_{\theta}\mathcal{L}_i(\mathcal{A}lg_i(\theta)) = \frac{d\mathcal{A}lg_i(\theta)}{d\theta}\,\nabla_{\phi}\mathcal{L}_i(\phi)\Big|_{\phi=\mathcal{A}lg_i(\theta)} = \frac{d\mathcal{A}lg_i(\theta)}{d\theta}\,\nabla_{\phi}\mathcal{L}_i(\mathcal{A}lg_i(\theta))$$

$$\theta \leftarrow \theta - \eta\, d_{\theta}F(\theta)$$

$$\theta \leftarrow \theta - \eta\,\frac{1}{M}\sum_{i=1}^{M} \frac{d\mathcal{A}lg_i^{\star}(\theta)}{d\theta}\,\nabla_{\phi}\mathcal{L}_i\big(\mathcal{A}lg_i^{\star}(\theta)\big)$$

$\nabla_{\phi}\mathcal{L}_i(\mathcal{A}lg_i^{\star}(\theta))$ can be easily obtained in practice via automatic differentiation.

$\frac{d\mathcal{A}lg_i^{\star}(\theta)}{d\theta}$ presents the primary challenge: $\mathcal{A}lg_i^{\star}(\theta)$ is implicitly defined as the solution of the optimization problem in Eq. (4).

Theoretically, we can compute this meta-gradient term $\frac{d\mathcal{A}lg_i^{\star}(\theta)}{d\theta}$ exactly using the following lemma.

Lemma 1 (Implicit Jacobian). Consider $\mathcal{A}lg_i^{\star}(\theta)$ as defined in Eq. (4) for task $\mathcal{T}_i$, and let $\phi_i = \mathcal{A}lg_i^{\star}(\theta)$ be its result. If $\big(I + \frac{1}{\lambda}\nabla^2_{\phi}\hat{\mathcal{L}}_i(\phi_i)\big)$ is invertible, then the derivative Jacobian is

$$\frac{d\mathcal{A}lg_i^{\star}(\theta)}{d\theta} = \Big(I + \frac{1}{\lambda}\nabla^2_{\phi}\hat{\mathcal{L}}_i(\phi_i)\Big)^{-1} \qquad (6)$$

Proof: we drop the $i$ subscripts for convenience.

$\phi$ is the minimizer of $G(\phi', \theta)$, namely:

$$\phi = \mathcal{A}lg^{\star}(\theta) := \underset{\phi' \in \Phi}{\operatorname{argmin}}\; G(\phi', \theta), \quad \text{where } G(\phi', \theta) = \hat{\mathcal{L}}(\phi') + \frac{\lambda}{2}\|\phi' - \theta\|^2$$

The stationarity condition at $\phi$ gives

$$\nabla_{\phi'} G(\phi', \theta)\big|_{\phi'=\phi} = 0 \;\Longrightarrow\; \nabla\hat{\mathcal{L}}(\phi) + \lambda(\phi - \theta) = 0 \;\Longrightarrow\; \phi = \theta - \frac{1}{\lambda}\nabla\hat{\mathcal{L}}(\phi)$$

Differentiating both sides with respect to $\theta$ (note that $\phi$ depends on $\theta$):

$$\frac{d\phi}{d\theta} = I - \frac{1}{\lambda}\nabla^2\hat{\mathcal{L}}(\phi)\,\frac{d\phi}{d\theta} \;\Longrightarrow\; \Big(I + \frac{1}{\lambda}\nabla^2\hat{\mathcal{L}}(\phi)\Big)\frac{d\phi}{d\theta} = I \;\Longrightarrow\; \frac{d\phi}{d\theta} = \Big(I + \frac{1}{\lambda}\nabla^2\hat{\mathcal{L}}(\phi)\Big)^{-1}$$
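A quick numerical check of Lemma 1 on a quadratic $\hat{\mathcal{L}}$ (illustrative only): with $\hat{\mathcal{L}}(\phi) = \frac{1}{2}\phi^\top H\phi + b^\top\phi$, the inner problem has a closed-form minimizer, and its Jacobian with respect to $\theta$ matches $(I + \frac{1}{\lambda}H)^{-1}$.

```python
import torch

torch.manual_seed(1)
d, lam = 4, 2.0

A = torch.randn(d, d)
H = A @ A.T + torch.eye(d)          # Hessian of a quadratic stand-in for L_hat
b = torch.randn(d)

def alg_star(theta):
    """Closed-form minimizer of 0.5 phi^T H phi + b^T phi + (lam/2)||phi - theta||^2."""
    return torch.linalg.solve(H + lam * torch.eye(d), lam * theta - b)

theta = torch.randn(d)
jacobian = torch.autograd.functional.jacobian(alg_star, theta)   # d phi / d theta
lemma_rhs = torch.linalg.inv(torch.eye(d) + H / lam)             # (I + H/lam)^{-1}

print(torch.allclose(jacobian, lemma_rhs, atol=1e-4))            # True, as Lemma 1 predicts
```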

The meta-gradient requires the computation of $\mathcal{A}lg_i^{\star}(\theta)$, the exact solution to the inner optimization problem; in practice, only an approximation can be obtained.

$$\tilde{O}\left(\sqrt{\kappa}\,\log\left(\frac{\mathrm{poly}(\kappa, D, B, L, \rho, \mu, \lambda)}{\epsilon}\right)\right) \text{ gradient computations of } \hat{\mathcal{L}}_i(\cdot)$$

and $2\cdot\mathrm{Mem}(\nabla\hat{\mathcal{L}}_i)$ memory.

Please explain why $g_i$ can be obtained as an approximate solution to Problem (7).

$$\left\|g_{i} - \left(I + \frac{1}{\lambda}\nabla_{\phi}^{2}\hat{\mathcal{L}}_{i}(\phi_{i})\right)^{-1}\nabla_{\phi}\mathcal{L}_{i}(\phi_{i})\right\| \leq \delta'$$

$g_i$ is an approximation of the meta-gradient for task $i$.

$$\min_{w} f(w) = \min_{w}\; \frac{1}{2} w^{\top}\left(I + \frac{1}{\lambda}\nabla_{\phi}^{2}\hat{\mathcal{L}}_{i}(\phi_{i})\right) w - w^{\top}\nabla_{\phi}\mathcal{L}_{i}(\phi_{i}) \qquad (7)$$

$$\frac{df(w)}{dw} = \left(I + \frac{1}{\lambda}\nabla_{\phi}^{2}\hat{\mathcal{L}}_{i}(\phi_{i})\right) w - \nabla_{\phi}\mathcal{L}_{i}(\phi_{i}) = 0 \;\Longrightarrow\; w = \left(I + \frac{1}{\lambda}\nabla_{\phi}^{2}\hat{\mathcal{L}}_{i}(\phi_{i})\right)^{-1}\nabla_{\phi}\mathcal{L}_{i}(\phi_{i})$$

So $g_i$ can be obtained as an approximate solution to optimization problem (7).
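A small numerical sanity check of this on a random low-dimensional quadratic (illustrative only): the minimizer of problem (7), found here by plain gradient descent, matches the explicit $(I + \frac{1}{\lambda}\nabla^2_{\phi}\hat{\mathcal{L}}_i)^{-1}\nabla_{\phi}\mathcal{L}_i$.

```python
import torch

torch.manual_seed(0)
d, lam = 5, 2.0

A = torch.randn(d, d)
H = A @ A.T                                   # stands in for the support-loss Hessian at phi_i
grad_test = torch.randn(d)                    # stands in for grad_phi L_i(phi_i)

# Explicit meta-gradient: (I + H/lam)^{-1} grad_test.
g_exact = torch.linalg.solve(torch.eye(d) + H / lam, grad_test)

# Minimize f(w) = 0.5 * w^T (I + H/lam) w - w^T grad_test by gradient descent.
w = torch.zeros(d, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.01)
for _ in range(5000):
    opt.zero_grad()
    f = 0.5 * w @ (torch.eye(d) + H / lam) @ w - w @ grad_test
    f.backward()
    opt.step()

print(torch.allclose(w.detach(), g_exact, atol=1e-3))   # True: the argmin of (7) is the meta-gradient
```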

About Line 3 in Algorithm 2: what iterative optimization solver can be used, and how many iterations are enough so that the error is $\leq \delta$?

If Nesterov's accelerated gradient descent is used to compute $\phi$, the number of iterations could be (Theorem 2):

$$2\sqrt{\kappa}\,\log\left(8\kappa D\left(\frac{B_1}{\epsilon} + \frac{\rho}{\mu}\right)\right)$$

What is the relationship between $g_i$ in Definition 2 and the derivative Jacobian?

$$\left\|g_{i} - \left(I + \frac{1}{\lambda}\nabla_{\phi}^{2}\hat{\mathcal{L}}_{i}(\phi_{i})\right)^{-1}\nabla_{\phi}\mathcal{L}_{i}(\phi_{i})\right\| \leq \delta'$$

$g_i$ is an approximation of the meta-gradient for task $i$.

$\left(I + \frac{1}{\lambda}\nabla_{\phi}^{2}\hat{\mathcal{L}}_{i}(\phi_{i})\right)^{-1}$ is the derivative Jacobian.

https://www.inference.vc/notes-on-imaml-meta-learning-without-differentiating-through/
https://www.youtube.com/watch?v=u5BkO8XMS2I
https://papers.nips.cc/paper/8306-meta-learning-with-implicit-gradients
The goal of meta-learning is to learn meta-parameters that produce good task-specific parameters after adaptation.
bi-level meta-learning problem (more general)