Meta-Learning with Implicit Gradients
NeurIPS 2019
8/18/2020
https://papers.nips.cc/paper/8306-meta-learning-with-implicit-gradients
Motivation
MAML has several limitations. Its meta-learning process requires higher-order derivatives (differentiating through the inner optimization path), imposes a non-trivial computational and memory burden, and can suffer from vanishing gradients. These limitations make it harder to scale optimization-based meta-learning methods to tasks involving medium or large datasets, or to tasks that require many inner-loop optimization steps.
Contributions
The development of the implicit MAML (iMAML) algorithm, an approach for optimization-based meta-learning with deep neural networks that removes the need for differentiating through the optimization path.
The algorithm aims to learn a set of parameters such that an optimization algorithm that is initialized at and regularized to this parameter vector leads to good generalization for a variety of learning tasks.
Few-shot supervised learning and MAML (Bi-level optimization)

$$\theta^*_{ML} := \operatorname*{arg\,min}_{\theta} F(\theta), \qquad F(\theta) = \frac{1}{M}\sum_{i=1}^{M} \mathcal{L}\big(\mathcal{Alg}(\theta, \mathcal{D}_i^{tr}),\, \mathcal{D}_i^{test}\big)$$
Notations:
$\theta^*_{ML}$: optimal meta-learned parameters
$M$: the number of tasks in the meta-training set; $i$ is the index of task $\mathcal{T}_i$
$\mathcal{D}_i^{tr}$, $\mathcal{D}_i^{test}$: support set and query set of task $\mathcal{T}_i$
$\mathcal{L}(\phi, \mathcal{D})$: loss function evaluated with parameter vector $\phi$ on dataset $\mathcal{D}$
$\mathcal{Alg}(\theta, \mathcal{D}_i^{tr})$: one (or multiple) steps of gradient descent initialized at $\theta$. [inner level of MAML]
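To make the bi-level structure concrete, here is a minimal PyTorch sketch of MAML's inner step and outer objective (a sketch only, not the paper's code); `loss(params, data)`, the task list, and the step size `alpha` are hypothetical placeholders.

```python
# Minimal sketch of MAML's bi-level structure; `loss(params, data)` is a
# hypothetical differentiable loss over a list of parameter tensors.
import torch

def maml_inner_step(theta, support_data, loss, alpha=0.01):
    """One inner gradient step: Alg(theta) = theta - alpha * grad L_hat(theta)."""
    inner_loss = loss(theta, support_data)
    grads = torch.autograd.grad(inner_loss, theta, create_graph=True)  # keep graph for the outer loop
    return [p - alpha * g for p, g in zip(theta, grads)]

def maml_outer_loss(theta, tasks, loss, alpha=0.01):
    """Outer objective F(theta): average query loss of the adapted parameters."""
    total = 0.0
    for support_data, query_data in tasks:
        phi = maml_inner_step(theta, support_data, loss, alpha)
        total = total + loss(phi, query_data)
    return total / len(tasks)
```

Differentiating `maml_outer_loss` with respect to `theta` reproduces MAML's backpropagation through the inner update, which is exactly the step iMAML avoids.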
Proximal regularization in the inner level
In cases like ill-conditioned optimization landscapes and medium-shot learning, we may want to take many inner gradient steps, which raises two challenges for MAML:
we need to store and differentiate through the long optimization path of $\mathcal{Alg}$, imposing a considerable computation and memory burden;
the dependence of the model parameters $\{\phi_i\}$ on the meta-parameters $\theta$ shrinks and vanishes as the number of gradient steps in $\mathcal{Alg}$ grows, making meta-learning difficult.
A more explicitly regularized algorithm is considered:

$$\mathcal{Alg}^\star(\theta, \mathcal{D}_i^{tr}) = \operatorname*{arg\,min}_{\phi' \in \Phi}\; \mathcal{L}(\phi', \mathcal{D}_i^{tr}) + \frac{\lambda}{2}\big\|\phi' - \theta\big\|^2$$
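As a rough illustration (not the paper's implementation), the sketch below approximately solves this regularized inner problem with plain gradient descent; `loss` is a hypothetical differentiable training loss over a list of parameter tensors, and `lam` is the regularization strength $\lambda$.

```python
# Sketch of the proximally regularized inner problem solved by gradient descent.
import torch

def solve_inner(theta, support_data, loss, lam=1.0, steps=100, lr=0.01):
    """Approximately minimize L_hat(phi) + lam/2 * ||phi - theta||^2 over phi."""
    phi = [p.detach().clone().requires_grad_(True) for p in theta]
    opt = torch.optim.SGD(phi, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        prox = sum(((p - t.detach()) ** 2).sum() for p, t in zip(phi, theta))
        obj = loss(phi, support_data) + 0.5 * lam * prox
        obj.backward()
        opt.step()
    return phi  # approximate Alg*(theta); the optimization path is never stored
```

Note that the inner solver runs entirely outside the autograd graph of $\theta$, which is what makes the implicit-gradient approach memory-efficient.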
New Bi-level optimization problem after proximal regularization
Simplify the notation:

$$\hat{\mathcal{L}}_i(\phi) \equiv \mathcal{L}(\phi, \mathcal{D}_i^{tr}), \qquad \mathcal{L}_i(\phi) \equiv \mathcal{L}(\phi, \mathcal{D}_i^{test}), \qquad \mathcal{Alg}^\star_i(\theta) \equiv \operatorname*{arg\,min}_{\phi'}\; \hat{\mathcal{L}}_i(\phi') + \frac{\lambda}{2}\|\phi' - \theta\|^2$$

The resulting bi-level problem is

$$\theta^*_{ML} := \operatorname*{arg\,min}_{\theta}\; F(\theta), \qquad F(\theta) = \frac{1}{M}\sum_{i=1}^{M} \mathcal{L}_i\big(\mathcal{Alg}^\star_i(\theta)\big) \tag{4}$$
Total and Partial Derivatives:
($d_\theta$ denotes the total derivative, $\nabla$ denotes the partial derivative.) By the chain rule, the exact meta-gradient is

$$d_\theta F(\theta) = \frac{1}{M}\sum_{i=1}^{M} \frac{d\,\mathcal{Alg}^\star_i(\theta)}{d\theta}\, \nabla_\phi \mathcal{L}_i\big(\mathcal{Alg}^\star_i(\theta)\big)$$
Implicit MAML
Goal: to solve the bi-level meta-learning problem in Eq. (4) using an iterative gradient-based algorithm of the form:

$$\theta \leftarrow \theta - \eta\, d_\theta F(\theta)$$

Specifically:

$$\theta \leftarrow \theta - \frac{\eta}{M}\sum_{i=1}^{M} \frac{d\,\mathcal{Alg}^\star_i(\theta)}{d\theta}\, \nabla_\phi \mathcal{L}_i\big(\mathcal{Alg}^\star_i(\theta)\big)$$

Note: many iterative gradient-based algorithms and other optimization methods could be used here.
Meta-Gradient Computation
In theory
In theory, the meta-gradient can be computed exactly using the following lemma (Lemma 1, implicit Jacobian): if $\phi_i = \mathcal{Alg}^\star_i(\theta)$ and the matrix $\big(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\big)$ is invertible, then

$$\frac{d\,\mathcal{Alg}^\star_i(\theta)}{d\theta} = \left(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\right)^{-1}$$
Proof: We drop the task subscript $i$ in the proof for convenience.
$\phi = \mathcal{Alg}^\star(\theta)$ is the minimizer of $\hat{\mathcal{L}}(\phi') + \frac{\lambda}{2}\|\phi' - \theta\|^2$, namely:

$$\phi = \operatorname*{arg\,min}_{\phi'}\; \hat{\mathcal{L}}(\phi') + \frac{\lambda}{2}\|\phi' - \theta\|^2$$

According to the stationary point condition, we have:

$$\nabla_\phi \hat{\mathcal{L}}(\phi) + \lambda(\phi - \theta) = 0 \;\;\Longrightarrow\;\; \phi = \theta - \frac{1}{\lambda}\nabla_\phi \hat{\mathcal{L}}(\phi),$$

which is an implicit equation in $\phi$.
When the derivative $\frac{d\phi}{d\theta}$ exists, differentiating both sides with respect to $\theta$ gives:

$$\frac{d\phi}{d\theta} = I - \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}(\phi)\,\frac{d\phi}{d\theta} \;\;\Longrightarrow\;\; \frac{d\phi}{d\theta} = \left(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}(\phi)\right)^{-1}$$

This is the implicit Jacobian.
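The lemma can be sanity-checked numerically on a toy quadratic inner loss, where the proximal solution has a closed form; the quadratic $\hat{\mathcal{L}}(\phi) = \frac{1}{2}\phi^\top A \phi + b^\top \phi$ below is an illustrative assumption, not something taken from the paper.

```python
# Numerical check of the implicit Jacobian on a toy quadratic inner loss.
import numpy as np

rng = np.random.default_rng(0)
n, lam = 4, 2.0
A = rng.standard_normal((n, n)); A = A @ A.T + np.eye(n)  # SPD Hessian of L_hat
b = rng.standard_normal(n)

def alg_star(theta):
    """Closed-form minimizer of 0.5*phi^T A phi + b^T phi + lam/2 * ||phi - theta||^2."""
    return np.linalg.solve(A + lam * np.eye(n), lam * theta - b)

theta = rng.standard_normal(n)
jac_implicit = np.linalg.inv(np.eye(n) + A / lam)           # Lemma 1: (I + (1/lam) Hessian)^{-1}
jac_numeric = np.stack([                                    # central differences, column j = d phi / d theta_j
    (alg_star(theta + 1e-5 * e) - alg_star(theta - 1e-5 * e)) / 2e-5
    for e in np.eye(n)
], axis=1)
print(np.allclose(jac_implicit, jac_numeric, atol=1e-5))    # expected: True
```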
In practice
Two issues arise when using this theoretical solution in practice:
The meta-gradient requires $\mathcal{Alg}^\star_i(\theta)$, the exact solution to the inner optimization problem; in practice only an approximate solution can be obtained.
Explicitly forming and inverting the matrix in Eq. (6) to compute the Jacobian may be intractable for large deep networks.
For issue 1), we consider an approximate solution $\phi_i \approx \mathcal{Alg}^\star_i(\theta)$ to the inner optimization problem, obtained with an iterative optimization algorithm such as gradient descent.
For issue 2), we perform a partial or approximate matrix inversion, which only requires Hessian-vector products (see the sketch below).

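Putting the two fixes together, the following sketch approximates the task meta-gradient with conjugate gradient using only Hessian-vector products, so the matrix in Eq. (6) is never formed or inverted. It is a sketch under stated assumptions: `phi` is a single flattened parameter tensor with `requires_grad=True`, and `inner_loss`/`outer_loss` are hypothetical callables corresponding to $\hat{\mathcal{L}}_i$ and $\mathcal{L}_i$.

```python
# Sketch: approximate task meta-gradient via conjugate gradient on problem (7),
# using Hessian-vector products computed with double backprop.
import torch

def meta_gradient(phi, inner_loss, outer_loss, lam, cg_steps=10):
    """Approximately solve (I + (1/lam) * H) g = grad outer_loss, H = Hessian of inner_loss."""
    v = torch.autograd.grad(outer_loss(phi), phi)[0].detach()
    grad_inner = torch.autograd.grad(inner_loss(phi), phi, create_graph=True)[0]

    def matvec(x):
        # (I + H/lam) x, where H x is a Hessian-vector product of the inner loss
        hvp = torch.autograd.grad(grad_inner, phi, grad_outputs=x, retain_graph=True)[0]
        return x + hvp / lam

    g = torch.zeros_like(v); r = v.clone(); p = r.clone()
    for _ in range(cg_steps):
        Ap = matvec(p)
        alpha = (r @ r) / (p @ Ap)
        g = g + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p, r = r_new + beta * p, r_new
    return g  # approximate solution of problem (7), used as the task meta-gradient
```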
Some Questions:
Use Figure 1 to explain the differences between MAML, first-order MAML, and implicit MAML. Appendix A might be helpful for this.
I will write a separate answer for this question later, including both the intuition and the mathematical details.
In Section 3.1, it talks about the high memory cost when the number of gradient steps is large for MAML. What is the memory cost with respect to the number of gradient steps?
Using an iterative algorithm (e.g., gradient descent) for the inner-loop optimization in MAML has a drawback: the meta-gradient depends explicitly on the optimization path, which has to be fully stored in memory. The memory cost therefore grows linearly with the number of gradient steps and quickly becomes intractable when many steps are needed.
In contrast, according to Theorem 1, Algorithm 2 can be implemented using at most on the order of $\sqrt{\kappa}\,\log\!\big(\tfrac{1}{\delta}\big)$ gradient computations of $\hat{\mathcal{L}}_i$ per task and memory equal to a small constant multiple of $\mathrm{Mem}(\nabla\hat{\mathcal{L}}_i)$, the memory needed for a single gradient computation. In particular, the memory requirement does not grow with the number of inner-loop gradient steps.
Detailed proof will be given later.
Please provide a review of the conjugate gradient algorithm that is used in Section 3.1.
A separate tutorial on the conjugate gradient method will be given.
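Until then, here is a compact refresher sketch: conjugate gradient solves $Ax = b$ for a symmetric positive definite $A$ using only matrix-vector products, which is how iMAML applies it with $A = I + \frac{1}{\lambda}\nabla^2_\phi\hat{\mathcal{L}}_i(\phi_i)$. The routine and usage example below are generic illustrations, not the paper's code.

```python
# Generic conjugate gradient for an SPD system A x = b, given only x -> A x.
import numpy as np

def conjugate_gradient(matvec, b, iters=20, tol=1e-10):
    x = np.zeros_like(b)
    r = b - matvec(x)              # residual
    p = r.copy()                   # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)      # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p  # next direction, A-conjugate to the previous ones
        rs = rs_new
    return x

# Usage: solve a random SPD system and compare with a direct solve.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5)); A = M @ M.T + 5 * np.eye(5)
b = rng.standard_normal(5)
x = conjugate_gradient(lambda v: A @ v, b, iters=5)  # n iterations suffice in exact arithmetic
print(np.allclose(x, np.linalg.solve(A, b)))         # expected: True
```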
Please explain why $g_i$ can be obtained as an approximate solution to Problem (7).
According to Definition 2, $g_i$ is an approximation of the meta-gradient for task $i$.
Problem (7) is the quadratic program

$$\min_{w}\; \frac{1}{2}\, w^\top\!\left(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\right) w \;-\; w^\top \nabla_\phi \mathcal{L}_i(\phi_i) \tag{7}$$

Setting the gradient of this convex quadratic objective to zero, its exact minimizer satisfies

$$\left(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\right) w^\star = \nabla_\phi \mathcal{L}_i(\phi_i) \;\;\Longrightarrow\;\; w^\star = \left(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\right)^{-1} \nabla_\phi \mathcal{L}_i(\phi_i),$$

which by Lemma 1 is exactly the task meta-gradient. So an approximate minimizer $g_i$ of Problem (7), e.g., obtained with a few conjugate gradient iterations that only require Hessian-vector products, is an approximation of the meta-gradient for task $i$.
About Line 3 in Algorithm 2: which iterative optimization solvers can be used, and how many iterations are enough so that the approximation error $\|\phi_i - \mathcal{Alg}^\star_i(\theta)\| \le \delta$?
Possible iterative optimization solvers include:
gradient descent
Nesterov's accelerated gradient descent
If Nesterov's accelerated gradient descent is used to compute $\phi_i$, then by Theorem 2 the number of iterations needed is on the order of $\sqrt{\kappa}\,\log\!\big(\tfrac{1}{\delta}\big)$, i.e., it grows with the square root of the condition number $\kappa$ and only logarithmically in $1/\delta$.
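For reference, a minimal sketch of the constant-momentum form of Nesterov's accelerated gradient descent that could serve as such an inner solver; `grad_obj` (the gradient of $\hat{\mathcal{L}}_i(\phi) + \frac{\lambda}{2}\|\phi - \theta\|^2$), the step size, and the condition number `kappa` are all assumed inputs, not quantities specified in the paper.

```python
# Sketch of Nesterov's accelerated gradient descent with constant momentum,
# suitable for smooth, strongly convex objectives with condition number kappa.
import numpy as np

def nesterov_agd(grad_obj, phi0, lr, kappa, iters):
    momentum = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    phi, phi_prev = phi0.copy(), phi0.copy()
    for _ in range(iters):
        y = phi + momentum * (phi - phi_prev)  # look-ahead point
        phi_prev = phi
        phi = y - lr * grad_obj(y)             # gradient step at the look-ahead point
    return phi
```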
What is the relationship between $g_i$ in Definition 2 and the derivative Jacobian?
According to Definition 2, $g_i$ is an approximation of the meta-gradient for task $i$.
$\frac{d\,\mathcal{Alg}^\star_i(\theta)}{d\theta} = \left(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\right)^{-1}$ is the derivative Jacobian.
The exact task meta-gradient is this Jacobian applied to the partial gradient, $\frac{d\,\mathcal{Alg}^\star_i(\theta)}{d\theta}\, \nabla_\phi \mathcal{L}_i(\phi_i)$; $g_i$ approximates this Jacobian-vector product without ever forming or inverting the Jacobian explicitly.