MAML has some notable limitations. Its meta-learning process requires differentiating through the inner optimization, which involves higher-order derivatives, imposes a non-trivial computational and memory burden, and can suffer from vanishing gradients. These limitations make it hard to scale optimization-based meta-learning to tasks with medium or large datasets, or to tasks that require many inner-loop optimization steps.
Contributions
The development of the implicit MAML (iMAML) algorithm, an approach for optimization-based meta-learning with deep neural networks that removes the need for differentiating through the optimization path.
The algorithm aims to learn a set of parameters such that an optimization algorithm that is initialized at and regularized to this parameter vector leads to good generalization for a variety of learning tasks.
Few-shot supervised learning and MAML (Bi-level optimization)
Notations:
Proximal regularization in the inner level
Two challenges arise in MAML when many inner-loop gradient steps are needed, e.g. with ill-conditioned optimization landscapes or in medium-shot learning:
A more explicitly regularized algorithm is considered:
New Bi-level optimization problem after proximal regularization
Simplify the notation:
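For reference (my restatement, following the paper's notation rather than a quote): with the shorthand $\hat{\mathcal{L}}_i(\phi) \equiv \mathcal{L}(\phi, \mathcal{D}^{tr}_i)$, $\mathcal{L}_i(\phi) \equiv \mathcal{L}(\phi, \mathcal{D}^{test}_i)$, and $\mathcal{A}lg^\star_i(\theta) \equiv \mathcal{A}lg^\star(\theta, \mathcal{D}^{tr}_i)$, the proximally regularized bi-level problem (Eq. 4) reads

$$\theta^\star_{ML} := \arg\min_{\theta}\; \frac{1}{M}\sum_{i=1}^{M} \mathcal{L}_i\big(\mathcal{A}lg^\star_i(\theta)\big),
\qquad
\mathcal{A}lg^\star_i(\theta) := \arg\min_{\phi' \in \Phi}\; \hat{\mathcal{L}}_i(\phi') + \frac{\lambda}{2}\,\|\phi' - \theta\|^2 .$$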
Total and Partial Derivatives:
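To recall the distinction (my own summary, not a quote from the paper): $\nabla_\phi \mathcal{L}_i(\phi)$ is the partial derivative of the loss with respect to its argument, evaluated at $\phi = \mathcal{A}lg^\star_i(\theta)$, whereas the total derivative with respect to $\theta$ also accounts for how $\mathcal{A}lg^\star_i(\theta)$ itself depends on $\theta$:

$$\frac{d}{d\theta}\,\mathcal{L}_i\big(\mathcal{A}lg^\star_i(\theta)\big) = \frac{d\mathcal{A}lg^\star_i(\theta)}{d\theta}\;\nabla_\phi \mathcal{L}_i\big(\mathcal{A}lg^\star_i(\theta)\big).$$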
Implicit MAML
Goal: to solve the bi-level meta-learning problem in Eq. (4) using an iterative gradient-based algorithm of the form $\theta \leftarrow \theta - \eta\,\frac{dF(\theta)}{d\theta}$, where $F(\theta) = \frac{1}{M}\sum_{i=1}^{M}\mathcal{L}_i\big(\mathcal{A}lg^\star_i(\theta)\big)$ is the outer-level objective.
Specifically, by the chain rule the meta-gradient is $\frac{dF(\theta)}{d\theta} = \frac{1}{M}\sum_{i=1}^{M} \frac{d\mathcal{A}lg^\star_i(\theta)}{d\theta}\,\nabla_\phi \mathcal{L}_i\big(\mathcal{A}lg^\star_i(\theta)\big)$, so we need the Jacobian $\frac{d\mathcal{A}lg^\star_i(\theta)}{d\theta}$.
Note: many iterative gradient-based algorithms and other optimization methods could be used for this outer-level update.
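As a concrete illustration of this outer loop, the sketch below assumes a hypothetical helper `task_meta_gradient(theta, task)` that returns the per-task meta-gradient $g_i$; how that quantity is actually computed is the subject of the meta-gradient section that follows. Names and the learning rate are illustrative, not from the paper.

```python
# Sketch of the outer (meta) update: theta <- theta - eta * (1/M) * sum_i g_i.
import jax.numpy as jnp

def outer_step(theta, tasks, task_meta_gradient, eta=1e-3):
    """One meta-update over a batch of M sampled tasks.

    `task_meta_gradient(theta, task)` is assumed to return the per-task
    meta-gradient g_i (e.g. via the implicit gradient of Eq. 6)."""
    grads = [task_meta_gradient(theta, task) for task in tasks]
    meta_grad = jnp.mean(jnp.stack(grads), axis=0)   # average over the M tasks
    return theta - eta * meta_grad                   # gradient step on the outer objective
```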
Meta-Gradient Computation
In theory
According to the stationary point condition $\nabla_\phi G(\phi_i, \theta) = 0$, we have $\phi_i = \theta - \frac{1}{\lambda}\nabla_\phi \hat{\mathcal{L}}_i(\phi_i)$,
which is an implicit equation for $\phi_i$.
When the derivative exists:
Implicit Jacobian
In practice
Two issues arise when using this theoretical solution in practice:
1) The meta-gradient requires $\mathcal{A}lg^\star_i(\theta)$, the exact solution of the inner problem, which in practice can only be approximated. To address this, we consider an approximate solution to the inner optimization problem, obtained with an iterative optimization algorithm such as gradient descent (red).
2) Explicitly forming and inverting the matrix in Eq. 6 to compute the Jacobian may be intractable for large deep networks. To address this, we perform a partial or approximate matrix inversion (green).
Some Questions:
Use Figure 1 to explain the differences between MAML, first-order MAML, and implicit MAML. Appendix A might be helpful for this.
I will write a separate answer for this question later, including intuition and mathematical details.
Section 3.1 mentions the high memory cost of MAML when the number of gradient steps is large. What is the memory cost as a function of the number of gradient steps?
Using an iterative algorithm (e.g. gradient descent) for the inner-loop optimization has a drawback: the meta-gradient depends explicitly on the optimization path, which has to be fully stored in memory, so the memory cost grows linearly with the number of inner gradient steps and quickly becomes intractable when many steps are needed.
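To make the cost concrete, here is a toy sketch (mine, not the paper's code) of differentiating through K unrolled inner gradient steps, as MAML does; reverse-mode autodiff must keep every intermediate iterate alive, so memory grows roughly linearly in `num_steps`.

```python
# Differentiating through an unrolled inner loop (MAML-style), for illustration only.
import jax
import jax.numpy as jnp

def inner_loss(phi, x, y):
    # Toy regression loss; stands in for the support-set loss L_hat_i.
    return jnp.mean((x @ phi - y) ** 2)

def unrolled_adapt(theta, x, y, alpha=0.1, num_steps=50):
    # K gradient steps starting from theta; reverse mode stores all K iterates.
    phi = theta
    for _ in range(num_steps):
        phi = phi - alpha * jax.grad(inner_loss)(phi, x, y)
    return inner_loss(phi, x, y)   # stands in for the query-set loss

# d(outer loss)/d(theta) by backpropagating through the whole optimization path.
meta_grad_fn = jax.grad(unrolled_adapt)

theta0 = jnp.zeros(3)
x, y = jnp.ones((8, 3)), jnp.ones(8)
g = meta_grad_fn(theta0, x, y)     # memory scales with num_steps
```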
According to Theorem 1, Algorithm 2 can be implemented using at most $\tilde{O}\!\left(\kappa \log\!\left(\frac{\mathrm{poly}(\kappa, D, B, L, \rho, \mu, \lambda)}{\epsilon}\right)\right)$ gradient computations of $\hat{\mathcal{L}}_i(\cdot)$ and $2\cdot\mathrm{Mem}(\nabla\hat{\mathcal{L}}_i)$ memory.
A detailed proof will be given later.
Please provide a review of the conjugate gradient algorithm that is used in Section 3.1.
A separate tutorial on the conjugate gradient method will be given.
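Until then, here is a minimal sketch of the conjugate gradient method for solving $Ax = b$ with $A$ symmetric positive definite. Note that it only needs matrix-vector products with $A$, which is exactly why Section 3.1 can use it with Hessian-vector products instead of an explicit Hessian. The implementation below is a generic textbook version, not code from the paper.

```python
# Minimal conjugate gradient for A x = b (A symmetric positive definite).
import jax.numpy as jnp

def conjugate_gradient(matvec, b, num_iters=20, tol=1e-10):
    """Solve A x = b given only the map x -> A x."""
    x = jnp.zeros_like(b)
    r = b - matvec(x)              # residual
    p = r                          # search direction
    rs_old = jnp.dot(r, r)
    for _ in range(num_iters):
        Ap = matvec(p)
        alpha = rs_old / jnp.dot(p, Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = jnp.dot(r, r)
        if rs_new < tol:           # residual small enough: stop early
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```

In practice one can also call a library routine such as `jax.scipy.sparse.linalg.cg` instead of writing the loop by hand.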
According to Definition 2:
In order to approximately solve the inner-level problem $\mathcal{A}lg^\star_i(\theta) = \arg\min_{\phi' \in \Phi}\; \hat{\mathcal{L}}_i(\phi') + \frac{\lambda}{2}\|\phi' - \theta\|^2$, we can use an iterative optimization solver. Possible choices include (see the sketch after this list):
gradient descent
Nesterov's accelerated gradient descent
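As an illustration of the first option, here is a minimal sketch (my own, with illustrative hyperparameters) of plain gradient descent on the proximally regularized inner objective; Nesterov acceleration or any other solver could be substituted for the plain gradient steps.

```python
# Gradient descent on the proximally regularized inner objective (illustrative sketch).
import jax
import jax.numpy as jnp

def proximal_inner_solve(loss_fn, theta, lam=1.0, lr=0.05, num_steps=100):
    """Approximately minimize  G(phi) = loss_fn(phi) + (lam/2) * ||phi - theta||^2,
    returning an approximate phi_i ~= Alg*_i(theta)."""
    def reg_objective(phi):
        return loss_fn(phi) + 0.5 * lam * jnp.sum((phi - theta) ** 2)

    grad_fn = jax.grad(reg_objective)
    phi = theta                        # initialize at the meta-parameters
    for _ in range(num_steps):
        phi = phi - lr * grad_fn(phi)
    return phi
```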
$\theta^\star_{ML}$: the optimal meta-learned parameters
$M$: the number of tasks in meta-training; $i$: the task index
$\mathcal{D}^{tr}_i$: the support set and $\mathcal{D}^{test}_i$: the query set of task $i$
$\mathcal{L}(\phi, \mathcal{D})$: the loss function for parameter vector $\phi$ and dataset $\mathcal{D}$
$\phi_i = \mathcal{A}lg(\theta, \mathcal{D}^{tr}_i) = \theta - \alpha\,\nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{tr}_i)$: one (or multiple) steps of gradient descent initialized at $\theta$ [the inner level of MAML]
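A minimal sketch of this inner-level adaptation step in code, with a toy squared-error loss standing in for $\mathcal{L}$ (the loss, data shapes, and the name `inner_step` are illustrative, not from the paper):

```python
# One MAML inner-loop step: phi_i = theta - alpha * grad_theta L(theta, D_i^tr).
import jax
import jax.numpy as jnp

def loss(params, data):
    # Toy squared-error loss standing in for L(phi, D); data = (inputs, targets).
    x, y = data
    return jnp.mean((x @ params - y) ** 2)

def inner_step(theta, support_data, alpha=0.1):
    return theta - alpha * jax.grad(loss)(theta, support_data)
```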
We need to store and differentiate through the long optimization path of $\mathcal{A}lg$, which imposes a considerable computation and memory burden.
The dependence of the model parameters $\{\phi_i\}$ on the meta-parameters $\theta$ shrinks and vanishes as the number of gradient steps in $\mathcal{A}lg$ grows, making meta-learning difficult.
$\nabla_\phi \mathcal{L}_i\big(\mathcal{A}lg^\star_i(\theta)\big)$ can easily be obtained in practice via automatic differentiation.
$\frac{d\mathcal{A}lg^\star_i(\theta)}{d\theta}$ presents the primary challenge, since $\mathcal{A}lg^\star_i(\theta)$ is defined implicitly as the solution of the inner optimization problem in Eq. 4.
In theory, we can compute the Jacobian $\frac{d\mathcal{A}lg^\star_i(\theta)}{d\theta}$ exactly using the following lemma.
Lemma 1 (Implicit Jacobian). Consider $\mathcal{A}lg^\star_i(\theta)$ as defined in Eq. 4 for task $\mathcal{T}_i$, and let $\phi_i = \mathcal{A}lg^\star_i(\theta)$ be its result. If $\big(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\big)$ is invertible, then the Jacobian is
$$\frac{d\mathcal{A}lg^\star_i(\theta)}{d\theta} = \Big(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\Big)^{-1}. \qquad (6)$$
Proof. We drop the subscript $i$ for convenience. $\phi$ is the minimizer of $G(\phi', \theta)$, namely
$$\phi = \mathcal{A}lg^\star(\theta) := \arg\min_{\phi' \in \Phi} G(\phi', \theta), \qquad \text{where } G(\phi', \theta) = \hat{\mathcal{L}}(\phi') + \frac{\lambda}{2}\|\phi' - \theta\|^2.$$
The stationarity condition $\nabla_{\phi'} G(\phi, \theta) = 0$ gives $\nabla_\phi \hat{\mathcal{L}}(\phi) + \lambda(\phi - \theta) = 0$, i.e. $\phi = \theta - \frac{1}{\lambda}\nabla_\phi \hat{\mathcal{L}}(\phi)$. Differentiating both sides with respect to $\theta$ gives $\frac{d\phi}{d\theta} = I - \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}(\phi)\,\frac{d\phi}{d\theta}$, hence $\big(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}(\phi)\big)\frac{d\phi}{d\theta} = I$, and when this matrix is invertible we obtain Eq. 6. $\square$
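As a quick numerical sanity check of Eq. 6 (my own toy example, not from the paper): for a quadratic inner loss $\hat{\mathcal{L}}(\phi) = \frac{1}{2}\phi^\top A\phi$, the regularized inner problem has a closed-form solution, and its Jacobian with respect to $\theta$ computed by autodiff matches $\big(I + \frac{1}{\lambda}\nabla^2_\phi\hat{\mathcal{L}}\big)^{-1}$.

```python
# Numerical check of the implicit Jacobian formula (Eq. 6) on a toy quadratic.
import jax
import jax.numpy as jnp

lam = 2.0
A = jnp.array([[3.0, 1.0],
               [1.0, 2.0]])          # Hessian of L_hat; symmetric positive definite

def inner_solution(theta):
    # For L_hat(phi) = 0.5 * phi^T A phi, minimizing
    # L_hat(phi) + (lam/2)||phi - theta||^2 gives phi = (A + lam*I)^{-1} (lam*theta).
    return jnp.linalg.solve(A + lam * jnp.eye(2), lam * theta)

theta = jnp.array([1.0, -1.0])
jac_autodiff = jax.jacobian(inner_solution)(theta)      # d Alg*(theta) / d theta
jac_implicit = jnp.linalg.inv(jnp.eye(2) + A / lam)     # (I + (1/lam) * Hessian)^{-1}
assert jnp.allclose(jac_autodiff, jac_implicit, atol=1e-5)
```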
The meta-gradient requires $\mathcal{A}lg^\star_i(\theta)$, the exact solution of the inner optimization problem; in practice only an approximation can be obtained. Theorem 1 quantifies the resulting cost: at most
$\tilde{O}\!\left(\kappa \log\!\left(\frac{\mathrm{poly}(\kappa, D, B, L, \rho, \mu, \lambda)}{\epsilon}\right)\right)$ gradient computations of $\hat{\mathcal{L}}_i(\cdot)$, and
$2 \cdot \mathrm{Mem}(\nabla \hat{\mathcal{L}}_i)$ memory.
Please explain why $g_i$ can be obtained as an approximate solution to Problem (7).
$$\Big\| g_i - \Big(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\Big)^{-1} \nabla_\phi \mathcal{L}_i(\phi_i) \Big\| \le \delta'$$
$g_i$ is an approximation of the meta-gradient for task $i$: up to the error $\delta'$, it solves the linear system $\big(I + \frac{1}{\lambda}\nabla^2_\phi \hat{\mathcal{L}}_i(\phi_i)\big)\, g = \nabla_\phi \mathcal{L}_i(\phi_i)$, which corresponds to Problem (7).
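One way to see this in code (a sketch of the CG-with-Hessian-vector-products idea from Section 3.1, not the paper's implementation; names and hyperparameters are illustrative): rather than forming and inverting the matrix in Eq. 6, solve the linear system above approximately with conjugate gradient, where each iteration needs only a Hessian-vector product of $\hat{\mathcal{L}}_i$.

```python
# Approximate task meta-gradient g_i via CG on (I + (1/lam) H) g = v,
# using Hessian-vector products instead of an explicit Hessian.
import jax
from jax.scipy.sparse.linalg import cg

def task_meta_gradient(train_loss, test_loss, phi_i, lam=1.0, cg_iters=20):
    """`train_loss` plays the role of L_hat_i (support set), `test_loss` of L_i (query set);
    phi_i is the (approximate) inner solution."""
    v = jax.grad(test_loss)(phi_i)                # right-hand side: grad of outer loss at phi_i

    def hvp(u):
        # Hessian-vector product of the inner loss, without materializing the Hessian.
        return jax.jvp(jax.grad(train_loss), (phi_i,), (u,))[1]

    def matvec(u):
        return u + hvp(u) / lam                   # (I + (1/lam) * H) u

    g_i, _ = cg(matvec, v, maxiter=cg_iters)      # a few CG steps give an approximate g_i
    return g_i
```

Each CG iteration needs only Hessian-vector products and gradient-sized memory, which is what makes the memory bound quoted above possible.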