There are some limitations in MAML. The meta-learning process requires higher-order derivatives, imposes a non-trivial computational and memory burden, and can suffer from vanishing gradients. These limitations make it harder to scale optimization-based meta-learning methods to tasks involving medium or large datasets, or to tasks that require many inner-loop optimization steps.
Contributions
The development of the implicit MAML (iMAML) algorithm, an approach for optimization-based meta-learning with deep neural networks that removes the need for differentiating through the optimization path.
The algorithm aims to learn a set of parameters such that an optimization algorithm that is initialized at and regularized to this parameter vector leads to good generalization for a variety of learning tasks.
Few-shot supervised learning and MAML (Bi-level optimization)
Notations:
$\theta_{ML}^*$: the optimal meta-learned parameters
$M$: the number of tasks in the meta-training set; $i$ is the index of task $\mathcal{T}_i$
$\mathcal{D}_i^{tr}$: the support set, $\mathcal{D}_i^{test}$: the query set of task $i$
$\mathcal{L}(\phi, \mathcal{D})$: the loss function for parameter vector $\phi$ and dataset $\mathcal{D}$
$\phi_i = \mathcal{A}lg(\theta, \mathcal{D}_i^{tr}) = \theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}_i^{tr})$: one (or multiple) steps of gradient descent initialized at $\theta$. [inner level of MAML]
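As a toy sketch of this inner level (the quadratic loss and the stand-in "support set" below are illustrative, not from the paper):

```python
import numpy as np

def inner_grad(phi, D):
    """Gradient of a toy task loss L(phi, D) = 0.5 * ||phi - D||^2."""
    return phi - D

def maml_inner_step(theta, D_tr, alpha=0.5, steps=1):
    """phi_i = Alg(theta, D_i^tr): `steps` gradient-descent steps from theta."""
    phi = theta.copy()
    for _ in range(steps):
        phi = phi - alpha * inner_grad(phi, D_tr)
    return phi

theta = np.zeros(3)
D_tr = np.array([1.0, -2.0, 0.5])   # illustrative stand-in for the support set
phi = maml_inner_step(theta, D_tr, alpha=0.5, steps=1)
```

In full MAML the outer loss on the query set is then backpropagated through every one of these inner steps, which is exactly what iMAML avoids.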
Proximal regularization in the inner level
Two challenges arise in MAML when, as with ill-conditioned optimization landscapes or medium-shot learning, we may want to take many gradient steps in the inner loop:
We need to store and differentiate through the long optimization path of $\mathcal{A}lg$, imposing a considerable computation and memory burden.
The dependence of the model parameters $\{\phi_i\}$ on the meta-parameters $\theta$ shrinks and vanishes as the number of gradient steps in $\mathcal{A}lg$ grows, making meta-learning difficult.
To address these, a more explicitly regularized inner-level algorithm is considered:

$\mathcal{A}lg_i^\star(\theta) = \underset{\phi' \in \Phi}{\operatorname{argmin}} \; \hat{\mathcal{L}}_i(\phi') + \frac{\lambda}{2} \|\phi' - \theta\|^2$

where $\hat{\mathcal{L}}_i$ is the loss on the support set $\mathcal{D}_i^{tr}$.
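A minimal sketch of the proximally regularized inner problem on a toy quadratic loss (all names and data are illustrative): gradient descent on the regularized objective converges to a point pulled between the task optimum and $\theta$.

```python
import numpy as np

def prox_objective_grad(phi, theta, D, lam):
    """Gradient of G(phi) = 0.5*||phi - D||^2 + (lam/2)*||phi - theta||^2."""
    return (phi - D) + lam * (phi - theta)

def alg_star(theta, D, lam, lr=0.1, steps=500):
    """Approximate Alg*(theta): minimize the proximally regularized loss."""
    phi = theta.copy()
    for _ in range(steps):
        phi -= lr * prox_objective_grad(phi, theta, D, lam)
    return phi

theta = np.zeros(2)
D = np.array([2.0, -4.0])   # illustrative task optimum
lam = 1.0
phi_star = alg_star(theta, D, lam)
# For this quadratic the closed-form minimizer is (D + lam*theta) / (1 + lam),
# i.e. an interpolation between the task optimum D and the meta-parameters theta.
```

Larger $\lambda$ pulls the solution closer to $\theta$; smaller $\lambda$ lets the task loss dominate.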
New bi-level optimization problem after proximal regularization:

$\theta_{ML}^* = \underset{\theta}{\operatorname{argmin}} \; F(\theta), \quad F(\theta) = \frac{1}{M} \sum_{i=1}^{M} \mathcal{L}_i\big(\mathcal{A}lg_i^\star(\theta)\big), \quad \mathcal{A}lg_i^\star(\theta) = \underset{\phi' \in \Phi}{\operatorname{argmin}} \; \hat{\mathcal{L}}_i(\phi') + \frac{\lambda}{2} \|\phi' - \theta\|^2 \quad (4)$
Goal: to solve the bi-level meta-learning problem in Eq (4) using an iterative gradient based algorithm of the form:
$\theta \leftarrow \theta - \eta \, d_\theta F(\theta)$
Specifically:
$\theta \leftarrow \theta - \frac{\eta}{M} \sum_{i=1}^{M} \frac{d\mathcal{A}lg_i^\star(\theta)}{d\theta} \, \nabla_\phi \mathcal{L}_i\big(\mathcal{A}lg_i^\star(\theta)\big)$
Note: many other iterative gradient-based algorithms and optimization methods could be used here.
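Putting the pieces together, here is a toy version of this outer update in which each task's loss is a quadratic around its own optimum, so that $\mathcal{A}lg_i^\star$ and the implicit Jacobian have closed forms (illustrative only, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, eta, M, d = 1.0, 0.5, 4, 3
targets = rng.normal(size=(M, d))   # toy per-task optima (stand-ins for data)

def meta_update(theta):
    """One step: theta <- theta - (eta/M) * sum_i dAlg*/dtheta . grad L_i(phi_i).
    Toy setting: L_hat_i(phi) = L_i(phi) = 0.5*||phi - target_i||^2."""
    total = np.zeros(d)
    for t in targets:
        phi = (t + lam * theta) / (1 + lam)   # closed-form Alg*_i(theta)
        jac_scale = lam / (1 + lam)           # (I + (1/lam) Hessian)^{-1} = scalar here
        total += jac_scale * (phi - t)        # Jacobian applied to grad L_i(phi)
    return theta - eta * total / M

theta = np.zeros(d)
for _ in range(200):
    theta = meta_update(theta)
# In this toy setup, theta converges to the mean of the per-task optima.
```

The point of the sketch is the structure of the update, not the toy losses: the per-task meta-gradient is the implicit Jacobian applied to the outer-loss gradient at $\phi_i$.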
Meta-Gradient Computation
In theory
Theoretically, we can compute the meta-gradient term $\frac{d\mathcal{A}lg_i^\star(\theta)}{d\theta}$ exactly using the following lemma.
Lemma 1 (Implicit Jacobian): Consider $\mathcal{A}lg_i^\star(\theta)$ as defined in Eq. (4) for task $\mathcal{T}_i$. Let $\phi_i = \mathcal{A}lg_i^\star(\theta)$ be its result. If $\big(I + \frac{1}{\lambda} \nabla_\phi^2 \hat{\mathcal{L}}_i(\phi_i)\big)$ is invertible, then the Jacobian is
$\frac{d\mathcal{A}lg_i^\star(\theta)}{d\theta} = \Big(I + \frac{1}{\lambda} \nabla_\phi^2 \hat{\mathcal{L}}_i(\phi_i)\Big)^{-1} \quad (6)$
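The lemma can be sanity-checked numerically on a small quadratic $\hat{\mathcal{L}}(\phi) = \frac{1}{2}\phi^\top A \phi$, for which $\mathcal{A}lg^\star(\theta)$ has the closed form $\lambda(A + \lambda I)^{-1}\theta$ (a toy verification, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
A = A @ A.T + np.eye(3)        # SPD Hessian of the toy quadratic loss L_hat
lam = 2.0
I = np.eye(3)

def alg_star(theta):
    # Closed-form minimizer of 0.5*phi^T A phi + (lam/2)*||phi - theta||^2
    return np.linalg.solve(A + lam * I, lam * theta)

# Finite-difference Jacobian of Alg*(theta)
theta = rng.normal(size=3)
eps = 1e-6
J_fd = np.column_stack([
    (alg_star(theta + eps * I[:, k]) - alg_star(theta - eps * I[:, k])) / (2 * eps)
    for k in range(3)
])

# Lemma 1 prediction: (I + (1/lam) * Hessian)^{-1}
J_implicit = np.linalg.inv(I + A / lam)
```

The two Jacobians agree to numerical precision, confirming the implicit formula on this example.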
Proof: We drop i subscripts in the proof for convenience.
$\phi$ is the minimizer of $G(\phi', \theta)$, namely:
$\phi = \mathcal{A}lg^\star(\theta) := \underset{\phi' \in \Phi}{\operatorname{argmin}} \; G(\phi', \theta), \quad \text{where} \quad G(\phi', \theta) = \hat{\mathcal{L}}(\phi') + \frac{\lambda}{2} \|\phi' - \theta\|^2$
According to the stationary point conditions, we have:
$\nabla_{\phi'} G(\phi, \theta) = \nabla_\phi \hat{\mathcal{L}}(\phi) + \lambda (\phi - \theta) = 0$
Differentiating both sides with respect to $\theta$ gives
$\nabla_\phi^2 \hat{\mathcal{L}}(\phi) \frac{d\phi}{d\theta} + \lambda \Big(\frac{d\phi}{d\theta} - I\Big) = 0 \;\Longrightarrow\; \frac{d\phi}{d\theta} = \Big(I + \frac{1}{\lambda} \nabla_\phi^2 \hat{\mathcal{L}}(\phi)\Big)^{-1}$
The meta-gradient requires $\mathcal{A}lg_i^\star(\theta)$, the exact solution of the inner optimization problem; in practice only an approximation can be obtained.
Explicitly forming and inverting the matrix in Eq. (6) to compute the Jacobian may be intractable for large deep neural networks.
1) We consider an approximate solution to the inner optimization problem, which can be obtained with iterative optimization algorithms such as gradient descent.
2) We perform a partial or approximate matrix inversion.
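A sketch of point 2): solving $\big(I + \frac{1}{\lambda}\nabla_\phi^2 \hat{\mathcal{L}}_i\big)\, g_i = \nabla_\phi \mathcal{L}_i$ with conjugate gradient, which needs only matrix-vector products and never forms or inverts the matrix (the SPD matrix below is a toy stand-in for the Hessian):

```python
import numpy as np

def conjugate_gradient(mvp, b, iters=20, tol=1e-10):
    """Solve A x = b using only matrix-vector products mvp(v) = A v."""
    x = np.zeros_like(b)
    r = b - mvp(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = mvp(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 4)); H = H @ H.T   # SPD stand-in for grad^2 L_hat_i(phi_i)
v = rng.normal(size=4)                     # stand-in for grad L_i(phi_i)
lam = 1.0

mvp = lambda u: u + H @ u / lam            # (I + H/lam) u, no explicit matrix inverse
g = conjugate_gradient(mvp, v)             # approximate meta-gradient g_i
```

In a deep network, `mvp` would be a Hessian-vector product computed by automatic differentiation, so the Hessian itself is never materialized.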
Some Questions:
Use Figure 1 to explain the differences between MAML, first-order MAML, and implicit MAML. Appendix A might be helpful for this.
I will write a new answer for this question independently later including intuition and math details.
In Section 3.1, it talks about the high memory cost when the number of gradient steps is large for MAML. What is the memory cost with respect to the number of gradient steps?
Using an iterative algorithm (e.g., gradient descent) for the inner-loop optimization has a drawback: the meta-gradient depends explicitly on the optimization path, which has to be fully stored in memory. The memory cost therefore grows linearly with the number of gradient steps (one copy of the parameters and intermediate state per step), quickly becoming intractable when many steps are needed.
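A toy illustration of the linear memory growth (illustrative loss, not the paper's code): differentiating through the path requires keeping every inner iterate, one parameter vector per step.

```python
import numpy as np

def gd_path(theta, grad, alpha=0.1, steps=50):
    """Inner-loop GD that stores every iterate, as backprop through Alg requires."""
    path = [theta]
    for _ in range(steps):
        path.append(path[-1] - alpha * grad(path[-1]))
    return path

grad = lambda phi: phi - np.ones(3)   # gradient of a toy quadratic loss
path = gd_path(np.zeros(3), grad, steps=50)
# Memory for path differentiation is O(steps * |phi|): 51 stored vectors here.
```

iMAML instead needs only the final iterate plus gradient workspace, independent of the number of steps.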
According to Theorem 1, Algorithm 2 can be implemented using at most
$\tilde{O}\Big(\kappa \log\big(\frac{\mathrm{poly}(\kappa, D, B, L, \rho, \mu, \lambda)}{\epsilon}\big)\Big)$ gradient computations of $\hat{\mathcal{L}}_i(\cdot)$, and
$2 \cdot \mathrm{Mem}(\nabla \hat{\mathcal{L}}_i)$ memory.
Detailed proof will be given later.
Please provide a review of the conjugate gradient algorithm that is used in Section 3.1.
Another independent tutorial of conjugate gradient will be given.
Please explain why $g_i$ can be obtained as an approximate solution to Problem (7).
According to Definition 2:
$\Big\| g_i - \Big(I + \frac{1}{\lambda} \nabla_\phi^2 \hat{\mathcal{L}}_i(\phi_i)\Big)^{-1} \nabla_\phi \mathcal{L}_i(\phi_i) \Big\| \le \delta'$
That is, $g_i$ is a $\delta'$-accurate approximation of the meta-gradient for task $i$.