MAML, FO-MAML, Reptile

Gradient-based (optimization-based) meta-learning algorithms

MAML (Bi-level Optimization)

Problem setting

Meta-Train

The goal of meta-learning is to learn meta-parameters that produce good task-specific parameters after adaptation.

Notations:

  • $\theta^*_{ML}$: optimal meta-learned parameters

  • $\phi_i$: task-specific parameters for task $i$

  • $M$: the number of tasks in meta-train; $i$: the task index

  • $\mathcal{D}^{tr}_i$: support set and $\mathcal{D}^{test}_i$: query set of task $i$

  • $\mathcal{L}(\phi, \mathcal{D})$: loss function evaluated with parameter vector $\phi$ on dataset $\mathcal{D}$

  • $\phi_i = \mathcal{A}lg(\theta, \mathcal{D}^{tr}_{i}) = \theta - \alpha\nabla_{\theta}\mathcal{L}(\theta, \mathcal{D}^{tr}_i)$: one (or multiple) steps of gradient descent initialized at $\theta$ [inner level of MAML]
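
With this notation, meta-training solves a bi-level problem: the inner level is $\mathcal{A}lg$ above, and the outer level evaluates each adapted $\phi_i$ on its query set,

$$\theta^*_{ML} = \arg\min_{\theta} \sum_{i=1}^{M} \mathcal{L}\big(\mathcal{A}lg(\theta, \mathcal{D}^{tr}_i),\, \mathcal{D}^{test}_i\big)$$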

Meta-Test

At meta-test time, the meta-learned $\theta^*_{ML}$ is adapted to a new task with the same procedure, $\phi_{new} = \mathcal{A}lg(\theta^*_{ML}, \mathcal{D}^{tr}_{new})$, and evaluated on that task's query set.

Gradient-based Meta-Learning

  • Task $t$, $\mathcal{T}_t$, is associated with a finite dataset $\mathcal{D}_{t}=\left\{\mathbf{x}_{t, n}\right\}_{n=1}^{N_t}$

  • Each task's dataset is split into $\mathcal{D}_{t}^{train}$ and $\mathcal{D}_{t}^{val}$

  • Meta parameters $\boldsymbol{\phi} \in \mathbb{R}^{D}$

  • Task-specific parameters $\boldsymbol{\theta}_{t} \in \mathbb{R}^{D}$

  • Loss function $\ell\left(\mathcal{D}_{t} ; \boldsymbol{\theta}_{t}\right)$

Algorithm 1 shows the structure of a typical gradient-based meta-learning algorithm, which could be:

  • MAML

  • iMAML

  • Reptile

  1. TASKADAPT: task adaptation (inner loop)

  2. The meta-update $\Delta_t$ specifies the contribution of task $t$ to the meta parameters (outer loop). The algorithms below differ only in how $\Delta_t$ is computed, as in the sketch after this list.
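
To make the skeleton concrete, here is a minimal JAX sketch on a toy linear-regression loss. The names `task_adapt` and `meta_delta` are hypothetical placeholders for the per-algorithm choices described below, and the loss, learning rate `beta`, and averaging scheme are illustrative assumptions rather than details from any of the papers.

```python
import jax
import jax.numpy as jnp

def loss(theta, data):
    # Toy squared-error loss standing in for l(D_t; theta_t).
    x, y = data
    return jnp.mean((x @ theta - y) ** 2)

def meta_step(phi, tasks, task_adapt, meta_delta, beta=0.01):
    """One outer-loop iteration of the Algorithm 1 skeleton.

    tasks: iterable of (d_train, d_val) pairs.
    task_adapt, meta_delta: per-algorithm choices (MAML, Reptile, iMAML).
    """
    deltas = []
    for d_train, d_val in tasks:
        theta_t = task_adapt(phi, d_train)                       # TASKADAPT (inner loop)
        deltas.append(meta_delta(phi, theta_t, d_train, d_val))  # Delta_t (outer loop)
    # Aggregate the per-task contributions and take one meta step.
    return phi - beta * sum(deltas) / len(deltas)
```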

MAML

  1. Task adaptation: minimize the training loss $\ell_t^{train}(\boldsymbol{\theta}_t)=\ell(\mathcal{D}_t^{train}; \boldsymbol{\theta}_t)$ by gradient descent w.r.t. the task parameters.

  2. Meta parameter update: gradient descent on the validation loss $\ell_t^{val}(\boldsymbol{\theta}_t)=\ell(\mathcal{D}_t^{val}; \boldsymbol{\theta}_t)$, resulting in the meta update (gradient) for task $t$: $\Delta_t^{\text{MAML}} = \nabla_{\phi}\ell_t^{\text{val}}(\boldsymbol{\theta}_t(\phi))$

This approach treats the task parameters as a function of the meta parameters and hence requires back-propagating through the entire $L$-step task adaptation process, which becomes computationally prohibitive when $L$ is large.
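
A minimal JAX sketch of this meta-gradient, reusing the toy `loss` above; `alpha` and `L` are illustrative assumptions. Because `jax.grad` is taken through all $L$ inner steps, this is exactly the expensive back-propagation just described.

```python
def task_adapt_maml(phi, d_train, alpha=0.1, L=5):
    # L steps of gradient descent on the training loss, starting from phi.
    theta = phi
    for _ in range(L):
        theta = theta - alpha * jax.grad(loss)(theta, d_train)
    return theta

def maml_delta(phi, theta_t, d_train, d_val):
    # Delta_t^MAML = grad_phi l_val(theta_t(phi)). Adaptation is re-run inside
    # the differentiated function (theta_t is not reused) so that jax.grad
    # back-propagates through every inner step.
    def val_after_adapt(p):
        return loss(task_adapt_maml(p, d_train), d_val)
    return jax.grad(val_after_adapt)(phi)
```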

Reptile

Reptile optimizes $\boldsymbol{\theta}_t$ on the entire dataset $\mathcal{D}_t$ and moves $\phi$ towards the adapted task parameters, yielding $\Delta_t^{\text{Reptile}}=\phi-\boldsymbol{\theta}_t$, as in the sketch below.
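
The same toy setup in JAX. Note that Reptile uses no train/val split, so adaptation would run on the whole $\mathcal{D}_t$; the unused `d_train`/`d_val` arguments are kept only to match the skeleton's `meta_delta` signature.

```python
def task_adapt_reptile(phi, d_task, alpha=0.1, L=5):
    # Plain SGD on the whole task dataset D_t, starting from phi.
    theta = phi
    for _ in range(L):
        theta = theta - alpha * jax.grad(loss)(theta, d_task)
    return theta

def reptile_delta(phi, theta_t, d_train=None, d_val=None):
    # Delta_t^Reptile = phi - theta_t: descending on it moves phi toward theta_t.
    return phi - theta_t
```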

iMAML

iMAML introduces an L2 regularizer $\frac{\lambda}{2}\|\boldsymbol{\theta}_t-\phi\|^2$ to the training loss and optimizes the task parameters on the regularized training loss.

Provided that this task adaptation process converges to a stationary point, implicit differentiation enables the computation of the meta gradient based only on the final solution of the adaptation process:

$$\Delta_{t}^{\mathrm{iMAML}}=\left(\mathbf{I}+\frac{1}{\lambda} \nabla_{\boldsymbol{\theta}_{t}}^{2} \ell_{t}^{\mathrm{train}}\left(\boldsymbol{\theta}_{t}\right)\right)^{-1} \nabla_{\boldsymbol{\theta}_{t}} \ell_{t}^{\mathrm{val}}\left(\boldsymbol{\theta}_{t}\right)$$
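
In practice the inverse is never formed: the linear system is solved approximately with conjugate gradient using only Hessian-vector products. A minimal JAX sketch under the same toy assumptions; `lam` and `maxiter` are illustrative choices, and `theta_t` is assumed to already solve the regularized adaptation problem.

```python
from jax.scipy.sparse.linalg import cg

def imaml_delta(phi, theta_t, d_train, d_val, lam=1.0):
    # Solve (I + (1/lam) * H) x = grad l_val for x, where H is the Hessian of
    # the training loss at theta_t, without ever materializing H.
    g_val = jax.grad(loss)(theta_t, d_val)

    def matvec(v):
        # Hessian-vector product via forward-over-reverse differentiation.
        hv = jax.jvp(jax.grad(lambda t: loss(t, d_train)), (theta_t,), (v,))[1]
        return v + hv / lam

    x, _ = cg(matvec, g_val, maxiter=20)  # approximate Delta_t^iMAML
    return x
```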

Derivation of the meta-updates

FO-MAML

First-order MAML drops the second-order terms in $\Delta_t^{\text{MAML}}$: it ignores the Jacobian of $\boldsymbol{\theta}_t$ w.r.t. $\phi$ and uses the validation gradient at the adapted parameters directly, $\Delta_t^{\text{FO-MAML}} = \nabla_{\boldsymbol{\theta}_t}\ell_t^{\text{val}}(\boldsymbol{\theta}_t)$.

Reptile

For an inner loop of $L$ SGD steps with iterates $\boldsymbol{\theta}^{(1)}=\phi, \dots, \boldsymbol{\theta}^{(L)}$, we have $\boldsymbol{\theta}_t = \phi - \alpha\sum_{l=1}^{L}\nabla\ell(\boldsymbol{\theta}^{(l)})$, so $\Delta_t^{\text{Reptile}} = \phi - \boldsymbol{\theta}_t = \alpha\sum_{l=1}^{L}\nabla\ell(\boldsymbol{\theta}^{(l)})$: the Reptile update is the scaled sum of the gradients along the adaptation trajectory.

How iMAML generalizes them

In the iMAML meta gradient, the strength $\lambda$ of the proximal regularizer controls how much curvature information is retained, as the expansion below shows.
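
Expanding the matrix inverse in $\Delta_t^{\mathrm{iMAML}}$ as a Neumann series (valid for large enough $\lambda$):

$$\left(\mathbf{I}+\tfrac{1}{\lambda}\nabla^2_{\boldsymbol{\theta}_t}\ell_t^{\mathrm{train}}\right)^{-1} = \mathbf{I}-\tfrac{1}{\lambda}\nabla^2_{\boldsymbol{\theta}_t}\ell_t^{\mathrm{train}}+O(\lambda^{-2})$$

As $\lambda \to \infty$ the iMAML update reduces to $\nabla_{\boldsymbol{\theta}_t}\ell_t^{\text{val}}(\boldsymbol{\theta}_t)$, exactly the FO-MAML update, while finite $\lambda$ keeps a curvature correction. The same proximal term $\frac{\lambda}{2}\|\boldsymbol{\theta}_t-\phi\|^2$ also pulls $\boldsymbol{\theta}_t$ back toward $\phi$, which loosely mirrors the direction $\phi-\boldsymbol{\theta}_t$ that Reptile descends.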
