# MAML, FO-MAML, Reptile

## MAML (Bi-level Optimization)

### Problem setting

### Meta-Train

![The goal of meta-learning is to learn meta-parameters that produce good task specific parameters after adaptation.](https://1687130946-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MEnQbUIupyAn8eMmrmG%2F-MF8xH7BOfaktMQEd7Ow%2F-MF8z_YBtXgQpcIW6UBe%2Fequ1.png?alt=media\&token=63f42d1f-129b-4c73-a6fb-6a8b5f63809b)

#### Notations:

* $$\theta^\*\_{ML}$$ : the optimal meta-learned parameters
* $$\phi\_i$$ : task-specific parameters for task $$i$$
* $$M$$ : the number of tasks in meta-train; $$i$$ indexes the tasks
* $$\mathcal{D}^{tr}\_i$$ : support set and $$\mathcal{D}^{test}\_i$$ : query set of task $$i$$
* $$\mathcal{L}(\phi, \mathcal{D})$$ : loss function given a parameter vector and a dataset
* $$\phi\_i = \mathcal{A}lg(\theta, \mathcal{D}^{tr}\_{i})=\theta - \alpha\nabla\_{\theta}\mathcal{L}(\theta,\mathcal{D}^{tr}\_i)$$ : one (or multiple) steps of gradient descent initialized at $$\theta$$. \[**inner-level of MAML**]
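The inner-level update above can be sketched numerically. This is a minimal, illustrative example: the quadratic loss $$\mathcal{L}(\theta,\mathcal{D})=\frac{1}{2}\|\theta-c\|^2$$ and the center `c` standing in for the task's support data are assumptions, not part of MAML itself.

```python
import numpy as np

def loss_grad(theta, c):
    # gradient of a toy quadratic loss L(theta, D) = 0.5 * ||theta - c||^2,
    # where the center c stands in for the task's support data D^tr_i
    return theta - c

def adapt(theta, c, alpha, steps=1):
    # inner level: phi_i = Alg(theta, D^tr_i), one or more gradient steps from theta
    phi = theta.copy()
    for _ in range(steps):
        phi = phi - alpha * loss_grad(phi, c)
    return phi

theta = np.zeros(2)                              # meta initialization
phi_i = adapt(theta, c=np.array([1.0, -1.0]), alpha=0.5)
# a single step with alpha = 0.5 moves phi_i halfway toward the task optimum c
```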

### Meta-test

## Gradient-based Meta-Learning

* Task $$t$$, $$\mathcal{T}\_t$$, is associated with a finite dataset $$\mathcal{D}\_{t}=\left\{\mathbf{x}\_{t, n}\right\}\_{n=1}^{N\_t}$$
* Task $$\mathcal{T}\_t$$'s dataset is split into $$\mathcal{D}\_{t}^{train}$$ and $$\mathcal{D}\_{t}^{val}$$
* meta parameters $$\boldsymbol{\phi} \in \mathbb{R}^{D}$$
* task-specific parameters $$\boldsymbol{\theta}\_{t} \in \mathbb{R}^{D}$$
* loss function $$\ell\left(\mathcal{D}\_{t} ; \boldsymbol{\theta}\_{t}\right)$$

![](https://1687130946-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MEnQbUIupyAn8eMmrmG%2F-MG1ytVrabx-stcHcDXA%2F-MG2GQq565UtOb53Y8sR%2Falgo1.png?alt=media\&token=aad85e43-c329-405a-bce6-0ddce8a1eefc)

**Algorithm 1** shows the generic structure of a gradient-based meta-learning algorithm, which can be instantiated as:

* MAML
* iMAML
* Reptile

1. TASKADAPT: task adaptation (**inner loop**)
2. The meta-update $$\Delta\_t$$ specifies the contribution of task $$t$$ to the meta parameters. (**outer loop**)

### MAML

1. Task adaptation: minimize the training loss $$\ell\_t^{train}(\boldsymbol{\theta}\_t)=\ell(\mathcal{D}\_t^{train}; \boldsymbol{\theta}\_t)$$ by gradient descent w.r.t. the task parameters.
2. Meta-parameter update: gradient descent on the validation loss $$\ell\_t^{val}(\boldsymbol{\theta}\_t)=\ell(\mathcal{D}\_t^{val}; \boldsymbol{\theta}\_t)$$, giving the meta update (gradient) for task $$t$$: $$\Delta\_t^{\text{MAML}} = \nabla\_{\phi}\ell\_t^{\text{val}}(\boldsymbol{\theta}\_t(\phi))$$

This approach treats the task parameters as a function of the meta parameters, and hence requires back-propagation through the entire $$L$$-step task adaptation process. When $$L$$ is large, this becomes computationally prohibitive.
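Differentiating through the adaptation can be made concrete on a 1-D quadratic task, where the chain rule factor $$\frac{\partial \boldsymbol{\theta}\_t}{\partial \phi}$$ is available in closed form. The loss centers `c_tr`, `c_val` and step size `alpha` below are illustrative assumptions; a finite-difference check confirms the back-propagated gradient.

```python
alpha = 0.1
c_tr, c_val = 1.0, 2.0   # toy 1-D task: centers of the train/val quadratic losses

def inner(phi):
    # one inner gradient step on l_train(theta) = 0.5 * (theta - c_tr)^2
    return phi - alpha * (phi - c_tr)

def val_loss(theta):
    return 0.5 * (theta - c_val) ** 2

phi = 0.0
theta = inner(phi)

# chain rule through the adaptation step:
# d l_val / d phi = (d theta / d phi) * l_val'(theta) = (1 - alpha) * (theta - c_val)
maml_grad = (1 - alpha) * (theta - c_val)

# finite-difference check of back-propagating through the inner loop
eps = 1e-6
fd = (val_loss(inner(phi + eps)) - val_loss(inner(phi - eps))) / (2 * eps)
```

The factor $$(1-\alpha)$$ is exactly the Jacobian of the one-step adaptation for this quadratic; with $$L$$ steps it becomes a product of $$L$$ such Jacobians, which is what makes long inner loops expensive.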

### Reptile

Reptile optimizes $$\boldsymbol{\theta}\_t$$ on the entire dataset $$\mathcal{D}\_t$$ and moves $$\phi$$ towards the adapted task parameters, yielding $$\Delta\_t^{\text{Reptile}}=\phi-\boldsymbol{\theta}\_t$$
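A minimal numeric sketch of this update, assuming two toy quadratic tasks with optima at 1 and 3 (all constants are illustrative): each meta iteration adapts from $$\phi$$, then moves $$\phi$$ against the averaged $$\Delta\_t^{\text{Reptile}}$$.

```python
import numpy as np

def adapt(phi, c, alpha=0.5, steps=3):
    # several SGD steps on the toy task loss 0.5 * ||theta - c||^2,
    # started from the current meta parameters phi
    theta = phi.copy()
    for _ in range(steps):
        theta = theta - alpha * (theta - c)
    return theta

tasks = [np.array([1.0]), np.array([3.0])]       # toy tasks with optima 1 and 3
phi, beta = np.array([0.0]), 0.2
for _ in range(200):
    deltas = [phi - adapt(phi, c) for c in tasks]   # Delta_t^Reptile = phi - theta_t
    phi = phi - beta * np.mean(deltas, axis=0)      # move phi toward adapted parameters
# phi converges to the midpoint of the task optima (2.0)
```

Note that no gradient is ever taken with respect to $$\phi$$; the adapted parameters themselves define the meta update, which is why Reptile needs no second-order terms.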

### iMAML

iMAML introduces an L2 regularizer $$\frac{\lambda}{2}||\boldsymbol{\theta}\_t-\phi||^2$$ to the training loss and optimizes the task parameters on this regularized loss.

Provided that this task adaptation process converges to a stationary point, implicit differentiation enables computing the meta gradient from only the final solution of the adaptation process: $$\Delta\_{t}^{\mathrm{iMAML}}=\left(\mathbf{I}+\frac{1}{\lambda} \nabla\_{\boldsymbol{\theta}\_{t}}^{2} \ell\_{t}^{\mathrm{train}}\left(\boldsymbol{\theta}\_{t}\right)\right)^{-1} \nabla\_{\boldsymbol{\theta}\_{t}} \ell\_{t}^{\mathrm{val}}\left(\boldsymbol{\theta}\_{t}\right)$$
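On a 1-D quadratic task the implicit formula can be verified against direct differentiation, since the regularized minimizer $$\boldsymbol{\theta}\_t(\phi)$$ is available in closed form. All constants below (`lam`, `c_tr`, `c_val`) are illustrative assumptions; the train-loss Hessian here is simply 1.

```python
lam = 2.0                          # regularization strength lambda
c_tr, c_val, phi = 1.0, 3.0, 0.0   # toy 1-D quadratic task; Hessian of l_train is 1

# closed-form minimizer of the regularized train loss
# 0.5 * (theta - c_tr)^2 + lam/2 * (theta - phi)^2
theta = (c_tr + lam * phi) / (1 + lam)

# implicit meta gradient: (I + (1/lam) * H)^{-1} * grad_theta l_val, with H = 1
delta_imaml = (theta - c_val) / (1 + 1 / lam)

# cross-check by differentiating theta(phi) directly:
# d theta / d phi = lam / (1 + lam), so the exact meta gradient is
exact = lam / (1 + lam) * (theta - c_val)
```

Both expressions agree, and only the converged $$\boldsymbol{\theta}\_t$$ enters the computation; no unrolled adaptation trajectory is stored.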

### Derivative process

## FO-MAML

First-order MAML (FO-MAML) approximates $$\Delta\_t^{\text{MAML}}$$ by ignoring the Jacobian of the adaptation step, i.e. treating $$\boldsymbol{\theta}\_t$$ as if it did not depend on $$\phi$$, so the meta update is simply $$\nabla\_{\boldsymbol{\theta}\_t}\ell\_t^{\text{val}}(\boldsymbol{\theta}\_t)$$. This avoids all second-order derivatives at the cost of a biased meta gradient.

## Reptile

## How iMAML generalizes them
