# ANIL (Almost No Inner Loop)

## Motivation

The idea of MAML: build a meta-learner that learns a set of initialization parameters useful across different tasks, so that it can adapt to a specific task quickly (within a few gradient steps) and efficiently (with only a few examples).

MAML can also be viewed as a bi-level optimization problem requiring two types of parameter updates, the inner loop and the outer loop:

* The inner loop takes the meta-initialization and performs task-specific adaptation to new tasks.
* The outer loop updates the meta-initialization of the neural network parameters to a setting from which the inner loop can quickly adapt to new tasks.
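The bi-level structure can be sketched on a toy problem. The snippet below is a minimal *first-order* sketch (it drops the second-order term of the true MAML outer update) on a hypothetical family of 1-D linear-regression tasks; the task family, dimensions, and learning rates are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task family (an assumption for illustration): y = a * x with a single
# scalar parameter theta, so both loops are one-line gradient steps.
def loss_grad(theta, x, y):
    return 2 * np.mean((theta * x - y) * x)

alpha, beta = 0.05, 0.1   # outer / inner learning rates (arbitrary choices)
theta = 5.0               # meta-initialization, deliberately far from the tasks

for it in range(200):
    a = rng.uniform(0.5, 1.5)                  # sample a task
    x_s, x_q = rng.normal(size=10), rng.normal(size=10)
    y_s, y_q = a * x_s, a * x_q                # support / query sets
    # Inner loop: one task-specific adaptation step on the support set.
    theta_prime = theta - beta * loss_grad(theta, x_s, y_s)
    # Outer loop: update the meta-initialization using the query-set loss
    # of the adapted parameters (first-order approximation).
    theta = theta - alpha * loss_grad(theta_prime, x_q, y_q)

print(theta)  # drifts toward the task mean a ≈ 1.0
```

The meta-initialization ends up near the center of the task family, which is exactly the "good starting point" the outer loop is meant to find.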

The conjecture/hypothesis of the ANIL authors:

"we can obtain the same rapid learning performance of MAML solely through feature reuse."

## Rapid Learning vs Feature Reuse

#### Rapid learning

"In ***rapid learning***, the meta-initialization in the outer loop results in a parameter setting that is favorable for fast learning, thus significant adaptation to new tasks can rapidly take place in the inner loop."

#### Feature reuse

"In ***feature reuse***, the meta-initialization already contains useful features that can be reused, so little adaptation on the parameters is required in the inner loop."

"To prove feature reuse is a competitive alternative to rapid learning in MAML, the authors proposed a simplified algorithm, **ANIL**, where the inner loop is removed for all **but the task-specific head of the underlying neural network during training and testing**."

## ANIL

* base model/learner: a neural network architecture (e.g., a CNN)
* $$\theta$$: the set of meta-initialization parameters of the feature-extraction layers of the neural network
* $$w$$: the set of meta-initialization parameters of the head layer (i.e., the final classification layer)
* $$\phi\_{\theta}$$: the feature extractor parametrized by $$\theta$$
* $$\hat{y} = w^{T}\phi\_{\theta}(x)$$ : label prediction
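Concretely, the prediction $$\hat{y} = w^{T}\phi\_{\theta}(x)$$ can be sketched as below; a linear map with `tanh` stands in for the CNN body, and all sizes are made-up assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): input, feature, and class sizes.
d_in, d_feat, n_classes = 8, 16, 5

theta = rng.normal(size=(d_in, d_feat))   # feature-extractor parameters
w = rng.normal(size=(d_feat, n_classes))  # head parameters

def phi(theta, x):
    """Feature extractor phi_theta(x); here a linear map followed by tanh."""
    return np.tanh(x @ theta)

def predict(theta, w, x):
    """Label prediction logits: y_hat = w^T phi_theta(x)."""
    return phi(theta, w_or_x := x) @ w

x = rng.normal(size=(4, d_in))  # a batch of 4 examples
logits = predict(theta, w, x)
print(logits.shape)  # (4, 5): one logit vector per example
```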

#### Outer loop

Given $$\theta\_i$$ and $$w\_i$$ at iteration $$i$$, the outer loop updates both sets of parameters via gradient descent:

$$
\theta\_{i+1} = \theta\_i - \alpha\nabla\_{\theta\_i}\mathcal{L}({w^{\prime}\_i}^{T}\phi\_{\theta^{\prime}\_i}(x), y)\\
w\_{i+1} = w\_i - \alpha\nabla\_{w\_i}\mathcal{L}({w^{\prime}\_i}^{T}\phi\_{\theta^{\prime}\_i}(x), y)
$$

* $$\mathcal{L}$$: the loss for one (or several) tasks, evaluated on the query set
* $$\alpha$$: the meta (outer-loop) learning rate
* $$\theta^{\prime}\_i$$: task-specific (task-adapted) parameters obtained after one or several inner-loop steps from $$\theta\_i$$
* $$w^{\prime}\_i$$: task-specific (task-adapted) parameters obtained after one or several inner-loop steps from $$w\_i$$
* $$(x,y)$$: samples from the query set
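Note that the outer gradient is taken with respect to the *meta*-parameters while the loss is evaluated at the *adapted* parameters, so the exact update differentiates through the inner loop. The 1-D sketch below (a toy squared-loss task, an assumption for illustration) makes the extra curvature term explicit and contrasts it with the first-order approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D task y = a * x with squared loss; theta is a single scalar.
a, beta = 1.3, 0.1
theta = 0.0
x_s, x_q = rng.normal(size=10), rng.normal(size=10)
y_s, y_q = a * x_s, a * x_q

grad = lambda th, x, y: 2 * np.mean((th * x - y) * x)

# Inner step on the support set.
theta_prime = theta - beta * grad(theta, x_s, y_s)

# Exact outer gradient: dL_q(theta')/dtheta = L_q'(theta') * dtheta'/dtheta,
# where dtheta'/dtheta = 1 - beta * L_s''(theta) = 1 - 2*beta*mean(x_s**2).
exact = grad(theta_prime, x_q, y_q) * (1 - 2 * beta * np.mean(x_s ** 2))

# The first-order approximation simply drops that second factor.
first_order = grad(theta_prime, x_q, y_q)
print(abs(exact - first_order) > 0)  # True: they differ by the curvature term
```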

#### Inner loop (one step for illustration)

$$
{\color{red} \theta^{\prime}\_i = \theta\_i}  \\
w^{\prime}\_i = w\_i - \beta\nabla\_{w\_i}\mathcal{L}(w\_i^{T}\phi\_{\theta\_i}(x), y)
$$

* $$\beta$$: the inner-loop learning rate
* $$\mathcal{L}$$: the loss for one (or several) tasks, evaluated on the support set
* $$(x,y)$$: samples from the support set
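A minimal sketch of this step, assuming a linear-plus-`tanh` feature extractor, a softmax/cross-entropy head, and made-up dimensions (none of these come from the paper): only `w` receives a gradient step, while `theta` is returned untouched.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_feat, n_classes = 8, 16, 5      # hypothetical sizes
theta = rng.normal(size=(d_in, d_feat))
w = rng.normal(size=(d_feat, n_classes))

def phi(theta, x):
    return np.tanh(x @ theta)           # stand-in for the CNN body

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(theta, w, x, y):
    p = softmax(phi(theta, x) @ w)
    return -np.log(p[np.arange(len(y)), y]).mean()

def anil_inner_step(theta, w, x_s, y_s, beta=0.1):
    """One ANIL inner-loop step: theta is frozen, only the head w adapts."""
    feats = phi(theta, x_s)
    p = softmax(feats @ w)
    p[np.arange(len(y_s)), y_s] -= 1.0      # d loss / d logits
    grad_w = feats.T @ p / len(y_s)         # cross-entropy gradient w.r.t. w
    return theta, w - beta * grad_w         # theta'_i = theta_i (unchanged)

x_s = rng.normal(size=(10, d_in))           # support set
y_s = rng.integers(0, n_classes, size=10)
theta2, w2 = anil_inner_step(theta, w, x_s, y_s)
print(np.array_equal(theta2, theta))        # True: the body never moved
```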

In contrast, the inner loop in MAML updates both sets of parameters:

$$
{\color{red} \theta^{\prime}\_i = \theta\_i - \beta\nabla\_{\theta\_i}\mathcal{L}(w\_i^{T}\phi\_{\theta\_i}(x), y)} \\
w^{\prime}\_i = w\_i - \beta\nabla\_{w\_i}\mathcal{L}(w\_i^{T}\phi\_{\theta\_i}(x), y)
$$
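Under the same toy assumptions as the ANIL sketch above (linear-plus-`tanh` features, softmax head, made-up sizes), the MAML inner step additionally backpropagates into the body, so `theta` moves too:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_feat, n_classes = 8, 16, 5      # hypothetical sizes
theta = rng.normal(size=(d_in, d_feat))
w = rng.normal(size=(d_feat, n_classes))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def maml_inner_step(theta, w, x_s, y_s, beta=0.1):
    """One MAML inner-loop step: BOTH theta and w adapt
    (whereas ANIL would leave theta untouched)."""
    n = len(y_s)
    feats = np.tanh(x_s @ theta)            # phi_theta(x)
    p = softmax(feats @ w)
    p[np.arange(n), y_s] -= 1.0             # d loss / d logits
    grad_w = feats.T @ p / n
    d_feats = p @ w.T                       # backprop into the features...
    grad_theta = x_s.T @ (d_feats * (1 - feats ** 2)) / n  # ...through tanh
    return theta - beta * grad_theta, w - beta * grad_w

x_s = rng.normal(size=(10, d_in))
y_s = rng.integers(0, n_classes, size=10)
theta_p, w_p = maml_inner_step(theta, w, x_s, y_s)
print(np.allclose(theta_p, theta))          # False: unlike ANIL, the body moved
```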

### Advantages

* ANIL is much more computationally efficient than MAML, since far fewer parameters (only the head) are updated in the inner loop.
* Its performance is comparable to MAML's.

## References

* <https://openreview.net/forum?id=rkgMkCEtPB>
* <https://maithraraghu.com/assets/files/RapidLearningFeatureReuse.pdf>
* <https://maithraraghu.com/assets/files/ICLR2020_ANIL.pdf>
* <http://learn2learn.net/tutorials/anil_tutorial/ANIL_tutorial/>
