ANIL (Almost No Inner Loop)

ICLR 2020 8-23-2020

Motivation

Idea of MAML: build a meta-learner that learns a set of initialization parameters useful across different tasks, so that it can adapt to a specific task quickly (within a few gradient steps) and efficiently (from only a few examples).

MAML can also be viewed as a bi-level optimization problem that requires two types of parameter updates, the inner loop and the outer loop:

The inner loop takes the meta-initialization and performs task-specific adaptation to a new task.
The outer loop updates the meta-initialization of the network parameters to a setting from which the inner loop can adapt quickly to new tasks.

Conjecture/hypothesis of the authors of ANIL:

"we can obtain the same rapid learning performance of MAML solely through feature reuse."

Rapid Learning vs Feature Reuse

Rapid learning:

"In rapid learning, the meta-initialization in the outer loop results in a parameter setting that is favorable for fast learning, thus significant adaptation to new tasks can rapidly take place in the inner loop."

Feature reuse:

"In feature reuse, the meta-initialization already contains useful features that can be reused, so little adaptation on the parameters is required in the inner loop."

"To prove feature reuse is a competitive alternative to rapid learning in MAML, the authors proposed a simplified algorithm, ANIL, where the inner loop is removed for all but the task-specific head of the underlying neural network during training and testing."

ANIL

base model/learner: a neural network architecture (e.g., a CNN)
$\theta$ : the set of meta-initialization parameters of the feature-extraction layers (the body) of the network
$w$ : the set of meta-initialization parameters of the head (the final classification layer)
$\phi_{\theta}$ : the feature extractor parametrized by $\theta$
$\hat{y} = w^{T}\phi_{\theta}(x)$ : label prediction
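As a minimal sketch of the prediction rule above, assume a toy body with one tanh unit per entry of $\theta$ and a scalar input. Both are illustrative assumptions, not the CNN used in the paper, and `features` and `predict` are hypothetical names.

```python
import math

def features(theta, x):
    """phi_theta(x): toy body with one tanh unit per entry of theta (assumed form)."""
    return [math.tanh(t * x) for t in theta]

def predict(theta, w, x):
    """y_hat = w^T phi_theta(x): the linear head applied to the body's features."""
    return sum(w_j * f_j for w_j, f_j in zip(w, features(theta, x)))

theta = [0.5, -1.0]   # body (feature-extractor) parameters
w = [0.1, 0.2]        # head parameters
y_hat = predict(theta, w, 1.0)
```

The split into `features` and `predict` mirrors the split between $\theta$ and $w$, which is what lets the inner loop treat the two parameter groups differently.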

Outer loop

Given $\theta_{i}$ and $w_i$ at iteration step $i$, the outer loop updates both sets of parameters via gradient descent:

$$
\begin{gathered}
\theta_{i+1} = \theta_i - \alpha\nabla_{\theta_i}\mathcal{L}({w^{\prime}_i}^{T}\phi_{\theta^{\prime}_i}(x), y)\\
w_{i+1} = w_i - \alpha\nabla_{w_i}\mathcal{L}({w^{\prime}_i}^{T}\phi_{\theta^{\prime}_i}(x), y)
\end{gathered}
$$

$\mathcal{L}$ : the loss for one task (or a batch of tasks), evaluated on the query set
$\alpha$ : the meta learning rate
$\theta^{\prime}_i$ : task-specific (task-adapted) parameters obtained from $\theta_i$ after one or several inner-loop steps
$w^{\prime}_i$ : task-specific (task-adapted) parameters obtained from $w_i$ after one or several inner-loop steps
$(x,y)$ : samples from the query set
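The outer update can be sketched in plain Python under some loud assumptions: a toy tanh body, a squared-error loss, a single inner step, and central finite differences standing in for automatic differentiation. All function names are hypothetical, and a real implementation would backpropagate through the inner step instead.

```python
import math

def features(theta, x):
    return [math.tanh(t * x) for t in theta]          # toy body (assumption)

def predict(theta, w, x):
    return sum(w_j * f_j for w_j, f_j in zip(w, features(theta, x)))

def mse(theta, w, data):
    return sum((predict(theta, w, x) - y) ** 2 for x, y in data) / len(data)

def adapt_head(theta, w, support, beta):
    """One ANIL inner step on the support set: only w moves, theta' = theta."""
    g = [0.0] * len(w)
    for x, y in support:
        err = predict(theta, w, x) - y
        for j, f_j in enumerate(features(theta, x)):
            g[j] += 2.0 * err * f_j / len(support)
    return [w_j - beta * g_j for w_j, g_j in zip(w, g)]

def meta_loss(theta, w, tasks, beta):
    """Query-set loss of the adapted parameters (theta, w'), averaged over tasks."""
    total = 0.0
    for support, query in tasks:
        w_prime = adapt_head(theta, w, support, beta)
        total += mse(theta, w_prime, query)
    return total / len(tasks)

def numerical_grad(f, params, eps=1e-5):
    """Central finite differences; a stand-in for autodiff in this sketch."""
    grad = []
    for j in range(len(params)):
        plus = list(params)
        plus[j] += eps
        minus = list(params)
        minus[j] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad

def outer_step(theta, w, tasks, alpha, beta):
    """One meta-update of theta and w, differentiating through the inner step."""
    g_theta = numerical_grad(lambda t: meta_loss(t, w, tasks, beta), theta)
    g_w = numerical_grad(lambda v: meta_loss(theta, v, tasks, beta), w)
    new_theta = [t - alpha * g for t, g in zip(theta, g_theta)]
    new_w = [v - alpha * g for v, g in zip(w, g_w)]
    return new_theta, new_w

# One task as a (support, query) pair of (x, y) samples; values are made up.
tasks = [([(1.0, 1.0), (2.0, 0.5)], [(1.5, 0.8), (-1.0, -0.9)])]
theta, w = [0.5, -1.0], [0.1, 0.2]
new_theta, new_w = outer_step(theta, w, tasks, alpha=0.01, beta=0.1)
```

Note that the meta-gradient is taken with respect to the initialization $(\theta_i, w_i)$, while the loss is evaluated at the adapted parameters $(\theta^{\prime}_i, w^{\prime}_i)$ on the query set.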

Inner loop (one step for illustration)

$$
\begin{gathered}
{\color{red} \theta^{\prime}_{i} = \theta_{i}} \\
w^{\prime}_i = w_i - \beta\nabla_{w_i}\mathcal{L}(w_i^{T}\phi_{\theta_i}(x), y)
\end{gathered}
$$

$\beta$ : the learning rate in the inner loop
$\mathcal{L}$ : the loss for one task (or a batch of tasks), evaluated on the support set
$(x,y)$ : samples from the support set
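A minimal sketch of this single inner step, assuming a toy tanh body and a mean-squared-error loss (both are assumptions for illustration; the function names are hypothetical):

```python
import math

def features(theta, x):
    return [math.tanh(t * x) for t in theta]          # toy body (assumption)

def predict(theta, w, x):
    return sum(w_j * f_j for w_j, f_j in zip(w, features(theta, x)))

def mse(theta, w, data):
    return sum((predict(theta, w, x) - y) ** 2 for x, y in data) / len(data)

def grad_w(theta, w, data):
    """Gradient of the mean squared error with respect to the head w."""
    g = [0.0] * len(w)
    for x, y in data:
        err = predict(theta, w, x) - y
        for j, f_j in enumerate(features(theta, x)):
            g[j] += 2.0 * err * f_j / len(data)
    return g

def anil_inner_step(theta, w, support, beta):
    """ANIL inner loop: the body is copied unchanged, only the head is updated."""
    theta_prime = list(theta)                          # theta' = theta
    g = grad_w(theta, w, support)
    w_prime = [w_j - beta * g_j for w_j, g_j in zip(w, g)]
    return theta_prime, w_prime

support = [(1.0, 1.0), (2.0, 0.5)]                     # made-up (x, y) samples
theta, w = [0.5, -1.0], [0.1, 0.2]
theta_prime, w_prime = anil_inner_step(theta, w, support, beta=0.1)
```

Because `theta_prime` is just a copy, no gradient with respect to the body is needed during adaptation.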

In contrast, the inner loop in MAML also updates $\theta$:

$$
\begin{gathered}
{\color{red} \theta^{\prime}_i = \theta_i - \beta\nabla_{\theta_i}\mathcal{L}(w_i^{T}\phi_{\theta_i}(x), y)} \\
w^{\prime}_i = w_i - \beta\nabla_{w_i}\mathcal{L}(w_i^{T}\phi_{\theta_i}(x), y)
\end{gathered}
$$
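For comparison, a MAML-style inner step can be sketched with the same assumed toy model (tanh body, squared-error loss, hypothetical function names). The only difference from the ANIL step is that the body receives a gradient update as well.

```python
import math

def features(theta, x):
    return [math.tanh(t * x) for t in theta]          # toy body (assumption)

def predict(theta, w, x):
    return sum(w_j * f_j for w_j, f_j in zip(w, features(theta, x)))

def mse(theta, w, data):
    return sum((predict(theta, w, x) - y) ** 2 for x, y in data) / len(data)

def grad_w(theta, w, data):
    """Gradient of the mean squared error with respect to the head w."""
    g = [0.0] * len(w)
    for x, y in data:
        err = predict(theta, w, x) - y
        for j, f_j in enumerate(features(theta, x)):
            g[j] += 2.0 * err * f_j / len(data)
    return g

def grad_theta(theta, w, data):
    """Gradient with respect to the body, using d tanh(u)/du = 1 - tanh(u)^2."""
    g = [0.0] * len(theta)
    for x, y in data:
        err = predict(theta, w, x) - y
        for j, f_j in enumerate(features(theta, x)):
            g[j] += 2.0 * err * w[j] * (1.0 - f_j ** 2) * x / len(data)
    return g

def maml_inner_step(theta, w, support, beta):
    """MAML inner loop: both the body and the head take a gradient step."""
    g_t = grad_theta(theta, w, support)
    g_w = grad_w(theta, w, support)
    theta_prime = [t - beta * g for t, g in zip(theta, g_t)]
    w_prime = [v - beta * g for v, g in zip(w, g_w)]
    return theta_prime, w_prime

support = [(1.0, 1.0), (2.0, 0.5)]                     # made-up (x, y) samples
theta, w = [0.5, -1.0], [0.1, 0.2]
theta_prime, w_prime = maml_inner_step(theta, w, support, beta=0.1)
```

Here `grad_theta` is the extra work that ANIL skips during adaptation.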

Advantages:

Much more computationally efficient, since the inner loop updates only the head parameters $w$ instead of all network parameters.
Performance is comparable to MAML.
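One rough way to see the efficiency argument is to count how many parameters receive an inner-loop update. The layer sizes below are hypothetical (a small four-layer conv net with a 5-way linear head) and are not taken from the paper.

```python
# Hypothetical layer sizes for a small conv net; not taken from the paper.
body = {
    "conv1": 3 * 3 * 3 * 64 + 64,     # 3x3 kernels, 3 -> 64 channels, plus biases
    "conv2": 3 * 3 * 64 * 64 + 64,    # 3x3 kernels, 64 -> 64 channels, plus biases
    "conv3": 3 * 3 * 64 * 64 + 64,
    "conv4": 3 * 3 * 64 * 64 + 64,
}
head = {"linear": 64 * 5 + 5}         # 64 features -> 5 classes, plus biases

body_count = sum(body.values())
head_count = sum(head.values())

maml_inner_updates = body_count + head_count   # MAML adapts every parameter
anil_inner_updates = head_count                # ANIL adapts the head only
fraction = anil_inner_updates / maml_inner_updates
print(f"ANIL updates {anil_inner_updates} of {maml_inner_updates} "
      f"parameters per inner step ({fraction:.2%})")
```

The count only covers parameter updates. The forward pass through the body is still needed in both algorithms, so the actual speedup depends on the implementation.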

