Modular Meta-Learning with Shrinkage

8-24-2020

Motivation

The ability to meta-learn large models with only a few task-specific components is important in many real-world problems:

  • multi-speaker text-to-speech synthesis

Updating only these task-specific modules allows the model to be adapted to low-data tasks for as many steps as necessary without risking overfitting.

Existing meta-learning approaches either do not scale to long adaptation horizons or rely on handcrafted task-specific architectures.

A new meta-learning approach based on Bayesian shrinkage is proposed to automatically discover and learn both task-specific and general reusable modules.

It works well in the few-shot text-to-speech domain.

MAML, iMAML, and Reptile are special cases of the proposed method.

Gradient-based Meta-Learning

  • Task $\mathcal{T}_t$ is associated with a finite dataset $\mathcal{D}_t = \{\mathbf{x}_{t,n}\}_{n=1}^{N_t}$

  • Task $\mathcal{T}_t$'s data is split into $\mathcal{D}_t^{\text{train}}$ and $\mathcal{D}_t^{\text{val}}$

  • Meta parameters $\boldsymbol{\phi} \in \mathbb{R}^D$

  • Task-specific parameters $\boldsymbol{\theta}_t \in \mathbb{R}^D$

  • Loss function $\ell(\mathcal{D}_t; \boldsymbol{\theta}_t)$

Algorithm 1 gives the structure of a typical meta-learning algorithm (see the sketch after the lists below), which could be:

  • MAML

  • iMAML

  • Reptile

  1. TASKADAPT: task adaptation (inner loop)

  2. The meta update $\Delta_t$ specifies the contribution of task $t$ to the meta parameters (outer loop)
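Schematically, this skeleton can be written as follows (a minimal runnable sketch; `task_adapt`, `meta_step`, and the toy quadratic tasks are illustrative stand-ins, and a Reptile-style $\Delta_t$ is used only for concreteness):

```python
# Generic structure of Algorithm 1: an inner task-adaptation loop and an outer
# loop that accumulates each task's contribution Delta_t to the meta parameters.
import numpy as np

def task_adapt(phi, task, num_steps=10, lr=0.1):
    """Inner loop: adapt task parameters starting from the meta parameters."""
    theta = phi.copy()
    for _ in range(num_steps):
        grad = task["A"] @ theta - task["b"]  # gradient of a toy quadratic loss
        theta -= lr * grad
    return theta

def meta_step(phi, tasks, meta_lr=0.5):
    """Outer loop: average the per-task contributions Delta_t, then update phi."""
    delta = np.zeros_like(phi)
    for task in tasks:
        theta_t = task_adapt(phi, task)
        delta += phi - theta_t                # Reptile-style Delta_t, for concreteness
    return phi - meta_lr * delta / len(tasks)

rng = np.random.default_rng(0)
tasks = [{"A": np.eye(2), "b": rng.normal(size=2)} for _ in range(8)]
phi = np.zeros(2)
for _ in range(100):
    phi = meta_step(phi, tasks)
print(phi)  # approaches the average of the per-task optima b_t
```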

MAML

  1. Task adaptation: minimize the training loss $\ell_t^{\text{train}}(\boldsymbol{\theta}_t) = \ell(\mathcal{D}_t^{\text{train}}; \boldsymbol{\theta}_t)$ by gradient descent w.r.t. the task parameters.

  2. Meta parameter update: by gradient descent on the validation loss $\ell_t^{\text{val}}(\boldsymbol{\theta}_t) = \ell(\mathcal{D}_t^{\text{val}}; \boldsymbol{\theta}_t)$, resulting in the meta update (gradient) for task $t$: $\Delta_t^{\text{MAML}} = \nabla_{\boldsymbol{\phi}}\ell_t^{\text{val}}(\boldsymbol{\theta}_t(\boldsymbol{\phi}))$.

This approach treats the task parameters as a function of the meta parameters, and hence requires back-propagation through the entire $L$-step task adaptation process. When $L$ is large, this becomes computationally prohibitive.
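A minimal JAX sketch of this trade-off on a toy 1-D regression problem (all names and values are illustrative): `jax.grad` differentiates through the unrolled inner loop, which is exactly the part that becomes prohibitive for large $L$.

```python
# MAML meta-gradient on a toy 1-D linear regression task: JAX back-propagates
# through the unrolled L-step inner loop to get Delta_t^MAML.
import jax
import jax.numpy as jnp

def loss(theta, x, y):
    return jnp.mean((x * theta - y) ** 2)

def adapt(phi, x_tr, y_tr, L=5, inner_lr=0.1):
    theta = phi
    for _ in range(L):                     # L-step task adaptation
        theta = theta - inner_lr * jax.grad(loss)(theta, x_tr, y_tr)
    return theta

def maml_objective(phi, x_tr, y_tr, x_val, y_val):
    theta_t = adapt(phi, x_tr, y_tr)       # theta_t is a function of phi
    return loss(theta_t, x_val, y_val)     # validation loss

# Delta_t^MAML: gradient of the validation loss w.r.t. the meta parameters,
# computed by differentiating through all L inner steps.
x = jnp.linspace(-1.0, 1.0, 8)
delta_t = jax.grad(maml_objective)(jnp.array(0.0), x, 2.0 * x, x, 2.0 * x)
print(delta_t)
```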

Reptile

Reptile optimizes $\boldsymbol{\theta}_t$ on the entire dataset $\mathcal{D}_t$ and moves $\boldsymbol{\phi}$ towards the adapted task parameters, yielding $\Delta_t^{\text{Reptile}} = \boldsymbol{\phi} - \boldsymbol{\theta}_t$.

iMAML

iMAML introduces an L2 regularizer $\frac{\lambda}{2}\|\boldsymbol{\theta}_t - \boldsymbol{\phi}\|^2$ to the training loss, and optimizes the task parameters on the regularized training loss.

Provided that this task adaptation process converges to a stationary point, implicit differentiation enables the computation of the meta gradient based only on the final solution of the adaptation process:

$$\Delta_t^{\mathrm{iMAML}} = \left(\mathbf{I} + \frac{1}{\lambda}\nabla^2_{\boldsymbol{\theta}_t}\ell_t^{\text{train}}(\boldsymbol{\theta}_t)\right)^{-1}\nabla_{\boldsymbol{\theta}_t}\ell_t^{\text{val}}(\boldsymbol{\theta}_t)$$
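A small sketch of this computation on a toy problem (shapes and values are illustrative; a practical implementation would approximate the inverse with conjugate gradient rather than a dense solve):

```python
# iMAML meta-gradient: (I + H/lambda)^{-1} grad_val, on a problem small enough
# to solve the linear system explicitly.
import jax
import jax.numpy as jnp

lam = 1.0

def train_loss(theta, x, y):
    return jnp.mean((x @ theta - y) ** 2)

def val_loss(theta, x, y):
    return jnp.mean((x @ theta - y) ** 2)

# Assume theta_t is (approximately) a stationary point of the regularized
# training loss; the meta gradient then needs only this final solution.
theta_t = jnp.array([1.0, -0.5])
x_tr, y_tr = jnp.ones((4, 2)), jnp.ones(4)
x_val, y_val = jnp.eye(2), jnp.zeros(2)

H = jax.hessian(train_loss)(theta_t, x_tr, y_tr)    # D x D Hessian of train loss
g_val = jax.grad(val_loss)(theta_t, x_val, y_val)
delta_imaml = jnp.linalg.solve(jnp.eye(2) + H / lam, g_val)
print(delta_imaml)
```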

Modular Bayesian Meta-Learning

  • Standard meta-learning: all parameters of a single base learner (a neural network) are updated at the same time.

    • inefficient and prone to overfitting

  • Split the network parameters into two groups:

    • a group varying across tasks

    • a group that is shared across tasks

In this paper, we assume that, in general, the network parameters can be partitioned into $M$ disjoint modules (some of which may turn out to be task independent):

$$\boldsymbol{\theta}_t = (\boldsymbol{\theta}_{t,1}, \boldsymbol{\theta}_{t,2}, \dots, \boldsymbol{\theta}_{t,m}, \dots, \boldsymbol{\theta}_{t,M})$$

where $\boldsymbol{\theta}_{t,m}$ denotes the parameters of module $m$ for task $t$.

Fig.1 Modular meta-learning

How to define modules? Modules can correspond to (see the sketch after this list):

  • Layers: $\boldsymbol{\theta}_{t,m}$ could be the weights of the $m$-th layer of a NN for task $t$ [this paper treats each layer as a module]

  • Receptive fields

  • the encoder and decoder in an auto-encoder

  • the heads in a multi-task learning model

  • any other grouping of interest
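For example, with layers as modules, the task parameters can be kept in a per-module container with one shrinkage scale per module (a minimal sketch; names and shapes are made up):

```python
# Task parameters partitioned into M disjoint modules, one per layer here.
# Each module m gets its own shrinkage parameter sigma2[m] in the model below.
import numpy as np

rng = np.random.default_rng(0)
theta_t = {
    "layer1": {"w": rng.normal(size=(16, 8)), "b": np.zeros(8)},
    "layer2": {"w": rng.normal(size=(8, 1)),  "b": np.zeros(1)},
}
sigma2 = {"layer1": 1e-4, "layer2": 1.0}  # small sigma2 => nearly task independent
```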

Hierarchical Bayesian Model

Bayesian Shrinkage Graphical Model

(with a factored probability density):

$$p(\boldsymbol{\theta}_{1:T}, \mathcal{D} \mid \boldsymbol{\sigma}^2, \boldsymbol{\phi}) = \prod_{t=1}^{T}\prod_{m=1}^{M}\mathcal{N}(\boldsymbol{\theta}_{t,m} \mid \boldsymbol{\phi}_m, \sigma_m^2\mathbf{I})\prod_{t=1}^{T}p(\mathcal{D}_t \mid \boldsymbol{\theta}_t)$$

  • $\boldsymbol{\phi}$: shared meta parameters, the initialization of the NN parameters $\boldsymbol{\theta}_t$ for each task

  • $\boldsymbol{\theta}_{t,m}$ is conditionally independent of the parameters of all other tasks given the "central" parameters; namely, $\boldsymbol{\theta}_{t,m} \sim \mathcal{N}(\boldsymbol{\phi}_m, \sigma_m^2\mathbf{I})$ with mean $\boldsymbol{\phi}_m$ and variance $\sigma_m^2$ ($\mathbf{I}$ is the identity matrix)

  • $\boldsymbol{\sigma}^2$: shrinkage parameters.

  • $\sigma_m^2$: the $m$-th module's scalar shrinkage parameter, which measures the degree to which $\boldsymbol{\theta}_{t,m}$ can deviate from $\boldsymbol{\phi}_m$. If $\sigma_m \approx 0$, then $\boldsymbol{\theta}_{t,m} \approx \boldsymbol{\phi}_m$; as $\sigma_m$ shrinks to 0, the parameters of module $m$ become task independent.

  • Meta parameters: $\boldsymbol{\Phi} = (\boldsymbol{\sigma}^2, \boldsymbol{\phi})$

For values of $\sigma_m^2$ near zero, the difference between the parameters $\boldsymbol{\theta}_{t,m}$ and the mean $\boldsymbol{\phi}_m$ is shrunk to zero, and thus module $m$ becomes task independent (a small sketch of this prior term follows the list below).

  • Thus, by learning $\boldsymbol{\sigma}^2$, we can discover which modules are task independent.

  • These independent modules can be re-used at meta-test time

    • reducing the computational burden of adaptation

    • and likely improving generalization
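Concretely, the shrinkage prior contributes a per-module quadratic penalty to the negative log of the joint density above. A minimal sketch of that term (dropping the $2\pi$ constants; the container layout follows the earlier sketch):

```python
# Negative log of prod_m N(theta_{t,m} | phi_m, sigma2_m I), up to constants:
# sum_m [ ||theta_{t,m} - phi_m||^2 / (2 sigma2_m) + (D_m / 2) log sigma2_m ].
import numpy as np

def neg_log_prior(theta_t, phi, sigma2):
    total = 0.0
    for m in theta_t:  # modules, e.g. layers
        diff2 = sum(np.sum((theta_t[m][k] - phi[m][k]) ** 2) for k in theta_t[m])
        dim = sum(theta_t[m][k].size for k in theta_t[m])
        total += diff2 / (2.0 * sigma2[m]) + 0.5 * dim * np.log(sigma2[m])
    return total

phi = {"layer1": {"w": np.zeros((4, 2))}, "layer2": {"w": np.zeros((2, 1))}}
theta_t = {m: {k: v + 0.1 for k, v in phi[m].items()} for m in phi}
print(neg_log_prior(theta_t, phi, {"layer1": 1e-2, "layer2": 1.0}))
```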

Meta-Learning as Parameter Estimation

Goal: estimate the parameters $\boldsymbol{\phi}$ and $\boldsymbol{\sigma}^2$.

Standard solution: Maximize the marginal likelihood (intractable)

$$p(\mathcal{D} \mid \boldsymbol{\phi}, \boldsymbol{\sigma}^2) = \int p(\mathcal{D} \mid \boldsymbol{\theta})\,p(\boldsymbol{\theta} \mid \boldsymbol{\phi}, \boldsymbol{\sigma}^2)\,\mathrm{d}\boldsymbol{\theta}$$

Approximation: two principled alternative approaches for parameter estimation, based on maximizing the predictive likelihood over the validation subsets (query sets).

Approximation Method 1: Parameter estimation via the predictive likelihood

  • (Relation with MAML and iMAML)

Our goal is to minimize the average negative predictive log-likelihood over $T$ validation tasks:

$$\ell_{\mathrm{PLL}}(\boldsymbol{\sigma}^2, \boldsymbol{\phi}) = -\frac{1}{T}\sum_{t=1}^{T}\log p(\mathcal{D}_t^{\text{val}} \mid \mathcal{D}_t^{\text{train}}, \boldsymbol{\sigma}^2, \boldsymbol{\phi}) \qquad (2)$$

  • Note: there might be a typo (a missing minus sign) in equation (2) in the paper.

Justification:

Based on two assumptions:

  • The training and validation data (support and query sets) are distributed i.i.d. according to some distribution $\nu(\mathcal{D}_t^{\text{train}}, \mathcal{D}_t^{\text{val}})$

  • The law of large numbers.

As $T \rightarrow \infty$:

$$\ell_{\mathrm{PLL}}(\boldsymbol{\sigma}^2, \boldsymbol{\phi}) \rightarrow \mathbb{E}_{\nu(\mathcal{D}_t^{\text{train}})}\left[\mathrm{KL}\left(\nu(\mathcal{D}_t^{\text{val}} \mid \mathcal{D}_t^{\text{train}})\,\big\|\,p(\mathcal{D}_t^{\text{val}} \mid \mathcal{D}_t^{\text{train}}, \boldsymbol{\sigma}^2, \boldsymbol{\phi})\right)\right] + \mathrm{H}\left(\nu(\mathcal{D}_t^{\text{val}} \mid \mathcal{D}_t^{\text{train}})\right)$$

  • $\nu(\mathcal{D}_t^{\text{val}} \mid \mathcal{D}_t^{\text{train}})$: the true distribution

  • $p(\mathcal{D}_t^{\text{val}} \mid \mathcal{D}_t^{\text{train}}, \boldsymbol{\sigma}^2, \boldsymbol{\phi})$: the predictive distribution

Thus, minimizing $\ell_{\mathrm{PLL}}$ w.r.t. the meta parameters corresponds to selecting the predictive distribution that is, on average, approximately closest (in KL divergence) to the true predictive distribution.

Proof sketch: by the law of large numbers, the empirical average $-\frac{1}{T}\sum_t \log p(\mathcal{D}_t^{\text{val}} \mid \mathcal{D}_t^{\text{train}}, \boldsymbol{\sigma}^2, \boldsymbol{\phi})$ converges to $-\mathbb{E}_\nu\left[\log p(\mathcal{D}_t^{\text{val}} \mid \mathcal{D}_t^{\text{train}}, \boldsymbol{\sigma}^2, \boldsymbol{\phi})\right]$; adding and subtracting $\mathbb{E}_\nu\left[\log \nu(\mathcal{D}_t^{\text{val}} \mid \mathcal{D}_t^{\text{train}})\right]$ splits this expectation into the KL term and the entropy term.

Note: $\nu(\mathcal{D}_t^{\text{train}})$?

New problem: the computation of $\ell_{\mathrm{PLL}}$ is not feasible due to the intractable integral in equation (2).

So, we use a simple maximum a posteriori (MAP) approximation of the task parameters:

Where is this from?

$$p(\mathcal{D}_t^{\text{val}} \mid \mathcal{D}_t^{\text{train}}, \boldsymbol{\sigma}^2, \boldsymbol{\phi}) \approx p(\mathcal{D}_t^{\text{val}} \mid \hat{\boldsymbol{\theta}}_t(\boldsymbol{\sigma}^2, \boldsymbol{\phi})) \qquad (3)$$

For a specific individual task $t$:

$$\hat{\boldsymbol{\theta}}_t(\boldsymbol{\sigma}^2, \boldsymbol{\phi}) = \underset{\boldsymbol{\theta}_t}{\operatorname{argmin}}\;\ell_t^{\text{train}}(\boldsymbol{\theta}_t, \boldsymbol{\sigma}^2, \boldsymbol{\phi}), \quad \ell_t^{\text{train}} := -\log p(\mathcal{D}_t^{\text{train}} \mid \boldsymbol{\theta}_t) - \log p(\boldsymbol{\theta}_t \mid \boldsymbol{\sigma}^2, \boldsymbol{\phi}) \qquad (4)$$

After obtaining the MAP estimates, we can approximate $\ell_{\mathrm{PLL}}(\boldsymbol{\sigma}^2, \boldsymbol{\phi})$ as follows:

$$\ell_{\mathrm{PLL}}(\boldsymbol{\sigma}^2, \boldsymbol{\phi}) \approx \frac{1}{T}\sum_{t=1}^{T}\ell_t^{\text{val}}(\hat{\boldsymbol{\theta}}_t(\boldsymbol{\sigma}^2, \boldsymbol{\phi})), \quad \text{where } \ell_t^{\text{val}}(\hat{\boldsymbol{\theta}}_t) := -\log p(\mathcal{D}_t^{\text{val}} \mid \hat{\boldsymbol{\theta}}_t) \qquad (5)$$

  • Individual task adaptation follows from equation (4)

  • Meta updating follows from minimizing equation (5)

Minimizing equation (5) is thus a bi-level optimization problem, which requires solving equation (4) in the inner loop.
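A minimal JAX sketch of the inner problem in equation (4) on a toy task (the Gaussian likelihood, module names, shapes, and plain gradient descent are all illustrative choices):

```python
# MAP task adaptation (equation 4): minimize the negative log-likelihood plus
# the per-module shrinkage penalty ||theta_m - phi_m||^2 / (2 sigma2_m).
import jax
import jax.numpy as jnp

def map_objective(theta, phi, sigma2, x, y):
    nll = jnp.mean((x @ theta["w"] + theta["b"] - y) ** 2)  # toy Gaussian likelihood
    penalty = sum(jnp.sum((theta[m] - phi[m]) ** 2) / (2.0 * sigma2[m]) for m in theta)
    return nll + penalty

phi = {"w": jnp.zeros(3), "b": jnp.zeros(())}
sigma2 = {"w": 1.0, "b": 0.1}   # the smaller sigma2 ties "b" more tightly to phi["b"]
theta, x, y = phi, jnp.ones((5, 3)), jnp.ones(5)
for _ in range(200):            # adapt as long as needed; the prior curbs overfitting
    g = jax.grad(map_objective)(theta, phi, sigma2, x, y)
    theta = jax.tree_util.tree_map(lambda p, gp: p - 0.05 * gp, theta, g)
print(theta)
```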

How to solve this problem?

Compute the gradient of the approximate predictive log-likelihood $\ell_t^{\text{val}}(\hat{\boldsymbol{\theta}}_t)$ w.r.t. the meta parameters $\boldsymbol{\Phi} = (\boldsymbol{\sigma}^2, \boldsymbol{\phi})$:

$$\nabla_{\boldsymbol{\Phi}}\ell_t^{\text{val}}(\hat{\boldsymbol{\theta}}_t(\boldsymbol{\Phi})) = \nabla_{\hat{\boldsymbol{\theta}}_t}\ell_t^{\text{val}}(\hat{\boldsymbol{\theta}}_t)\,\nabla_{\boldsymbol{\Phi}}\hat{\boldsymbol{\theta}}_t(\boldsymbol{\Phi}) \qquad (\text{Eq.}*)$$

  • $\nabla_{\hat{\boldsymbol{\theta}}_t}\ell_t^{\text{val}}(\hat{\boldsymbol{\theta}}_t)$ is straightforward.

  • $\nabla_{\boldsymbol{\Phi}}\hat{\boldsymbol{\theta}}_t(\boldsymbol{\Phi})$ is not.

Two methods for computing $\nabla_{\boldsymbol{\Phi}}\hat{\boldsymbol{\theta}}_t(\boldsymbol{\Phi})$:

1) Explicit computation of $\nabla_{\boldsymbol{\Phi}}\hat{\boldsymbol{\theta}}_t(\boldsymbol{\Phi})$.

(Relation with MAML)

(6)

If optimizing $\ell_t^{\text{train}}$ requires only a small number of local gradient steps, we can compute the update for the meta parameters $(\boldsymbol{\phi}, \boldsymbol{\sigma}^2)$ by back-propagation through $\hat{\boldsymbol{\theta}}_t$, yielding (6).

This update reduces to that of MAML if $\sigma_m^2 \rightarrow \infty$ for all modules and is thus denoted σ-MAML.

Why? (my explanation)

According to equation (4):

$$\ell_t^{\text{train}} := -\log p(\mathcal{D}_t^{\text{train}} \mid \boldsymbol{\theta}_t) - \log p(\boldsymbol{\theta}_t \mid \boldsymbol{\sigma}^2, \boldsymbol{\phi})$$

$$p(\boldsymbol{\theta}_t \mid \boldsymbol{\sigma}^2, \boldsymbol{\phi}) = \mathcal{N}(\boldsymbol{\theta}_t \mid \boldsymbol{\phi}, \boldsymbol{\sigma}^2), \quad \text{e.g., in one dimension } \mathcal{N}(\theta_t \mid \phi, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(\theta_t - \phi)^2}{2\sigma^2}}$$
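Concretely, the gradient of this prior term w.r.t. the task parameters of module $m$ is

$$\nabla_{\boldsymbol{\theta}_{t,m}}\left[-\log \mathcal{N}(\boldsymbol{\theta}_{t,m} \mid \boldsymbol{\phi}_m, \sigma_m^2\mathbf{I})\right] = \frac{\boldsymbol{\theta}_{t,m} - \boldsymbol{\phi}_m}{\sigma_m^2} \rightarrow \mathbf{0} \quad \text{as } \sigma_m^2 \rightarrow \infty,$$

so in this limit the inner objective reduces to $-\log p(\mathcal{D}_t^{\text{train}} \mid \boldsymbol{\theta}_t)$, which is exactly MAML's task adaptation loss.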

2) Implicit computation of $\nabla_{\boldsymbol{\Phi}}\hat{\boldsymbol{\theta}}_t(\boldsymbol{\Phi})$.

(Relation with iMAML)

  • We are more interested in long adaptation horizons.

  • Implicit Function Theorem (compute the gradient of $\hat{\boldsymbol{\theta}}_t$ w.r.t. $\boldsymbol{\phi}$ and $\boldsymbol{\sigma}^2$)

Approximation Method 2: Estimate $\boldsymbol{\phi}$ via MAP approximation

  • Relation with Reptile

Integrating out the central parameters $\boldsymbol{\phi}$, and considering that $\boldsymbol{\phi}$ depends on all of the training data, we can rewrite the predictive likelihood in terms of the joint posterior over $(\boldsymbol{\theta}_{1:T}, \boldsymbol{\phi})$, i.e.

This is likewise intractable, so we again use a MAP approximation, now for both the task parameters and the central meta parameters:

(Assuming a flat, non-informative prior on $\boldsymbol{\phi}$, i.e., effectively no prior at all, the second term above is dropped.)
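Under this flat prior, the MAP estimate of $\boldsymbol{\phi}$ has a simple closed form (a quick derivation; presumably this is the content of equation (9)): setting $\nabla_{\boldsymbol{\phi}_m} \sum_t \frac{1}{2\sigma_m^2}\|\hat{\boldsymbol{\theta}}_{t,m} - \boldsymbol{\phi}_m\|^2 = \sum_t \frac{\boldsymbol{\phi}_m - \hat{\boldsymbol{\theta}}_{t,m}}{\sigma_m^2} = 0$ gives the per-module average

$$\hat{\boldsymbol{\phi}}_m = \frac{1}{T}\sum_{t=1}^{T}\hat{\boldsymbol{\theta}}_{t,m}.$$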

By plugging the MAP estimate of $\boldsymbol{\phi}$ (equation 9) into equation (8), and scaling by $1/T$, we derive the approximate predictive log-likelihood as:

Compare this to equation (5).

The meta update for $\boldsymbol{\phi}$ can be obtained by differentiating (9) w.r.t. $\boldsymbol{\phi}$.

To derive the gradient of Eq. (42) with respect to $\boldsymbol{\sigma}^2$, notice that when $\boldsymbol{\phi}$ is estimated as the MAP on the training subsets of all tasks, it becomes a function of $\boldsymbol{\sigma}^2$.

For a specific individual task $t$:

$$\nabla_{\boldsymbol{\sigma}^2}\ell_t^{\text{val}}(\hat{\boldsymbol{\theta}}_t(\boldsymbol{\sigma}^2)) = \nabla_{\hat{\boldsymbol{\theta}}_t}\ell_t^{\text{val}}(\hat{\boldsymbol{\theta}}_t)\,\nabla_{\boldsymbol{\sigma}^2}\hat{\boldsymbol{\theta}}_t(\boldsymbol{\sigma}^2) \qquad (\text{Eq.}**)$$

We make an approximation and ignore the dependence of $\hat{\boldsymbol{\phi}}$ on $\boldsymbol{\sigma}^2$. Then $\boldsymbol{\phi}$ becomes a constant when computing $\nabla_{\boldsymbol{\sigma}^2}\hat{\boldsymbol{\theta}}_t(\boldsymbol{\sigma}^2)$, and the iMAML-style derivation above applies with $\boldsymbol{\Phi}$ replaced by $\boldsymbol{\sigma}^2$, giving the implicit gradient in Eq. (10).

Derivation:

Rewrite equation (4) as:

$$\hat{\boldsymbol{\theta}}_t(\boldsymbol{\sigma}^2, \boldsymbol{\phi}) = \underset{\boldsymbol{\theta}_t}{\operatorname{argmin}}\;\ell_t^{\text{train}}(\boldsymbol{\theta}_t, \boldsymbol{\sigma}^2, \boldsymbol{\phi})$$

where $\ell_t^{\text{train}} := -\log p(\mathcal{D}_t^{\text{train}} \mid \boldsymbol{\theta}_t) - \log p(\boldsymbol{\theta}_t \mid \boldsymbol{\sigma}^2, \boldsymbol{\phi})$

So $\hat{\boldsymbol{\theta}}_t(\boldsymbol{\sigma}^2, \boldsymbol{\phi})$ is a stationary point of the function $\ell_t^{\text{train}}(\boldsymbol{\theta}_t, \boldsymbol{\sigma}^2, \boldsymbol{\phi})$.

Based on the implicit function theorem,

$$\nabla_{\boldsymbol{\Phi}}\hat{\boldsymbol{\theta}}_t(\boldsymbol{\Phi}) = -\left(\nabla^2_{\boldsymbol{\theta}_t,\boldsymbol{\theta}_t}\ell_t^{\text{train}}\right)^{-1}\nabla^2_{\boldsymbol{\theta}_t,\boldsymbol{\Phi}}\ell_t^{\text{train}}$$

Plugging the equation above into $(\text{Eq.}**)$ completes the derivation.
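A toy JAX check of this identity (illustrative quadratic objective; `jax.hessian` and `jax.jacfwd` supply the two second-derivative blocks):

```python
# Implicit gradient of the inner solution w.r.t. phi via the implicit function
# theorem: d theta_hat / d phi = -(H_theta_theta)^{-1} H_theta_phi.
import jax
import jax.numpy as jnp

def train_loss(theta, phi, sigma2=0.5):
    data_fit = jnp.sum((theta - jnp.array([1.0, 2.0])) ** 2)  # toy likelihood term
    prior = jnp.sum((theta - phi) ** 2) / (2.0 * sigma2)      # shrinkage term
    return data_fit + prior

phi = jnp.array([0.0, 0.0])
# Closed-form stationary point of this quadratic inner objective:
theta_hat = (2.0 * jnp.array([1.0, 2.0]) + phi / 0.5) / (2.0 + 1.0 / 0.5)

H_tt = jax.hessian(train_loss, argnums=0)(theta_hat, phi)  # d^2 l / d theta^2
H_tp = jax.jacfwd(jax.grad(train_loss, argnums=0), argnums=1)(theta_hat, phi)
dtheta_dphi = -jnp.linalg.solve(H_tt, H_tp)
print(dtheta_dphi)  # here: (1/sigma2) / (2 + 1/sigma2) * I = 0.5 * I
```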

Reptile is a special case of this method when $\sigma_m^2 \rightarrow \infty$ and we choose a learning rate proportional to $\sigma_m^2$ for $\boldsymbol{\phi}_m$. We thus refer to it as σ-Reptile.

$$\Delta_t^{\text{Reptile}} = \boldsymbol{\phi} - \boldsymbol{\theta}_t$$
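One way to verify this (a quick check using the Gaussian prior term above): the only $\boldsymbol{\phi}$-dependent part of the training objectives is $\sum_t \frac{1}{2\sigma_m^2}\|\hat{\boldsymbol{\theta}}_{t,m} - \boldsymbol{\phi}_m\|^2$, whose gradient w.r.t. $\boldsymbol{\phi}_m$ is $\sum_t (\boldsymbol{\phi}_m - \hat{\boldsymbol{\theta}}_{t,m})/\sigma_m^2$. A gradient step with learning rate $\alpha\sigma_m^2/T$ therefore gives

$$\boldsymbol{\phi}_m \leftarrow \boldsymbol{\phi}_m - \frac{\alpha}{T}\sum_{t=1}^{T}\left(\boldsymbol{\phi}_m - \hat{\boldsymbol{\theta}}_{t,m}\right),$$

i.e., the Reptile update averaged over tasks, with the $\sigma_m^2$ factors cancelling.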

Results:

Module discovery in image classification

Module discovery in text-to-speech

Conclusion

This work proposes a hierarchical Bayesian model for meta-learning that places a shrinkage prior on each module to allow learning the extent to which each module should adapt, without a limitation on the adaptation horizon.

Our formulation includes MAML, Reptile, and iMAML as special cases, empirically discovers a small set of task-specific modules in various domains, and shows promising improvement in a practical TTS application with low data and long task adaptation.

As a general modular meta-learning framework, it allows many interesting extensions, including incorporating alternative Bayesian inference algorithms, modular structure learning, and learn-to-optimize methods.

Adaptation may not generalize to a task with fundamentally different characteristics from the training distribution. Applying our method to a new task without examining task similarity runs the risk of transferring inductive bias from meta-training to an out-of-sample task.

Think about the connections among these papers (probabilistic graphical model explanation):

Reference

Y. Chen, A. L. Friesen, F. Behbahani, A. Doucet, D. Budden, M. W. Hoffman, N. de Freitas. Modular Meta-Learning with Shrinkage. NeurIPS 2020. arXiv:1909.05557.
