Meta-Learning of Neural Architectures for Few-Shot Learning

CVPR 2020 8-22-2020

Motivation

Few-shot learning is typically done with a fixed neural architecture. This paper proposes MetaNAS, the first method which fully integrates NAS with gradient-based meta learning.

MetaNAS allows adapting architectures to novel tasks from only a few data points with just a few steps of a gradient-based task optimizer. MetaNAS can therefore generate task-specific architectures that are adapted to each task separately, starting from a jointly meta-learned meta-architecture.

Marrying Gradient-based Meta Learning and Gradient-based NAS

  • $\alpha_{meta}$ : meta-learned architecture

  • $w_{meta}$ : corresponding meta-learned weights for the architecture

  • Task $\mathcal{T}_i$ : $(\mathcal{D}_i^{tr}, \mathcal{D}_i^{test})$

Meta-objective:

$$
\begin{aligned}
\min_{\alpha, w} \mathcal{L}_{meta}(\alpha, w, p^{train}, \Phi^k)
&= \min_{\alpha, w} \sum_{\mathcal{T}_i \sim p^{train}} \mathcal{L}_i\left(\Phi^k(\alpha, w, \mathcal{D}_i^{tr}), \mathcal{D}_i^{test}\right) \\
&= \min_{\alpha, w} \sum_{\mathcal{T}_i \sim p^{train}} \mathcal{L}_i\left((\alpha_{\mathcal{T}_i}^{*}, w_{\mathcal{T}_i}^{*}), \mathcal{D}_i^{test}\right)
\end{aligned}
$$

where $\alpha_{\mathcal{T}_i}^{*}, w_{\mathcal{T}_i}^{*} = \Phi^k(\alpha, w, \mathcal{D}_i^{tr}) = \operatorname{argmin}_{\alpha, w} \hat{\mathcal{L}}_i(\alpha, w, \mathcal{D}_i^{tr})$ are the task-specific architecture and weights after $k$ gradient steps of the task optimizer, which can be approximated with SGD.

  • $\mathcal{L}_i$ : query loss for task $i$

  • $\hat{\mathcal{L}}_i$ : support loss for task $i$

Inner loop: update $\alpha$ and $w$ with architecture learning rate $\xi_{task}$ and weight learning rate $\lambda_{task}$:

$$
\begin{aligned}
\left(\begin{array}{c} \alpha^{j+1} \\ w^{j+1} \end{array}\right)
&= \Phi\left(\alpha^{j}, w^{j}, \mathcal{D}_{i}^{tr}\right) \\
&= \left(\begin{array}{c}
\alpha^{j} - \xi_{task} \nabla_{\alpha} \mathcal{L}_{\mathcal{T}_i}\left(\alpha^{j}, w^{j}, \mathcal{D}_{i}^{tr}\right) \\
w^{j} - \lambda_{task} \nabla_{w} \mathcal{L}_{\mathcal{T}_i}\left(\alpha^{j}, w^{j}, \mathcal{D}_{i}^{tr}\right)
\end{array}\right)
\end{aligned}
$$
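A rough sketch of what this inner loop might look like in PyTorch (not the official MetaNAS code; `model_fn`, `loss_fn`, and the parameter lists are hypothetical placeholders):

```python
import torch

def adapt_to_task(meta_alpha, meta_w, support_batches, model_fn, loss_fn,
                  xi_task=0.01, lambda_task=0.1, k=5):
    """Task optimizer Phi^k: k gradient steps on the support loss, updating both
    the architecture parameters (alpha) and the weights (w)."""
    # Copy the meta-parameters so they are not modified in place.
    alpha = [a.clone().detach().requires_grad_(True) for a in meta_alpha]
    w = [p.clone().detach().requires_grad_(True) for p in meta_w]

    for _ in range(k):
        for x, y in support_batches:
            loss = loss_fn(model_fn(x, alpha, w), y)              # support loss
            g_alpha = torch.autograd.grad(loss, alpha, retain_graph=True)
            g_w = torch.autograd.grad(loss, w)
            with torch.no_grad():
                for a, g in zip(alpha, g_alpha):
                    a -= xi_task * g                              # architecture step
                for p, g in zip(w, g_w):
                    p -= lambda_task * g                          # weight step
    return alpha, w                                               # (alpha*_Ti, w*_Ti)
```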

Outer loop update:

$$
\begin{aligned}
\left(\begin{array}{c} \alpha_{meta}^{i+1} \\ w_{meta}^{i+1} \end{array}\right)
&= \Psi^{MAML}\left(\alpha_{meta}^{i}, w_{meta}^{i}, p^{train}, \Phi^{k}\right) \\
&= \left(\begin{array}{c}
\alpha_{meta}^{i} - \xi_{meta} \nabla_{\alpha} \mathcal{L}_{meta}\left(\alpha_{meta}^{i}, w_{meta}^{i}, p^{train}, \Phi^{k}\right) \\
w_{meta}^{i} - \lambda_{meta} \nabla_{w} \mathcal{L}_{meta}\left(\alpha_{meta}^{i}, w_{meta}^{i}, p^{train}, \Phi^{k}\right)
\end{array}\right)
\end{aligned}
$$
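A hedged sketch of a first-order MAML-style outer step under the same assumptions as above (second-order terms through $\Phi^k$ are dropped for simplicity; `adapt_fn` is the inner-loop sketch from before, and the task objects are hypothetical):

```python
import torch

def maml_outer_step(meta_alpha, meta_w, tasks, adapt_fn, model_fn, loss_fn,
                    xi_meta=1e-3, lambda_meta=1e-3):
    """First-order MAML-style meta-update: evaluate the query loss at the
    task-adapted parameters and apply its gradient to the meta-parameters."""
    alpha_grads = [torch.zeros_like(a) for a in meta_alpha]
    w_grads = [torch.zeros_like(p) for p in meta_w]

    for task in tasks:
        alpha_t, w_t = adapt_fn(meta_alpha, meta_w, task.support, model_fn, loss_fn)
        x, y = task.query
        loss = loss_fn(model_fn(x, alpha_t, w_t), y)              # query loss L_i
        g_alpha = torch.autograd.grad(loss, alpha_t, retain_graph=True)
        g_w = torch.autograd.grad(loss, w_t)
        for acc, g in zip(alpha_grads, g_alpha):
            acc += g
        for acc, g in zip(w_grads, g_w):
            acc += g

    with torch.no_grad():
        for a, g in zip(meta_alpha, alpha_grads):
            a -= xi_meta * g
        for p, g in zip(meta_w, w_grads):
            p -= lambda_meta * g
```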

Reptile can also be used instead of MAML here:

$$
\begin{aligned}
\left(\begin{array}{c} \alpha_{meta}^{i+1} \\ w_{meta}^{i+1} \end{array}\right)
&= \Psi^{Reptile}\left(\alpha_{meta}^{i}, w_{meta}^{i}, p^{train}, \Phi^{k}\right) \\
&= \left(\begin{array}{c}
\alpha_{meta}^{i} + \xi_{meta} \sum_{\mathcal{T}_i} \left(\alpha_{\mathcal{T}_i}^{*} - \alpha_{meta}^{i}\right) \\
w_{meta}^{i} + \lambda_{meta} \sum_{\mathcal{T}_i} \left(w_{\mathcal{T}_i}^{*} - w_{meta}^{i}\right)
\end{array}\right)
\end{aligned}
$$
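The Reptile variant avoids query-loss gradients altogether; a minimal sketch, again with hypothetical task objects and the `adapt_fn` from above:

```python
import torch

def reptile_outer_step(meta_alpha, meta_w, tasks, adapt_fn, model_fn, loss_fn,
                       xi_meta=0.5, lambda_meta=0.5):
    """Reptile-style meta-update: move the meta-parameters toward the
    task-adapted parameters, summed over the sampled tasks."""
    alpha_dirs = [torch.zeros_like(a) for a in meta_alpha]
    w_dirs = [torch.zeros_like(p) for p in meta_w]

    for task in tasks:
        alpha_t, w_t = adapt_fn(meta_alpha, meta_w, task.support, model_fn, loss_fn)
        for d, a_task, a_meta in zip(alpha_dirs, alpha_t, meta_alpha):
            d += (a_task - a_meta).detach()                       # alpha*_Ti - alpha_meta
        for d, w_task, w_meta in zip(w_dirs, w_t, meta_w):
            d += (w_task - w_meta).detach()                       # w*_Ti - w_meta

    with torch.no_grad():
        for a_meta, d in zip(meta_alpha, alpha_dirs):
            a_meta += xi_meta * d
        for w_meta, d in zip(meta_w, w_dirs):
            w_meta += lambda_meta * d
```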

Task-dependent Architecture Adaptation

The paper introduces two modifications to the DARTS-style mixture operations so that, after task adaptation, the architecture parameters are already close to a discrete (one-hot) choice. Hard-pruning the adapted architecture then changes the network only marginally, which removes the need for retraining.
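As a toy illustration of the "close to one-hot" idea (not the paper's exact formulation), the snippet below shows how a low-temperature softmax over an edge's architecture parameters concentrates the mixture on a single candidate operation, so discretizing afterwards barely changes the output; `alpha_edge` and `op_outputs` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def soft_pruned_mixture(alpha_edge, op_outputs, temperature=0.1):
    """Toy illustration: a low-temperature softmax over one edge's architecture
    parameters puts almost all weight on a single candidate operation, so
    hard-pruning the mixture afterwards barely changes the edge's output."""
    mix = F.softmax(alpha_edge / temperature, dim=-1)             # nearly one-hot
    soft_out = sum(m * o for m, o in zip(mix, op_outputs))        # one-shot mixture
    hard_out = op_outputs[int(torch.argmax(alpha_edge))]          # pruned architecture
    return soft_out, hard_out
```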

References:


  • Elsken et al., "Meta-Learning of Neural Architectures for Few-Shot Learning", CVPR 2020: https://openaccess.thecvf.com/content_CVPR_2020/papers/Elsken_Meta-Learning_of_Neural_Architectures_for_Few-Shot_Learning_CVPR_2020_paper.pdf
  • arXiv version: https://arxiv.org/pdf/1911.11090.pdf