Meta-Learning of Neural Architectures for Few-Shot Learning

CVPR 2020 8-22-2020

Motivation

Few-shot learning is typically done with a fixed neural architecture. This paper proposes MetaNAS, the first method to fully integrate neural architecture search (NAS) with gradient-based meta-learning.

MetaNAS can adapt the architecture to a novel task from only a few data points, using just a few steps of a gradient-based task optimizer. Each task thus receives its own task-specific architecture, adapted separately but starting from a jointly meta-learned meta-architecture.

Marrying Gradient-based Meta Learning and Gradient-based NAS

  • $\alpha_{meta}$ : meta-learned architecture

  • $w_{meta}$ : corresponding meta-learned weights for the architecture

  • Task $\mathcal{T}_i$ : $(\mathcal{D}_i^{tr}, \mathcal{D}_i^{test})$

Meta-objective:

$$
\begin{aligned}
& \min_{\alpha, w} \mathcal{L}_{meta}\left(\alpha, w, p^{train}, \Phi^k\right) \\
=\; & \min_{\alpha, w} \sum_{\mathcal{T}_i \sim p^{train}} \mathcal{L}_i\left(\Phi^k\left(\alpha, w, \mathcal{D}_i^{tr}\right), \mathcal{D}_i^{test}\right) \\
=\; & \min_{\alpha, w} \sum_{\mathcal{T}_i \sim p^{train}} \mathcal{L}_i\left(\left(\alpha_{\mathcal{T}_i}^{*}, w_{\mathcal{T}_i}^{*}\right), \mathcal{D}_i^{test}\right)
\end{aligned}
$$

where $\alpha_{\mathcal{T}_i}^{*}, w_{\mathcal{T}_i}^{*} = \Phi^k(\alpha, w, \mathcal{D}_i^{tr}) = \operatorname{argmin}_{\alpha, w} \hat{\mathcal{L}}_i(\alpha, w, \mathcal{D}_i^{tr})$ are the task-specific architecture and weights after $k$ gradient steps; the inner argmin can be approximated with SGD.

  • $\mathcal{L}_i$ : query loss for task $i$

  • $\hat{\mathcal{L}}_i$ : support loss for task $i$

Inner-loop update of $\alpha$ and $w$ with architecture learning rate $\xi_{task}$ and weight learning rate $\lambda_{task}$ :

$$
\begin{aligned}
\begin{pmatrix} \alpha^{j+1} \\ w^{j+1} \end{pmatrix}
&= \Phi\left(\alpha^{j}, w^{j}, \mathcal{D}_i^{tr}\right) \\
&= \begin{pmatrix}
\alpha^{j} - \xi_{task} \nabla_{\alpha} \hat{\mathcal{L}}_i\left(\alpha^{j}, w^{j}, \mathcal{D}_i^{tr}\right) \\
w^{j} - \lambda_{task} \nabla_{w} \hat{\mathcal{L}}_i\left(\alpha^{j}, w^{j}, \mathcal{D}_i^{tr}\right)
\end{pmatrix}
\end{aligned}
$$
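The inner loop is just joint SGD on the architecture and weight parameters. A minimal numpy sketch, not the paper's code: the quadratic support loss, learning rates, and per-task optima below are illustrative stand-ins for the DARTS-style task loss.

```python
import numpy as np

# Toy support loss: a quadratic in (alpha, w) standing in for the task loss
# L_i(alpha, w, D_i^tr). The per-task optima (a_star, w_star) are hypothetical.
def support_loss_grad(alpha, w, task):
    a_star, w_star = task
    grad_alpha = 2.0 * (alpha - a_star)
    grad_w = 2.0 * (w - w_star)
    return grad_alpha, grad_w

def inner_loop(alpha, w, task, k=5, xi_task=0.1, lam_task=0.1):
    """Phi^k: k joint SGD steps on architecture alpha and weights w."""
    for _ in range(k):
        g_a, g_w = support_loss_grad(alpha, w, task)
        alpha = alpha - xi_task * g_a   # architecture step (lr xi_task)
        w = w - lam_task * g_w          # weight step (lr lambda_task)
    return alpha, w

alpha_meta, w_meta = np.zeros(3), np.zeros(3)
task = (np.ones(3), -np.ones(3))        # illustrative (a_star, w_star)
alpha_t, w_t = inner_loop(alpha_meta, w_meta, task)
```

After $k$ steps, $(\alpha_{\mathcal{T}_i}^{*}, w_{\mathcal{T}_i}^{*})$ have moved part of the way from the meta-parameters toward the task optimum, which is exactly what the outer loop then exploits.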

Outer loop update (MAML):

$$
\begin{aligned}
\begin{pmatrix} \alpha_{meta}^{i+1} \\ w_{meta}^{i+1} \end{pmatrix}
&= \Psi^{MAML}\left(\alpha_{meta}^{i}, w_{meta}^{i}, p^{train}, \Phi^{k}\right) \\
&= \begin{pmatrix}
\alpha_{meta}^{i} - \xi_{meta} \nabla_{\alpha} \mathcal{L}_{meta}\left(\alpha_{meta}^{i}, w_{meta}^{i}, p^{train}, \Phi^{k}\right) \\
w_{meta}^{i} - \lambda_{meta} \nabla_{w} \mathcal{L}_{meta}\left(\alpha_{meta}^{i}, w_{meta}^{i}, p^{train}, \Phi^{k}\right)
\end{pmatrix}
\end{aligned}
$$
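The exact MAML gradient differentiates through all $k$ inner steps $\Phi^k$. A common cheaper variant is first-order MAML, which applies the query-loss gradient evaluated at the adapted $(\alpha^{*}, w^{*})$ directly to the meta-parameters. A toy numpy sketch under that first-order assumption (quadratic losses and all names here are illustrative, not the paper's implementation):

```python
import numpy as np

def adapt(alpha, w, support, k=5, xi_task=0.1, lam_task=0.1):
    """Phi^k: k joint SGD steps on a toy quadratic support loss."""
    a_star, w_star = support
    for _ in range(k):
        alpha = alpha - xi_task * 2.0 * (alpha - a_star)
        w = w - lam_task * 2.0 * (w - w_star)
    return alpha, w

def fomaml_outer_step(alpha_meta, w_meta, tasks, xi_meta=0.05, lam_meta=0.05):
    """One first-order Psi^MAML step: sum query gradients at adapted params."""
    g_alpha = np.zeros_like(alpha_meta)
    g_w = np.zeros_like(w_meta)
    for support, query in tasks:
        a_t, w_t = adapt(alpha_meta, w_meta, support)
        a_q, w_q = query  # query-set optima of the toy quadratic query loss
        g_alpha += 2.0 * (a_t - a_q)
        g_w += 2.0 * (w_t - w_q)
    return alpha_meta - xi_meta * g_alpha, w_meta - lam_meta * g_w

# One task whose support and query losses share the same optimum:
tasks = [((np.ones(2), np.zeros(2)), (np.ones(2), np.zeros(2)))]
alpha_meta, w_meta = np.zeros(2), np.zeros(2)
for _ in range(300):
    alpha_meta, w_meta = fomaml_outer_step(alpha_meta, w_meta, tasks)
```

The second-order version would additionally backpropagate through `adapt`, which is what makes full MAML memory-hungry for many inner steps.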

Reptile can also be used instead of MAML here:

$$
\begin{aligned}
\begin{pmatrix} \alpha_{meta}^{i+1} \\ w_{meta}^{i+1} \end{pmatrix}
&= \Psi^{Reptile}\left(\alpha_{meta}^{i}, w_{meta}^{i}, p^{train}, \Phi^{k}\right) \\
&= \begin{pmatrix}
\alpha_{meta}^{i} + \xi_{meta} \sum_{\mathcal{T}_i}\left(\alpha_{\mathcal{T}_i}^{*} - \alpha_{meta}^{i}\right) \\
w_{meta}^{i} + \lambda_{meta} \sum_{\mathcal{T}_i}\left(w_{\mathcal{T}_i}^{*} - w_{meta}^{i}\right)
\end{pmatrix}
\end{aligned}
$$
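The Reptile update needs no second-order terms at all: adapt on each task, then move the meta-parameters toward the adapted ones. A minimal numpy sketch, again with toy quadratic task losses and illustrative names rather than the paper's code:

```python
import numpy as np

def adapt(alpha, w, task, k=5, xi_task=0.1, lam_task=0.1):
    """Phi^k on a toy quadratic task loss with optima (a_star, w_star)."""
    a_star, w_star = task
    for _ in range(k):
        alpha = alpha - xi_task * 2.0 * (alpha - a_star)
        w = w - lam_task * 2.0 * (w - w_star)
    return alpha, w

def reptile_outer_step(alpha_meta, w_meta, tasks, xi_meta=0.5, lam_meta=0.5):
    """One Psi^Reptile step: move meta-params toward the adapted (alpha*, w*)."""
    d_alpha = np.zeros_like(alpha_meta)
    d_w = np.zeros_like(w_meta)
    for task in tasks:
        a_t, w_t = adapt(alpha_meta, w_meta, task)
        d_alpha += a_t - alpha_meta
        d_w += w_t - w_meta
    return alpha_meta + xi_meta * d_alpha, w_meta + lam_meta * d_w

# Two tasks with architecture optima at 1 and 3; the meta-architecture
# settles between them.
tasks = [(np.ones(2), np.zeros(2)), (np.full(2, 3.0), np.zeros(2))]
alpha_meta, w_meta = np.zeros(2), np.zeros(2)
for _ in range(50):
    alpha_meta, w_meta = reptile_outer_step(alpha_meta, w_meta, tasks)
```

With symmetric tasks the meta-parameters converge to a point equidistant from the task optima, i.e. an initialization from which each task is reachable in few adaptation steps.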

Task-dependent Architecture Adaptation

The paper makes two modifications to the DARTS-style search model so that the architecture parameters become (close to) sparse during task adaptation: both the soft mixture over candidate operations and the mixture over input nodes are pushed toward one-hot choices. The adapted architecture can then be hard-pruned with little loss in accuracy, removing the need to retrain after pruning.
