Probabilistic Neural Architecture Search

Motivation

Most existing NAS methods cannot be applied directly to large-scale problems because of their prohibitive computational complexity or high memory usage.

This paper proposes a probabilistic approach to NAS (PARSEC) that drastically reduces memory requirements while keeping computational cost comparable to state-of-the-art methods, making it possible to search directly over more complex architectures and larger datasets.

  • A memory-efficient sampling procedure in which a probability distribution over high-performing neural network architectures is learned.

  • Importantly, this framework allows the distribution over architectures learned on smaller problems to be transferred to larger ones, further reducing the computational cost.

Importance-weighted Monte Carlo empirical Bayes

  • $p(\boldsymbol{\alpha} \mid \boldsymbol{\pi})$: a prior over the choices of inputs and operations that define the cell; the hyper-parameters $\boldsymbol{\pi}$ are the probabilities of the different choices (see the sketch after this list).

  • $\boldsymbol{y}$: targets

  • $\mathbf{X}$: inputs

  • $\boldsymbol{v}$: network weights
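
The bullet on $p(\boldsymbol{\alpha} \mid \boldsymbol{\pi})$ can be made concrete with a minimal sketch. It assumes, purely for illustration, that the prior factorizes into independent categorical choices, one per decision site in the cell (e.g. which operation to place on an edge); the shapes and names (`num_choices`, `num_ops`, `sample_architectures`) are illustrative assumptions, not taken from the paper.

```python
import torch

# Illustrative only: assume the cell has `num_choices` decision sites and
# `num_ops` candidate operations per site, so pi is a (num_choices, num_ops)
# probability table, parameterized here by unnormalized logits.
num_choices, num_ops = 4, 8
logits = torch.zeros(num_choices, num_ops, requires_grad=True)

def sample_architectures(K):
    """Draw K architectures alpha_k ~ p(alpha | pi) and return their log-priors."""
    pi = torch.distributions.Categorical(logits=logits)  # one categorical per site
    alphas = pi.sample((K,))                              # (K, num_choices) operation indices
    log_prior = pi.log_prob(alphas).sum(dim=-1)           # log p(alpha_k | pi), shape (K,)
    return alphas, log_prior

alphas, log_prior = sample_architectures(K=8)
print(alphas.shape, log_prior.shape)  # torch.Size([8, 4]) torch.Size([8])
```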

Given the Monte Carlo estimator of the marginal likelihood, with architecture samples $\boldsymbol{\alpha}_k \sim p(\boldsymbol{\alpha} \mid \boldsymbol{\pi})$:

$$
p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi}) = \int p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}) \, p(\boldsymbol{\alpha} \mid \boldsymbol{\pi}) \, \mathrm{d}\boldsymbol{\alpha} \approx \frac{1}{K} \sum_{k=1}^{K} p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_k)
$$

From equation (7), differentiating the log marginal likelihood with respect to $\boldsymbol{v}$ and $\boldsymbol{\pi}$:

$$
\begin{aligned}
&\nabla_{\boldsymbol{v}, \boldsymbol{\pi}} \log p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi}) \\
&= \frac{1}{p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi})} \int \nabla_{\boldsymbol{v}, \boldsymbol{\pi}} \log p(\boldsymbol{y}, \boldsymbol{\alpha} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi}) \, p(\boldsymbol{y}, \boldsymbol{\alpha} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi}) \, \mathrm{d}\boldsymbol{\alpha} \\
&= \frac{1}{p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi})} {\color{red}\int} \nabla_{\boldsymbol{v}, \boldsymbol{\pi}} \log p(\boldsymbol{y}, \boldsymbol{\alpha} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi}) \, p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}) \, {\color{red}p(\boldsymbol{\alpha} \mid \boldsymbol{\pi}) \, \mathrm{d}\boldsymbol{\alpha}} \\
&\approx \frac{1}{p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi})} {\color{red}\frac{1}{K} \sum_{k=1}^{K}} \nabla_{\boldsymbol{v}, \boldsymbol{\pi}} \log p(\boldsymbol{y}, {\color{red}\boldsymbol{\alpha}_k} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi}) \, p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, {\color{red}\boldsymbol{\alpha}_k}) \\
&= \frac{1}{p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi})} \frac{1}{K} \sum_{k=1}^{K} \nabla_{\boldsymbol{v}, \boldsymbol{\pi}} \bigl(\log p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_k) + \log p(\boldsymbol{\alpha}_k \mid \boldsymbol{\pi})\bigr) \, p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_k)
\end{aligned}
$$

$$
= \frac{1}{p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi})} \frac{1}{K} \sum_{k=1}^{K} \Bigl[ \bigl(\nabla_{\boldsymbol{v}} \log p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_k)\bigr) \, p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_k) + \bigl(\nabla_{\boldsymbol{\pi}} \log p(\boldsymbol{\alpha}_k \mid \boldsymbol{\pi})\bigr) \, p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_k) \Bigr]
$$

$$
= \sum_{k=1}^{K} \frac{p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_k)}{K \, p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi})} \nabla_{\boldsymbol{v}} \log p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_k) + \sum_{k=1}^{K} \frac{p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_k)}{K \, p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi})} \nabla_{\boldsymbol{\pi}} \log p(\boldsymbol{\alpha}_k \mid \boldsymbol{\pi})
$$

Approximating the denominator with the same $K$ samples, $p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\pi}) \approx \frac{1}{K} \sum_{j=1}^{K} p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_j)$, the factors of $1/K$ cancel, and both sums are weighted by the normalized importance weights $\omega_k = p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_k) \big/ \sum_{j=1}^{K} p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha}_j)$, which gives the method its name; a sketch of this estimator is given below.
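
The following is a minimal PyTorch-style sketch of that importance-weighted gradient step, not the authors' implementation. `sample_architectures` is the hypothetical sampler from the sketch above, and `log_likelihood(alpha)` is a hypothetical function returning $\log p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{v}, \boldsymbol{\alpha})$ for one sampled architecture.

```python
import torch

def parsec_step(sample_architectures, log_likelihood, K=8):
    """One step of the importance-weighted gradient estimator derived above.

    sample_architectures(K) -> (alphas, log_prior) with log_prior[k] = log p(alpha_k | pi)
    log_likelihood(alpha)   -> scalar tensor log p(y | X, v, alpha), differentiable w.r.t. v
    """
    alphas, log_prior = sample_architectures(K)                  # alpha_k ~ p(alpha | pi)
    log_lik = torch.stack([log_likelihood(a) for a in alphas])   # log p(y | X, v, alpha_k), shape (K,)

    # Normalized importance weights omega_k = p(y|X,v,alpha_k) / sum_j p(y|X,v,alpha_j),
    # computed stably in log-space; detached because the weights scale the gradients
    # but are not differentiated through.
    weights = torch.softmax(log_lik, dim=0).detach()

    # Surrogate objective whose gradient matches the last line of the derivation:
    #   grad_v  : sum_k omega_k * grad_v  log p(y | X, v, alpha_k)
    #   grad_pi : sum_k omega_k * grad_pi log p(alpha_k | pi)
    surrogate = (weights * (log_lik + log_prior)).sum()
    (-surrogate).backward()  # ascend the log marginal likelihood by descending its negative
```

Computing the weights with `torch.softmax` over the per-sample log-likelihoods is exactly the normalization $\omega_k$ above, done in a numerically stable way.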

Reference

Probabilistic Neural Architecture Search, arXiv:1902.05116: https://arxiv.org/pdf/1902.05116.pdf