DARTS: Differentiable Architecture Search
ICLR 2019
This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner.
The best existing architecture search algorithms are computationally demanding despite their remarkable performance.
Conventional approaches search over a discrete, non-differentiable space; obtaining a state-of-the-art architecture for CIFAR-10 and ImageNet costs:
Evolution: 3150 GPU days
Reinforcement learning: 2000 GPU days
Method in the paper:
based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent.
a novel algorithm for differentiable network architecture search based on bilevel optimization, which is applicable to both convolutional and recurrent architectures.
remarkable efficiency improvement (reducing the cost of architecture discovery to a few GPU days)
Cell: A cell is a directed acyclic graph consisting of an ordered sequence of N nodes. For example, [0, 1, 2] in Fig. 1.
Note: candidate operations on each edge (between two nodes) include:
convolution
max pooling
zero operation: a lack of connection between two nodes
Assumption: the cell has two input nodes and a single output node.
input nodes:
For convolutional cells: the input nodes are defined as the cell outputs in the previous two layers.
Example: in cell [0, 1, 2], nodes 0 and 1 are the input nodes and node 2 is the output node.
For recurrent cells: input nodes are the input at the current step and the state carried from the previous step.
output:
the output of the cell is obtained by applying a reduction operation (e.g., concatenation) to all the intermediate nodes.
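To make the DAG structure concrete, here is a minimal PyTorch sketch of a cell (all names, e.g. `Cell` and `n_inter`, are our own; for brevity every edge carries a single fixed convolution instead of the mixed operation introduced further below): the two input nodes seed the state list, each intermediate node sums transformed outputs from all its predecessors, and the cell output concatenates the intermediate nodes.

```python
import torch
import torch.nn as nn

class Cell(nn.Module):
    """Minimal DARTS-style cell: a DAG over two input nodes and
    `n_inter` intermediate nodes; the cell output concatenates the
    intermediate nodes (the paper's reduction operation)."""

    def __init__(self, channels: int, n_inter: int = 4):
        super().__init__()
        self.n_inter = n_inter
        # One candidate op per edge for brevity; DARTS mixes several
        # candidates per edge (see the MixedOp sketch further below).
        self.edges = nn.ModuleDict()
        for j in range(n_inter):
            for i in range(2 + j):        # all predecessors of node 2 + j
                self.edges[f"{i}->{2 + j}"] = nn.Conv2d(
                    channels, channels, 3, padding=1, bias=False)

    def forward(self, s0: torch.Tensor, s1: torch.Tensor) -> torch.Tensor:
        states = [s0, s1]                 # the two input nodes
        for j in range(self.n_inter):
            # each intermediate node sums its transformed predecessors
            states.append(sum(self.edges[f"{i}->{2 + j}"](states[i])
                              for i in range(len(states))))
        # reduction: concatenate all intermediate nodes along channels
        return torch.cat(states[2:], dim=1)

x = torch.randn(1, 8, 16, 16)
cell = Cell(channels=8)
print(cell(x, x).shape)   # torch.Size([1, 32, 16, 16])
```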
Due to the expensive inner optimization, evaluating the architecture gradient exactly can be prohibitive. A simple approximation scheme was proposed: approximate $w^*(\alpha)$ with a single training step,

$$\nabla_\alpha \mathcal{L}_{val}(w^*(\alpha), \alpha) \approx \nabla_\alpha \mathcal{L}_{val}\big(w - \xi \nabla_w \mathcal{L}_{train}(w, \alpha),\ \alpha\big), \tag{6}$$

where $w$ denotes the current weights and $\xi$ is the learning rate of this virtual gradient step. Applying the chain rule to Eq. (6) gives

$$\nabla_\alpha \mathcal{L}_{val}(w', \alpha) - \xi\, \nabla^2_{\alpha, w} \mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha), \tag{7}$$

where $w' = w - \xi \nabla_w \mathcal{L}_{train}(w, \alpha)$. The second term involves an expensive matrix-vector product. Fortunately, the complexity can be substantially reduced using the finite difference approximation (sketched in code below):

$$\nabla^2_{\alpha, w} \mathcal{L}_{train}(w, \alpha)\, \nabla_{w'} \mathcal{L}_{val}(w', \alpha) \approx \frac{\nabla_\alpha \mathcal{L}_{train}(w^+, \alpha) - \nabla_\alpha \mathcal{L}_{train}(w^-, \alpha)}{2\epsilon}, \tag{8}$$

where $\epsilon$ is a small scalar and $w^\pm = w \pm \epsilon \nabla_{w'} \mathcal{L}_{val}(w', \alpha)$.
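A minimal sketch of the finite-difference trick in Eq. (8), assuming PyTorch; the function name `hessian_vector_product`, the argument `loss_train_fn`, and the toy scalar example are our own illustrations, not the paper's code:

```python
import torch

def hessian_vector_product(loss_train_fn, w, alpha, v, eps_scale=0.01):
    """Finite-difference approximation of
    grad^2_{alpha,w} L_train(w, alpha) @ v   (Eq. 8).

    loss_train_fn(w, alpha) -> scalar training loss;
    w, alpha: parameter tensors; v: the vector grad_{w'} L_val(w', alpha).
    """
    eps = eps_scale / v.norm()            # paper uses 0.01 / ||v||_2

    def grad_alpha(w_pert):
        loss = loss_train_fn(w_pert, alpha)
        return torch.autograd.grad(loss, alpha)[0]

    g_plus = grad_alpha(w + eps * v)      # grad_alpha L_train(w+, alpha)
    g_minus = grad_alpha(w - eps * v)     # grad_alpha L_train(w-, alpha)
    return (g_plus - g_minus) / (2 * eps)

# toy usage with a scalar "network": L_train = (w * alpha)^2
w = torch.tensor(1.0)
alpha = torch.tensor(0.5, requires_grad=True)
v = torch.tensor(0.3)                     # stands in for grad_w' L_val
loss_fn = lambda w_, a_: (w_ * a_) ** 2
print(hessian_vector_product(loss_fn, w, alpha, v))
```

For this toy loss the exact product is $4 w \alpha v = 0.6$, which the finite difference recovers.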
"A cell is a directed acyclic graph consisting of an ordered sequence of N nodes"
How is the ordered sequence decided in practice?
The nodes are ordered randomly at first.
How are Eqs. (7) and (8) derived?
Please see the Approximation section above.
"To form each node in the discrete architecture, we retain the top-k strongest operations (from distinct nodes) among all non-zero candidate operations collected from all the previous nodes." Is k a hyperparameter here?
Yes. This is a hyperparameter (the paper uses k = 2 for convolutional cells and k = 1 for recurrent cells).
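A sketch of this discretization rule under our own naming (`OPS`, `discretize`, and the toy α are illustrative, not the official implementation): an edge's strength is the largest softmax weight among its non-zero candidate operations, and the k strongest incoming edges are kept.

```python
import torch

# hypothetical candidate operation names; index 0 is the `zero` op
OPS = ["zero", "max_pool_3x3", "sep_conv_3x3", "sep_conv_5x5"]

def discretize(alpha, node, k=2):
    """Pick the k strongest incoming edges for `node` (from distinct
    predecessors), where an edge's strength is the largest softmax
    weight among its non-zero candidate ops. `alpha` maps each edge
    (i, j) to a vector of logits over OPS. Sketch, not official code."""
    candidates = []
    for (i, j), logits in alpha.items():
        if j != node:
            continue
        weights = torch.softmax(logits, dim=0)
        best = weights[1:].argmax().item() + 1     # skip the zero op
        candidates.append((weights[best].item(), i, OPS[best]))
    # retain the top-k strongest edges into this node
    return sorted(candidates, reverse=True)[:k]

alpha = {(i, 2): torch.randn(len(OPS)) for i in range(2)}
print(discretize(alpha, node=2))
```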
Node: each node $x^{(i)}$ is a latent representation (e.g., a feature map in a CNN).
Edge: each directed edge $(i, j)$ is associated with some operation $o^{(i,j)}$ that transforms $x^{(i)}$; every intermediate node is computed from all of its predecessors, $x^{(j)} = \sum_{i < j} o^{(i,j)}\big(x^{(i)}\big)$.
To make the search space continuous, DARTS relaxes the categorical choice of a particular operation to a softmax over all candidate operations, reducing architecture search to learning a set of continuous mixing variables:

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\, o(x),$$

where $\alpha^{(i,j)}$ is a vector of dimension $|\mathcal{O}|$ containing the mixing weights between nodes $i$ and $j$ over the different operations.
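A minimal PyTorch sketch of such a mixed operation on one edge (the class name `MixedOp` and the three-op candidate set are illustrative; DARTS uses a larger operation set):

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """Continuous relaxation of one edge: the edge output is a
    softmax-weighted sum of every candidate operation."""

    def __init__(self, channels: int):
        super().__init__()
        # a small illustrative candidate set; DARTS uses more ops
        self.ops = nn.ModuleList([
            nn.Identity(),                           # skip connection
            nn.MaxPool2d(3, stride=1, padding=1),    # max pooling
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        ])

    def forward(self, x: torch.Tensor, alpha_edge: torch.Tensor) -> torch.Tensor:
        # alpha_edge: logits of dimension |O| for this edge (i, j)
        weights = torch.softmax(alpha_edge, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

x = torch.randn(1, 8, 16, 16)
alpha_edge = torch.zeros(3, requires_grad=True)   # architecture variables
op = MixedOp(channels=8)
print(op(x, alpha_edge).shape)   # torch.Size([1, 8, 16, 16])
```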
A bilevel optimization problem arises because we want to jointly optimize both the network weights $w$ and the architecture representation $\alpha$:

$$\min_\alpha \; \mathcal{L}_{val}\big(w^*(\alpha), \alpha\big) \quad \text{s.t.} \quad w^*(\alpha) = \operatorname*{arg\,min}_w \; \mathcal{L}_{train}(w, \alpha),$$

where $\alpha$ is the upper-level variable and $w$ is the lower-level variable.
In Eq. (8), $\epsilon$ is a small scalar; the paper uses $\epsilon = 0.01 / \|\nabla_{w'} \mathcal{L}_{val}(w', \alpha)\|_2$.
Is $\xi$ a hyperparameter? Effectively yes: $\xi = 0$ yields the first-order approximation, and the paper sets $\xi$ to the learning rate of $w$'s optimizer for the second-order one (see the sketch below).
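Putting the pieces together, here is a sketch of one alternating step of the search (Alg. 1 in the paper) in its first-order form ($\xi = 0$), assuming PyTorch; `model`, `w_opt`, `a_opt`, and `loss_fn` are hypothetical names:

```python
import torch

def darts_search_step(model, alpha, w_opt, a_opt,
                      train_batch, val_batch, loss_fn):
    """One alternating step of the bilevel problem (first-order
    approximation, i.e. xi = 0): update alpha on validation data,
    then update w on training data. `model(x, alpha)` is assumed
    to apply the softmax-relaxed architecture."""
    # Step 1: architecture update  min_alpha L_val(w, alpha)
    x_val, y_val = val_batch
    a_opt.zero_grad()
    loss_fn(model(x_val, alpha), y_val).backward()
    a_opt.step()

    # Step 2: weight update  min_w L_train(w, alpha)
    x_tr, y_tr = train_batch
    w_opt.zero_grad()
    loss_fn(model(x_tr, alpha), y_tr).backward()
    w_opt.step()
```

The second-order variant replaces the plain validation gradient in step 1 with Eq. (7), using the finite-difference sketch above for the Hessian-vector product.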