Network Architecture Search for Domain Adaptation
arXiv, 10-2-2020 (last updated)
While existing domain adaptation models typically learn a feature mapping from one domain to another or derive a joint representation across domains, they have limited capacity to derive an optimal neural architecture tailored for domain transfer.
To efficiently devise a neural architecture across different data domains, the authors propose a novel learning task called NASDA (Neural Architecture Search for Domain Adaptation).
The ultimate goal of NASDA is to minimize the validation loss of the target domain. We postulate that a solution to NASDA should not only minimize validation loss of the source domain, but should also reduce the domain gap between the source and target.
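In equation form, a sketch of this objective under the notation below (the paper's exact symbols may differ slightly):

$$\min_{\alpha}\; L_{val}^{S}\big(w^{*}(\alpha), \alpha\big) \;+\; \lambda\, d(D_S, D_T) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w}\; L_{train}^{S}(w, \alpha)$$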
where L_val^S is the validation loss on the source domain and d(D_S, D_T) denotes the domain discrepancy between the source and target.
Note that in unsupervised domain adaptation, the training and validation losses on the target domain cannot be computed directly due to the lack of labels in the target domain.
The algorithm comprises two training phases, as shown in the figure above. The first is the neural architecture search phase, which aims to derive an optimal neural architecture (α*), following the learning schema described above. The second phase aims to learn a good feature generator with a task-specific loss, based on the α* derived from the first phase.
Inspired by gradient-based hyperparameter optimization, we treat the architecture parameters α as a special type of hyperparameter. This yields a bilevel optimization problem with α as the upper-level variable and w as the lower-level variable. In practice, we use MK-MMD (multiple-kernel maximum mean discrepancy) to evaluate the domain discrepancy. The optimization can be summarized as follows:
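A sketch of the resulting bilevel problem, assuming the discrepancy term is the MK-MMD computed on the network's source and target features (the exact arguments in the paper may differ):

$$\min_{\alpha}\; L_{val}^{S}\big(w^{*}(\alpha), \alpha\big) \;+\; \lambda\, d_{\text{MK-MMD}}\big(D_S, D_T;\, w^{*}(\alpha), \alpha\big) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w}\; L_{train}^{S}(w, \alpha)$$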
where λ is the trade-off hyperparameter between the source validation loss and the MK-MMD loss.
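As an illustration, a minimal (biased) multi-kernel MMD estimate between batches of source and target features could be computed as in the PyTorch sketch below; the function name and the RBF bandwidths `gammas` are illustrative choices, not the paper's.

```python
import torch

def mk_mmd(source_feats, target_feats, gammas=(0.5, 1.0, 2.0)):
    """Biased multi-kernel (sum of Gaussian RBF kernels) MMD estimate.

    source_feats: (n, d) tensor of source features; target_feats: (m, d)
    tensor of target features; gammas are illustrative kernel bandwidths.
    """
    def rbf(x, y):
        # Pairwise squared Euclidean distances, summed over the RBF kernels.
        d2 = torch.cdist(x, y) ** 2
        return sum(torch.exp(-g * d2) for g in gammas)

    k_ss = rbf(source_feats, source_feats).mean()
    k_tt = rbf(target_feats, target_feats).mean()
    k_st = rbf(source_feats, target_feats).mean()
    # MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]
    return k_ss + k_tt - 2.0 * k_st
```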
Through the neural architecture search described above, we have derived the optimal cell structure (α*) for domain adaptation. We then stack the searched cells to build our feature generator G. Assume C consists of N independent classifiers C_1, ..., C_N, and denote p_i(y|x) as the K-way probabilistic output of C_i, where K is the number of categories.
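For concreteness, a hypothetical sketch of such a classifier bank, assuming simple linear K-way heads on top of the features produced by G (the paper's classifiers may be deeper):

```python
import torch.nn as nn

class ClassifierBank(nn.Module):
    """N independent K-way classifier heads over a shared feature generator."""
    def __init__(self, feat_dim, num_classes, num_classifiers):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_classifiers)]
        )

    def forward(self, features):
        # One K-way logit vector per classifier C_i.
        return [head(features) for head in self.heads]
```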
The high-level intuition is to consolidate the feature generator G so that it drives the diversified classifiers in C to produce similar outputs. To this end, our training process consists of three steps (a minimal training-loop sketch follows the list):
(1) train G and C on the source domain to obtain task-specific features,
(2) fix G and train C so that the classifiers produce diversified (mutually discrepant) outputs on the target domain,
(3) fix C and train G to minimize the output discrepancy among the classifiers in C.
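A minimal sketch of one iteration of this schedule, assuming a PyTorch feature generator `G`, a list `classifiers`, and separate optimizers `opt_g` / `opt_c` (all names hypothetical); the per-step losses follow the MCD recipe cited below rather than the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def discrepancy(probs):
    # Mean absolute difference between all pairs of classifier outputs
    # (the L1 discrepancy used in the MCD reference below).
    total, pairs = 0.0, 0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            total = total + (probs[i] - probs[j]).abs().mean()
            pairs += 1
    return total / max(pairs, 1)

def train_iteration(G, classifiers, opt_g, opt_c, x_s, y_s, x_t):
    """One iteration over a labeled source batch (x_s, y_s) and an unlabeled
    target batch x_t."""
    # Step 1: train G and all classifiers on labeled source data.
    opt_g.zero_grad(); opt_c.zero_grad()
    feat_s = G(x_s)
    sum(F.cross_entropy(C(feat_s), y_s) for C in classifiers).backward()
    opt_g.step(); opt_c.step()

    # Step 2: fix G, train the classifiers to diversify (maximize their
    # discrepancy) on target data while staying accurate on the source.
    opt_c.zero_grad()
    feat_s, feat_t = G(x_s).detach(), G(x_t).detach()
    probs_t = [F.softmax(C(feat_t), dim=1) for C in classifiers]
    loss_c = sum(F.cross_entropy(C(feat_s), y_s) for C in classifiers)
    (loss_c - discrepancy(probs_t)).backward()
    opt_c.step()

    # Step 3: fix the classifiers, train G to minimize the output discrepancy.
    opt_g.zero_grad()
    probs_t = [F.softmax(C(G(x_t)), dim=1) for C in classifiers]
    discrepancy(probs_t).backward()
    opt_g.step()
```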
Reference: Saito et al., "Maximum Classifier Discrepancy for Unsupervised Domain Adaptation", CVPR 2018.