2020
DOI: 10.1609/aaai.v34i04.6143

Efficient Neural Architecture Search via Proximal Iterations

Abstract: Neural architecture search (NAS) attracts much research attention because of its ability to identify better architectures than handcrafted ones. Recently, differentiable search methods have become the state of the art in NAS, as they can obtain high-performance architectures within several days. However, they still suffer from huge computation costs and inferior performance due to the construction of the supernet. In this paper, we propose an efficient NAS method based on proximal iterations (denoted as NASP). Different…
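For orientation only, here is a minimal sketch, in plain NumPy with illustrative names (not the authors' released implementation), of the kind of proximal step the abstract alludes to: before each update the architecture weights are projected onto a one-hot constraint set, so only one candidate operation per edge stays active.

```python
import numpy as np

def prox_one_hot(alpha):
    """Proximal projection onto the one-hot constraint set: for each edge
    (row of alpha), mark the highest-scoring operation as selected (1.0)
    and zero out the remaining entries."""
    one_hot = np.zeros_like(alpha)
    one_hot[np.arange(alpha.shape[0]), alpha.argmax(axis=1)] = 1.0
    return one_hot

def search_step(alpha, grad_fn, lr=0.01):
    """One simplified search iteration. `grad_fn` is a hypothetical callback
    returning the gradient of the validation loss w.r.t. the architecture
    weights when the given discrete architecture is active."""
    discrete = prox_one_hot(alpha)   # only one operation per edge is evaluated
    grad = grad_fn(discrete)         # gradients come from the discrete architecture
    return alpha - lr * grad         # the continuous copy keeps being optimised
```

The point of the projection is that the forward/backward pass never has to run all candidate operations at once, which is where the claimed efficiency gain over a fully mixed supernet comes from.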

Cited by 81 publications (61 citation statements)
References 13 publications (26 reference statements)
“…proposed DARTS to use search parameters together with a super network, which allows searching with gradient descent. Gradient-based methods (Cai et al, 2018b; Xie et al, 2018; Xu et al, 2019; Yao et al, 2020) attract researchers' attention since they are computationally efficient and easy to implement. We base our method on DARTS and take one step further to reduce the memory consumption of training the super network.…”
Section: Related Work (mentioning)
confidence: 99%
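As a rough illustration of the super-network idea this excerpt refers to, below is a minimal PyTorch sketch (class and variable names are mine, not taken from any of the cited papers) of a DARTS-style mixed edge: every candidate operation is evaluated and blended with softmax weights, which is exactly why the whole supernet must be kept in memory during the search.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """DARTS-style continuous relaxation of one edge (illustrative sketch):
    all candidate operations run on every forward pass and are blended
    with softmax-normalised architecture weights."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)                      # e.g. conv, pooling, skip
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))   # learnable architecture weights

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)                       # relaxed, never exactly one-hot
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

For example, MixedOp([nn.Conv2d(16, 16, 3, padding=1), nn.MaxPool2d(3, stride=1, padding=1), nn.Identity()]) blends three candidates on a single edge; both the input and alpha receive gradients, so the architecture can be searched with ordinary gradient descent.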
“…That is because the relaxed θ cannot converge to a one-hot vector [Zela et al, 2019, Chu et al, 2020], thus removing those operations at the end of the search actually leads to an architecture different from the final search result. Moreover, the mixed strategy must maintain all operators in the whole supernet, which requires more computational resources than the one-hot vector [Yao et al, 2020].…”
Section: Search Algorithm (mentioning)
confidence: 99%
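To make the contrast in this excerpt concrete, here is a companion sketch (again with hypothetical names and deliberately simplified logic) of the one-hot alternative: only the operation currently selected by argmax is executed, so the remaining candidates never enter the forward pass; how the architecture weights are still updated despite the discrete choice (e.g. via a proximal step on a continuous copy) is omitted.

```python
import torch
import torch.nn as nn

class DiscreteEdge(nn.Module):
    """One-hot counterpart of a mixed edge (illustrative sketch): a single
    operation is active per edge, so the supernet does not have to run
    every candidate at once."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        self.alpha = nn.Parameter(torch.zeros(len(candidate_ops)))   # architecture weights

    def forward(self, x):
        k = int(self.alpha.argmax())   # one-hot selection: a single active op
        return self.ops[k](x)          # the update rule for alpha is not shown here
```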
“…Environment: Same GPU, same software version; 2. Settings: Batch size (160), init channel scale (24), training 50 epochs; 3. Implementation: Do not query the performance database.…”
Section: Performance Evaluation (mentioning)
confidence: 99%
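If one wanted to encode the comparison protocol quoted above as a configuration, it might look like the following sketch (key names are purely illustrative and not tied to any released codebase):

```python
# Hypothetical settings mirroring the quoted fair-comparison protocol.
fair_comparison = {
    "environment": {"gpu": "identical model", "software": "identical versions"},
    "batch_size": 160,              # same batch size for every method
    "init_channels": 24,            # same initial channel scale
    "epochs": 50,                   # same training budget during search
    "query_performance_db": False,  # results must come from actual training runs
}
```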
“…Some latest research proposed the non-magnitude-based network selection method [18]. Their method inevitably increases the time overhead, and we provide the performance comparison below:

Method | Params (M) | Test Error (%)
… | 2.8 | 2.85±0.02
PC-DARTS [21] | 3.6 | 2.57±0.07
NASP [24] | 3.3 | 2.83±0.09
GAEA+PC-DARTS [11] | 3.7 | 2.50±0.06
DARTS+PT [18] | 3.0 | 2.61±0.08
SDARTS-RS+PT [18] | 3.3 | 2.54±0.10
SGAS+PT [18] | 3.9 | 2.56±0.…

…the best accuracies ever obtained by DARTS are much higher than both the random search and the average performance of the search space, which suggests that the effectiveness of the magnitude may only last a short time during the training of DARTS.…”
Section: Performance Evaluation (mentioning)
confidence: 99%