2019
DOI: 10.48550/arxiv.1910.11858
Preprint
BANANAS: Bayesian Optimization with Neural Architectures for Neural Architecture Search

Abstract: Neural Architecture Search (NAS) has seen an explosion of research in the past few years. A variety of methods have been proposed to perform NAS, including reinforcement learning, Bayesian optimization with a Gaussian process model, evolutionary search, and gradient descent. In this work, we design a NAS algorithm that performs Bayesian optimization using a neural network model. We develop a path-based encoding scheme to featurize the neural architectures that are used to train the neural network model. This st…
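The truncated abstract mentions a path-based encoding scheme for featurizing architectures. The sketch below illustrates the general idea under stated assumptions (a cell given as an adjacency matrix plus per-node operation labels, a three-operation vocabulary, and illustrative names such as `path_encode`); it is not the paper's exact search space or code.

```python
from itertools import product

# Illustrative operation vocabulary for a NAS cell (an assumption, not the
# paper's exact search space).
OPS = ["conv3x3", "conv1x1", "maxpool3x3"]

def enumerate_paths(adjacency, ops):
    """List all input->output paths of a cell DAG as tuples of operations.

    adjacency[i][j] == 1 means an edge from node i to node j; node 0 is the
    input node and the last node is the output node. ops[i] is the operation
    at node i (None for the input/output nodes).
    """
    n = len(adjacency)
    paths = []

    def walk(node, so_far):
        if node == n - 1:                    # reached the output node
            paths.append(tuple(so_far))
            return
        for nxt in range(n):
            if adjacency[node][nxt]:
                label = [ops[nxt]] if ops[nxt] is not None else []
                walk(nxt, so_far + label)

    walk(0, [])
    return paths

def path_encode(adjacency, ops, max_len=3):
    """Binary feature vector: one bit per possible op-sequence up to max_len."""
    vocab = [()]
    for length in range(1, max_len + 1):
        vocab += list(product(OPS, repeat=length))
    present = set(enumerate_paths(adjacency, ops))
    return [1 if p in present else 0 for p in vocab]

# Tiny 4-node cell: input -> conv3x3 -> maxpool3x3 -> output, plus a skip edge.
adj = [[0, 1, 0, 1],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 0]]
ops = [None, "conv3x3", "maxpool3x3", None]
print(path_encode(adj, ops))
```

Because the vector only records which op-sequences occur on some input-to-output path, two architectures with the same set of paths map to the same encoding, which is what makes this a featurization rather than a lossless graph encoding.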

Cited by 39 publications (114 citation statements)
References 30 publications
Citation types: 3 supporting, 111 mentioning, 0 contrasting
“…It uses Gaussian Processes to learn the posterior distribution of the objective function, which is then used to construct an acquisition function to determine the next trial (Snoek et al., 2012). BO is widely used in NAS (White et al., 2019), deep learning hyperparameter tuning (Golovin et al., 2017; Shahriari et al., 2015), system optimization (Lagar-Cavilla et al., 2019; Dalibard et al., 2017), model selection (Malkomes et al., 2016), transfer learning (Ruder & Plank, 2017) and many more (Archetti & Candelieri, 2019; Srinivas et al., 2009; Hutter et al., 2011; Snoek et al., 2012; Wilson et al., 2016) for optimizing with limited computing and time budgets.…”
Section: NAS for Student Models (mentioning)
confidence: 99%
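As a concrete illustration of the loop described in the statement above, here is a minimal NumPy sketch, assuming a 1-D toy objective, a zero-mean Gaussian process with an RBF kernel written out explicitly, and expected improvement as the acquisition function; every name, constant, and the toy objective are illustrative, not taken from any of the cited works.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, ls=0.5):
    # Squared-exponential kernel between two 1-D point sets.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xq, noise=1e-6):
    """Posterior mean/std of a zero-mean GP with an RBF kernel at query points Xq."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xq)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = np.diag(rbf(Xq, Xq) - Ks.T @ Kinv @ Ks)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, std, best):
    """EI for minimization: expected improvement over the incumbent `best`."""
    z = (best - mu) / std
    cdf = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (best - mu) * cdf + std * pdf

objective = lambda x: np.sin(3 * x) + 0.1 * x ** 2    # toy objective to minimize
X = np.array([0.0, 2.0])                              # initial trials
y = objective(X)
grid = np.linspace(-2, 4, 200)                        # candidate trial points

for _ in range(10):                                   # BO loop
    mu, std = gp_posterior(X, y, grid)
    ei = expected_improvement(mu, std, y.min())
    x_next = grid[np.argmax(ei)]                      # next trial chosen by EI
    X, y = np.append(X, x_next), np.append(y, objective(x_next))

print("best x:", X[np.argmin(y)], "best f:", y.min())
```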
“…We then benchmarked the following HPO methods on the real, surrogate, and tabular benchmarks: random search (RS), Bayesian optimization (BO), and Hyperband (HB, [5]). BO is configured with a surrogate model that is either a Gaussian process (BO GP), an ensemble of feed-forward neural networks (NN, [51]), or a random forest (BO RF, [52]), and with an acquisition-function optimizer that is either Nelder-Mead/exhaustive search (* DF [53]) or random search (* RS). See Appendix A.3 for more details.…”
Section: Empirical Investigations (mentioning)
confidence: 99%
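The statement above separates the choice of surrogate model from the choice of acquisition-function optimizer. The sketch below shows only the random-search optimizer variant ("* RS"), with a made-up surrogate and a UCB-style acquisition standing in for whatever model the benchmark actually fits; the hyperparameter space, landscape, and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search_acq_opt(acquisition, sample_candidate, n_samples=500):
    """Optimize an acquisition function by random sampling: draw candidate
    configurations, score each one, and return the best-scoring candidate."""
    candidates = [sample_candidate() for _ in range(n_samples)]
    scores = [acquisition(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Hypothetical hyperparameter space: (learning_rate, num_layers).
def sample_candidate():
    return (10 ** rng.uniform(-4, -1), int(rng.integers(1, 8)))

def surrogate(cfg):
    """Stand-in surrogate: pretend these mean/std values come from a fitted model."""
    lr, layers = cfg
    mean = -(np.log10(lr) + 2) ** 2 - 0.1 * (layers - 4) ** 2   # toy landscape
    std = 0.3
    return mean, std

def ucb(cfg, beta=2.0):
    # Upper-confidence-bound acquisition (maximization).
    mean, std = surrogate(cfg)
    return mean + beta * std

best_cfg = random_search_acq_opt(ucb, sample_candidate)
print("next trial:", best_cfg)
```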
“…Neural architecture search (NAS) methods can be categorized along three dimensions (Elsken et al., 2019a): search space, search strategy, and performance estimation strategy. Focusing on search strategy, popular methods are given by Bayesian optimization (BO, e.g., Bergstra et al. 2013; Domhan et al. 2015; Mendoza et al. 2016; Kandasamy et al. 2018; White et al. 2019), evolutionary methods (e.g., Miller et al. 1989; Liu et al. 2017; Real et al. 2017; Elsken et al. 2019b), reinforcement learning (RL, e.g., Zoph and Le 2017), and gradient-based algorithms (e.g., Liu et al. 2019; Pham et al. 2018).…”
Section: Introduction (mentioning)
confidence: 99%
“…Within the BO framework, BANANAS (White et al., 2019) has emerged as one state-of-the-art algorithm (White et al., 2019; Siems et al., 2020; Guerrero-Viu et al., 2021; White et al., 2021). The two main components of BANANAS are a (truncated) path encoding, where architectures represented as directed acyclic graphs (DAGs) are encoded based on the possible paths through that graph, and an ensemble of feed-forward neural networks as surrogate model.…”
Section: Introduction (mentioning)
confidence: 99%
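To make the two components named in that statement concrete, here is a heavily simplified sketch: architectures are represented by (randomly generated) path encodings, and a bootstrap-trained ensemble provides a predicted mean and spread that drive an acquisition score. Ridge regressors are used in place of the paper's feed-forward neural networks purely to keep the example self-contained; the data, dimensions, and acquisition form are assumptions, not the BANANAS implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, y, lam=1e-2):
    """Closed-form ridge regression; stands in for one ensemble member
    (BANANAS uses small feed-forward networks instead)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ensemble_predict(members, X):
    """Mean/std across ensemble members; the spread is the uncertainty signal
    that an ensemble-based BO surrogate feeds into its acquisition function."""
    preds = np.stack([X @ w for w in members])
    return preds.mean(axis=0), preds.std(axis=0) + 1e-9

# Toy data: rows play the role of path encodings, targets of validation errors.
X_train = rng.integers(0, 2, size=(20, 40)).astype(float)
y_train = rng.uniform(5.0, 10.0, size=20)
X_pool = rng.integers(0, 2, size=(200, 40)).astype(float)   # candidate archs

# Bootstrap-trained ensemble: each member is fit on a different resample.
members = []
for _ in range(5):
    idx = rng.integers(0, len(X_train), len(X_train))
    members.append(fit_ridge(X_train[idx], y_train[idx]))

mu, std = ensemble_predict(members, X_pool)
acq = -(mu - std)        # optimistic (lower-bound) validation error, negated for argmax
next_arch = X_pool[int(np.argmax(acq))]
print("selected encoding (first 10 bits):", next_arch[:10])
```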