Automatic machine learning (AML) is a family of techniques to automate the process of training predictive models, aiming both to improve performance and to make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline such as model selection, hyperparameter tuning, and feature selection, relatively few have focused on automatic data augmentation: finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search the repository and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of the different system components and benchmark our feature selection algorithm on real-world datasets.
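The two components lend themselves to a compact illustration. Below is a minimal Python sketch under stated assumptions, not ARDA's actual implementation: `augment_and_select` and all parameter choices are hypothetical, pandas left-joins stand in for the system's join framework, and a noise-injection importance filter stands in for its feature selection algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def augment_and_select(base, candidates, key, target, n_noise=20, seed=0):
    """Left-join candidate tables onto `base`, then keep only joined
    features that outrank injected random-noise features in importance.
    (Illustrative sketch; not ARDA's actual algorithm.)"""
    df = base.copy()
    for cand in candidates:                      # join phase
        df = df.merge(cand, on=key, how="left")

    X = (df.drop(columns=[key, target])
           .select_dtypes(include=[np.number])   # keep numeric features only
           .fillna(0.0))                         # fill values missing after the join
    y = df[target]

    rng = np.random.default_rng(seed)
    noise_cols = [f"_noise_{i}" for i in range(n_noise)]
    for c in noise_cols:                         # inject pure-noise features
        X[c] = rng.standard_normal(len(X))

    model = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X, y)
    imp = pd.Series(model.feature_importances_, index=X.columns)
    cutoff = imp[noise_cols].max()               # best importance any noise feature achieved
    keep = [c for c in imp.index if c not in noise_cols and imp[c] > cutoff]
    return df[[key, target] + keep]
```

A caller would pass the input table, a list of candidate tables sharing a join key, and the key/target column names; the intuition is that a feature worth keeping should be more predictive than pure noise.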
We study the problem of testing whether a matrix A ∈ ℝ^{n×n} with bounded entries (‖A‖_∞ ≤ 1) is positive semi-definite (PSD), or ε-far in ℓ₂²-distance from the PSD cone, i.e., min_{B ⪰ 0} ‖A − B‖_F² > εn². Our main algorithmic contribution is a non-adaptive tester which distinguishes between these cases using only Õ(1/ε⁴) queries to the entries of A. For the related "ℓ_∞-gap problem", where A is either PSD or has an eigenvalue satisfying λ_i(A) < −εn, our algorithm requires only Õ(1/ε²) queries, which is optimal up to log(1/ε) factors. Our testers randomly sample a collection of principal submatrices and check whether these submatrices are PSD. Consequently, our algorithms achieve one-sided error: whenever they output that A is not PSD, they return a certificate that A has negative eigenvalues.

We complement our upper bound for PSD testing with ℓ₂²-gap by giving an Ω(1/ε²) lower bound for any non-adaptive algorithm. Our lower bound construction is general and can be used to derive lower bounds for a number of spectral testing problems. As an example of the applicability of our construction, we obtain a new Ω(1/ε⁴) sampling lower bound for testing the Schatten-1 norm with an εn^{1.5} gap, extending a result of Balcan, Li, Woodruff, and Zhang [BLWZ19]. In addition, our hard instance results in new sampling lower bounds for estimating the Ky Fan norm and the cost of rank-k approximations, i.e., ‖A − A_k‖_F² = Σ_{i>k} σ_i²(A). Throughout, Õ(·) hides log(1/ε) factors.
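To make the sampling scheme concrete, here is a hedged numpy sketch of a one-sided tester in the spirit described above. The submatrix size and trial count are illustrative placeholders, not the paper's carefully chosen schedule achieving the Õ(1/ε⁴) and Õ(1/ε²) query bounds. Rejection is always sound: if a principal submatrix A_S has a vector x with xᵀA_S x < 0, padding x with zeros witnesses that A itself is not PSD.

```python
import numpy as np

def psd_tester(A, eps, rng=None):
    """One-sided tester sketch: sample random principal submatrices and
    check each for a negative eigenvalue.  Returns (True, None) if no
    violation is found, or (False, idx), where A[np.ix_(idx, idx)] is a
    certificate that A is not PSD.  Parameters are illustrative only."""
    n = A.shape[0]
    rng = rng or np.random.default_rng()
    k = min(n, max(2, int(np.ceil(1.0 / eps))))    # illustrative submatrix size
    trials = int(np.ceil(10.0 / eps))              # illustrative trial count
    for _ in range(trials):
        idx = np.sort(rng.choice(n, size=k, replace=False))
        sub = A[np.ix_(idx, idx)]
        if np.linalg.eigvalsh(sub).min() < -1e-12:  # negative eigenvalue found
            return False, idx                       # one-sided: certified non-PSD
    return True, None                               # no violating submatrix observed
```

The delicate part, which the paper's analysis addresses, is choosing the distribution of submatrix sizes so that matrices far from the PSD cone are rejected with good probability within the stated query budget.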