Joseph K. Bradley scite author profile

Spark SQL is a new module in Apache Spark that integrates relational processing with Spark's functional programming API. Built on our experience with Shark, Spark SQL lets Spark programmers leverage the benefits of relational processing (e.g., declarative queries and optimized storage), and lets SQL users call complex analytics libraries in Spark (e.g., machine learning). Compared to previous systems, Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with procedural Spark code. Second, it includes a highly extensible optimizer, Catalyst, built using features of the Scala programming language, that makes it easy to add composable rules, control code generation, and define extension points. Using Catalyst, we have built a variety of features (e.g., schema inference for JSON, machine learning types, and query federation to external databases) tailored for the complex needs of modern data analysis. We see Spark SQL as an evolution of both SQL-on-Spark and of Spark itself, offering richer APIs and optimizations while keeping the benefits of the Spark programming model.

MLlib: Machine Learning in Apache Spark

Meng, Bradley, Yavuz et al. 2015

Preprint

Apache Spark is a popular open-source platform for large-scale data processing that is well-suited for iterative machine learning tasks. In this paper we present MLlib, Spark's open-source distributed machine learning library. MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives. Shipped with Spark, MLlib supports several languages and provides a high-level API that leverages Spark's rich ecosystem to simplify the development of end-to-end machine learning pipelines. MLlib has experienced a rapid growth due to its vibrant open-source community of over 140 contributors, and includes extensive documentation to support further growth and to let users quickly get up to speed.

Parallel Coordinate Descent for L1-Regularized Loss Minimization

Bradley, Kyrola, Bickson et al. 2011

Preprint

We propose Shotgun, a parallel coordinate descent algorithm for minimizing L1-regularized losses. Though coordinate descent seems inherently sequential, we prove convergence bounds for Shotgun which predict linear speedups, up to a problem-dependent limit. We present a comprehensive empirical study of Shotgun for Lasso and sparse logistic regression. Our theoretical predictions on the potential for parallelism closely match behavior on real data. Shotgun outperforms other published solvers on a range of large problems, proving to be one of the most scalable algorithms for L1.
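The idea of updating several coordinates at once can be illustrated with a small Lasso sketch. This is a simplified single-process simulation, not the paper's implementation: the function names, the objective scaling 0.5 * ||A x - y||^2 + lam * ||x||_1, the choice of distinct coordinates per round, and the default parameters are all assumptions made for illustration. Each round computes the updates for `num_parallel` coordinates from the same snapshot of the residual (as independent workers would) and then applies them together.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def shotgun_lasso(A, y, lam, num_parallel=2, rounds=100, seed=0):
    """Approximately minimize 0.5 * ||A x - y||^2 + lam * ||x||_1.

    A is a list of rows; parallelism is only simulated (hypothetical sketch).
    """
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    residual = list(y)  # residual = y - A x, and x starts at zero
    col_sq = [sum(A[i][j] ** 2 for i in range(n)) for j in range(d)]

    for _ in range(rounds):
        # Pick distinct coordinates for this round (an assumption of this sketch).
        chosen = rng.sample(range(d), min(num_parallel, d))

        # Compute every update from the same snapshot of x and the residual.
        deltas = {}
        for j in chosen:
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot reduce the loss
            rho = sum(A[i][j] * residual[i] for i in range(n)) + col_sq[j] * x[j]
            deltas[j] = soft_threshold(rho, lam) / col_sq[j] - x[j]

        # Apply all updates together, then refresh the residual.
        for j, delta in deltas.items():
            x[j] += delta
            for i in range(n):
                residual[i] -= A[i][j] * delta
    return x
```

With `num_parallel=1` this reduces to ordinary randomized coordinate descent. With more simultaneous updates, correlated columns can make the combined step overshoot, which is one way to picture the problem-dependent limit on speedup mentioned in the abstract.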

The SPRIGHT algorithm for robust sparse Hadamard Transforms

Xiao, Bradley, Pawar et al. 2014

In this paper, we consider the problem of computing a K-sparse N-point Hadamard Transform (HT) from noisy time domain samples, where K = O(N^α) scales sub-linearly in N for some α ∈ (0,1). The SParse Robust Iterative Graph based Hadamard Transform (SPRIGHT) algorithm is proposed to recover the sparse HT coefficients in a stable manner that is robust to additive Gaussian noise. In particular, it is shown that the K-sparse HT of the signal can be reconstructed from noisy time domain samples with a vanishing error probability using the same sample complexity O(K log N) as in the noiseless case of [1] and computational complexity O(N log N). Last but not least, given the complexity orders of the SPRIGHT algorithm, our numerical experiments further validate that the big-Oh constants in the complexity are small.
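To make the terms "N-point Hadamard Transform" and "K-sparse" concrete, here is a minimal sketch of the standard dense fast Walsh-Hadamard transform. It is not the SPRIGHT algorithm: it reads all N samples and assumes no noise. The function name `fwht` and the unnormalized convention are assumptions chosen for illustration.

```python
def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform in O(N log N) operations.

    len(x) must be a power of two. Applying fwht twice returns N times the input.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


# A spectrum with K = 2 nonzero coefficients out of N = 8 (illustrative values).
spectrum = [0, 0, 5, 0, 0, 0, -3, 0]
# Time-domain samples: the inverse transform is the forward transform divided by N.
samples = [v / len(spectrum) for v in fwht(spectrum)]
# The dense transform recovers the sparse spectrum from all N samples.
recovered = fwht(samples)
```

In this sketch every one of the N time-domain samples is read, whereas the abstract's claim is that a K-sparse spectrum can be recovered from only O(K log N) noisy samples.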

Addressing Challenges in Data Science

Bradley 2019
