Data Science Through the Looking Glass

Psallidas, Fotis; Zhu, Yiwen; Karlaš, Bojan; Henkel, Jordan; Interlandi, Matteo; Krishnan, Subru; Kroth, Brian; Venkatesh, Emani,; Wu, Wentao; Zhang, Ce; Weimer, Markus; Floratou, Avrilia; Curino, Carlo; Karanasos, Konstantinos

doi:10.1145/3552490.3552496

Cited by 16 publications

(9 citation statements)

References 11 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…(i.e., no deep neural networks). Traditional methods are the state-of-the-art over structured data [32], and it is still the more widely-used type of ML [25], [26]. Nevertheless, we did test the performance of a shallow neural network in Section 5.8.2.…”

Section: Background: ML Workflowmentioning

confidence: 99%

“…Manuscript received XXX (e.g., in [25] we found that pipelines can have up to hundreds of operators); 2) models are often trained once and served many times (e.g., rendering of web pages based on users' profiles, batch prediction of asset prices based on historical data), and this pattern appears quite amenable for in-DBMS execution; 3) applications where prediction serving will likely be used (e.g., websites, smart BI dashboards) are often backed by a DBMS; 4) the top used operators in practical data science over tabular data are not compute-heavy neural networks, but rather memory-intensive operations (such as one-hot encoding or tree ensemble methods [25], [26]) which should benefit from in-DBMS execution; 5) when data already resides in a database, execution of in-DBMS predictions is a natural choice, whereas a different solution will require pulling the data out of the database. This not only is a path not always practicable, for instance, if for security reasons data cannot be moved outside the database, but it also causes performance costs, while making it difficult to enforce the "Enterprise-grade" features without resorting to bespoken solutions (and likely increasing the technical debt).…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Pushing ML Predictions Into DBMSs

Paganelli

Sottovia

Park

et al. 2023

IEEE Trans. Knowl. Data Eng.

Self Cite

View full text Add to dashboard Cite

In the past decade, many approaches have been suggested to execute ML workloads on a DBMS. However, most of them have looked at in-DBMS ML from a training perspective, whereas ML inference has been largely overlooked. We think that this is an important gap to fill for two main reasons: (1) in the near future, every application will be infused with some sort of ML capability; (2) behind every web page, application, and enterprise there is a DBMS, whereby in-DBMS inference is an appealing solution both for efficiency (e.g., less data movement), performance (e.g., cross-optimizations between relational operators and ML) and governance. In this paper, we study whether DBMSs are a good fit for prediction serving. We introduce a technique for translating trained ML pipelines containing both featurizers (e.g., one-hot encoding) and models (e.g., linear and tree-based models) into SQL queries, and we compare in-DBMS performance against popular ML frameworks such as Sklearn and ML.NET. Our experiments show that, when pushed inside a DBMS, trained ML pipelines can have performance comparable to ML frameworks in several scenarios, while they perform quite poorly on text featurization and over (even simple) neural networks.

show abstract

Section: Background: ML Workflowmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Pushing ML Predictions Into DBMSs

Paganelli

Sottovia

Park

et al. 2023

IEEE Trans. Knowl. Data Eng.

Self Cite

View full text Add to dashboard Cite

show abstract

“…Traditional ML is most widely used. According to the latest Kaggle survey [32] and an analysis of publicly available Python notebooks [69], traditional ML algorithms, such as linear/logistic regression and tree-based models (decision trees, random forests, gradient boosting) are the most popular by a large margin. 80% of the Kaggle responders use them, as opposed to 43% for neural networks.…”

Section: Motivationmentioning

confidence: 99%

“…Trained pipelines. We evaluate Raven over four popular traditional ML model types [32,69], namely, logistic regression (LR), decision tree (DT), gradient boosting (GB), and random forest (RF). Each trained pipeline includes featurizers for numerical and categorical inputs: we normalize the former using standard scaling, and encode the latter using one-hot encoding [79,80].…”

Section: Experimental Evaluationmentioning

confidence: 99%

“…These queries often include additional data processing operators (e.g., filters or joins), implementing prediction-specific logic. As we discuss in §2, ML inference drives the majority of the cost (up to 90% [31]) associated with ML in the enterprise, whereas traditional ML models (i.e., nonneural networks, such as linear regression and tree-based models) are still the most widely used by a large margin [32,69]. Moreover, based on the analysis of 130 customer engagements at Microsoft, we found that batch inference is chosen over online inference in most enterprise scenarios.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

End-to-end Optimization of Machine Learning Prediction Queries

Park

Saur

Banda

et al. 2022

Proceedings of the 2022 International Conference on Management of Data

Self Cite

View full text Add to dashboard Cite

Prediction queries are widely used across industries to perform advanced analytics and draw insights from data. They include a data processing part (e.g., for joining, filtering, cleaning, featurizing the datasets) and a machine learning (ML) part invoking one or more trained models to perform predictions. These parts have so far been optimized in isolation, leaving significant opportunities for optimization unexplored.We present Raven, a production-ready system for optimizing prediction queries. Raven follows the enterprise architectural trend of collocating data and ML runtimes. It relies on a unified intermediate representation that captures both data and ML operators in a single graph structure to unlock two families of optimizations. First, it employs logical optimizations that pass information between the data part (and the properties of the underlying data) and the ML part to optimize each other. Second, it introduces logical-tophysical transformations that allow operators to be executed on different runtimes (relational, ML, and DNN) and hardware (CPU, GPU). Novel data-driven optimizations determine the runtime to be used for each part of the query to achieve optimal performance. Our evaluation shows that Raven improves performance of prediction queries on Apache Spark and SQL Server by up to 13.1× and 330×, respectively. For complex models where GPU acceleration is beneficial, Raven provides up to 8× speedup compared to state-of-the-art systems.

show abstract

Enhancing Solar Forecasting Accuracy with Sequential Deep Artificial Neural Network and Hybrid Random Forest and Gradient Boosting Models across Varied Terrains

Hanif,

Siddique,

et al. 2024

Advcd Theory and Sims

View full text Add to dashboard Cite

Effective solar energy utilization demands improvements in forecasting due to the unpredictable nature of solar irradiance (SI). This study introduces and rigorously tests two innovative forecasting models across different locations: the Sequential Deep Artificial Neural Network (SDANN) and the Deep Hybrid Random Forest Gradient Boosting (RFGB). SDANN, leveraging deep learning, aims to identify complex patterns in weather data, while RFGB, combining Random Forest and Gradient Boosting, proves more effective by offering a superior balance of efficiency and accuracy. The research highlights the SDANN model's deep learning capabilities along with the RFGB model's unique blend and their comparative success over existing models such as eXtreme Gradient Boosting (XGBOOST), Categorical Boosting (CatBOOST), Gated Recurrent Unit (GRU), and a K‐Nearest Neighbors (KNN) and XGBOOST hybrid. With the lowest Mean Squared Error (147.22), Mean Absolute Error (8.77), and a high R2 value (0.80) in a studied region, RFGB stands out. Additionally, detailed ablation studies on meteorological feature impacts on model performance further enhance accuracy and adaptability. By integrating cutting‐edge AI in SI forecasting, this research not only advances the field but also sets the stage for future renewable energy strategies and global policy‐making.

show abstract

Data Science Through the Looking Glass

Cited by 16 publications

References 11 publications

Pushing ML Predictions Into DBMSs

Pushing ML Predictions Into DBMSs

End-to-end Optimization of Machine Learning Prediction Queries

Enhancing Solar Forecasting Accuracy with Sequential Deep Artificial Neural Network and Hybrid Random Forest and Gradient Boosting Models across Varied Terrains

Contact Info

Product

Resources

About