2020
DOI: 10.14778/3407790.3407807
Towards scalable dataframe systems

Abstract: Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this paper we lay out a vision and roadmap for scalable dataframe systems. To demonstrate the potential in this area, we report on our experience building Modin, a scaled-up implementation of the most widely-used and comp…
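The abstract describes Modin as a scaled-up implementation of the pandas dataframe API. A minimal sketch of the dataframe abstraction the paper targets is below, written with pandas; per the Modin project, swapping the import for `import modin.pandas as pd` is intended to be the only change needed to scale the same code.

```python
# Minimal dataframe workflow sketch. Modin exposes a drop-in pandas API:
# replacing this import with `import modin.pandas as pd` is, per the
# project, the only change needed to run the same code scaled up.
import pandas as pd

df = pd.DataFrame({"city": ["SF", "NY", "SF"], "sales": [10, 20, 30]})

# A typical prepare-and-analyze step: group by a column and aggregate.
totals = df.groupby("city")["sales"].sum()
```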

Cited by 53 publications (39 citation statements)
References 39 publications
“…We use Dask as the back-end engine of DataPrep.EDA for three reasons: (i) it is lightweight and fast in a single-node environment, (ii) it can scale to a distributed cluster, and (iii) it can optimize the computations required for multiple visualizations via lazy evaluation. We considered other engines like Spark variants [38,73] (PySpark and Koalas) and Modin [60], but found that they were less suitable for DataPrep.EDA than Dask. Since Spark is designed for computations on very big data (TB to PB) in a large cluster, PySpark and Koalas are not lightweight like Dask and have a high scheduling overhead on a single node.…”
Section: Why Dask
confidence: 99%
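The statement above credits Dask's lazy evaluation with letting DataPrep.EDA optimize the computation shared by multiple visualizations. A minimal plain-Python sketch of the idea (a hypothetical `Task` class, not Dask's actual API): operations are recorded as a graph and only executed on `.compute()`, so a node feeding several outputs runs once.

```python
# Hypothetical sketch of lazy evaluation as used by engines like Dask:
# build a task graph first, execute on demand, and cache shared nodes
# so work common to multiple outputs (e.g. visualizations) runs once.
class Task:
    def __init__(self, fn, *deps):
        self.fn = fn
        self.deps = deps
        self._cache = None

    def compute(self):
        # Memoize so a node shared by several downstream tasks runs once.
        if self._cache is None:
            self._cache = self.fn(*[d.compute() for d in self.deps])
        return self._cache

data = Task(lambda: list(range(10)))        # shared input, computed once
total = Task(lambda xs: sum(xs), data)      # feeds visualization 1
peak = Task(lambda xs: max(xs), data)       # feeds visualization 2
```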
“…Let be the dataset in Example 3.1 and be an imputation function that associates to the ⊥'s occurring in a feature a the most frequent value occurring in * a . Then, the result of the expression (Zip) ( ) is the following dataset: We note that the data manipulation model presented here has some similarity with the Dataframe algebra [32]. The main difference is that we have focused on a restricted set of core operators (with some of those in [32] missing and others combined in one) with the specific goal of providing a solid basis to an effective technique for capturing data provenance of classical preprocessing operators.…”
Section: Data Manipulation Model
confidence: 99%
“…We point out that our algebra can be easily extended to include operators implementing other ETL/ELT-like transformations, such as join, intersection, and union, whose fine-grained provenance capture has been described elsewhere [50].…”
Section: Data Manipulation Model
confidence: 99%
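The citation statements above describe an imputation operator that replaces the ⊥ (missing) values of a feature with the most frequent observed value of that feature. A self-contained sketch of such an operator, with `None` standing in for ⊥ (the function name and signature are illustrative, not from the cited paper):

```python
# Hypothetical sketch of the imputation operator described above:
# replace missing values (None, standing in for ⊥) in a column with
# the most frequent non-missing value of that column.
from collections import Counter

def impute_most_frequent(column):
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]
```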
“…The current data systems use asynchronous and loosely synchronous execution models for running programs at scale. Asynchronous execution is popular in systems such as Spark (Zaharia et al, 2010), Dask (Rocklin, 2015) and Modin (Petersohn et al, 2020). Loosely synchronous distributed execution is used in systems such as PyTorch (Paszke et al, 2019), Cylon (Widanage et al, 2020) and Twister2 (Fox, 2017).…”
Section: Introduction
confidence: 99%