2020
DOI: 10.14778/3407790.3407807
Towards scalable dataframe systems

Abstract: Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in R and Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant ambiguity regarding dataframe semantics. In this paper we lay out a vision and roadmap for scalable dataframe systems. To demonstrate the potential in this area, we report on our experience building Modin, a scaled-up implementation of the most widely-used and comp…
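The abstract describes Modin as a scaled-up implementation of the pandas dataframe API. A minimal sketch of the dataframe abstraction the paper targets is below, written with pandas; per the Modin project, swapping the import for `import modin.pandas as pd` is intended to be the only change needed to scale the same code.

```python
# Minimal dataframe workflow sketch. Modin exposes a drop-in pandas API:
# replacing this import with `import modin.pandas as pd` is, per the
# project, the only change needed to run the same code scaled up.
import pandas as pd

df = pd.DataFrame({"city": ["SF", "NY", "SF"], "sales": [10, 20, 30]})

# A typical prepare-and-analyze step: group by a column and aggregate.
totals = df.groupby("city")["sales"].sum()
```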

Cited by 53 publications (39 citation statements)
References 39 publications
“…We use Dask as the back-end engine of DataPrep.EDA for three reasons: (i) it is lightweight and fast in a single-node environment, (ii) it can scale to a distributed cluster, and (iii) it can optimize the computations required for multiple visualizations via lazy evaluation. We considered other engines like Spark variants [38,73] (PySpark and Koalas) and Modin [60], but found that they were less suitable for DataPrep.EDA than Dask. Since Spark is designed for computations on very big data (TB to PB) in a large cluster, PySpark and Koalas are not lightweight like Dask and have a high scheduling overhead on a single node.…”
Section: Why Dask
confidence: 99%
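The statement above credits Dask's lazy evaluation with letting DataPrep.EDA optimize the computation shared by multiple visualizations. A minimal plain-Python sketch of the idea (a hypothetical `Task` class, not Dask's actual API): operations are recorded as a graph and only executed on `.compute()`, so a node feeding several outputs runs once.

```python
# Hypothetical sketch of lazy evaluation as used by engines like Dask:
# build a task graph first, execute on demand, and cache shared nodes
# so work common to multiple outputs (e.g. visualizations) runs once.
class Task:
    def __init__(self, fn, *deps):
        self.fn = fn
        self.deps = deps
        self._cache = None

    def compute(self):
        # Memoize so a node shared by several downstream tasks runs once.
        if self._cache is None:
            self._cache = self.fn(*[d.compute() for d in self.deps])
        return self._cache

data = Task(lambda: list(range(10)))        # shared input, computed once
total = Task(lambda xs: sum(xs), data)      # feeds visualization 1
peak = Task(lambda xs: max(xs), data)       # feeds visualization 2
```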
“…Let be the dataset in Example 3.1 and be an imputation function that associates to the ⊥'s occurring in a feature a the most frequent value occurring in * a . Then, the result of the expression (Zip) ( ) is the following dataset: We note that the data manipulation model presented here has some similarity with the Dataframe algebra [32]. The main difference is that we have focused on a restricted set of core operators (with some of those in [32] missing and others combined in one) with the specific goal of providing a solid basis to an effective technique for capturing data provenance of classical preprocessing operators.…”
Section: Data Manipulation Model
confidence: 99%
“…We point out that our algebra can be easily extended to include operators implementing other ETL/ELT-like transformations, such as join, intersection, and union, whose fine-grained provenance capture has been described elsewhere [50].…”
Section: Data Manipulation Model
confidence: 99%
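The citation statements above describe an imputation operator that replaces the ⊥ (missing) values of a feature with the most frequent observed value of that feature. A self-contained sketch of such an operator, with `None` standing in for ⊥ (the function name and signature are illustrative, not from the cited paper):

```python
# Hypothetical sketch of the imputation operator described above:
# replace missing values (None, standing in for ⊥) in a column with
# the most frequent non-missing value of that column.
from collections import Counter

def impute_most_frequent(column):
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]
```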
“…The current data systems use asynchronous and loosely synchronous execution models for running programs at scale. Asynchronous execution is popular in systems such as Spark (Zaharia et al, 2010), Dask (Rocklin, 2015) and Modin (Petersohn et al, 2020). Loosely synchronous distributed execution is used in systems such as PyTorch (Paszke et al, 2019), Cylon (Widanage et al, 2020) and Twister2 (Fox, 2017).…”
Section: Introduction
confidence: 99%