Rainbow: Adaptive Layout Optimization for Wide Tables

Bian, Haoqiong; Tao, Youxian; Jin, Guodong; Chen, Yueguo; Qin, Xiongpai; Du, Xiaoyong

doi:10.1109/icde.2018.00200

Cited by 4 publications

(13 citation statements)

References 6 publications

“…We do not consider CPU cost due to its negligible impact compared to I/O cost (existing works [16,3] already proved that this is enough to capture the execution trend). Finally, we do not need any shuffling [3], because we focus only on the first operation loading data and therefore, the networking cost for shuffling is considered to be zero.…”

Section: Estimating Makespan (mentioning)

confidence: 99%

“…Since huge volumes of data are difficult to be stored on model first load later fashion, organizations end up storing all the the raw data on a distributed file system (e.g., HDFS 3 ) or cloud storage (e.g., Amazon S3 4 ). In addition, they have their own data pipelines to process the raw data, and store it into very wide tables [4,15] using hybrid layouts [3,16], which have built-in support for projection and selection operations, helping in reading data more efficiently from the disk [27,28].…”

Section: Introduction (mentioning)

confidence: 99%

Configuring Parallelism for Hybrid Layouts Using Multi-Objective Optimization

et al. 2020

Modern organizations typically store their data in a raw format in data lakes. This data is then processed and usually stored under hybrid layouts, because they support projection and selection operations and thus allow reading less data from the disk when required. However, distributed processing frameworks (e.g., Hadoop, Spark) do not exploit this well when analytical queries are posed. These frameworks divide the data into multiple partitions and then process each partition in a separate task, consequently creating tasks based on the total file size and not the actual size of the data to be read. This typically leads to launching more tasks than needed, which in turn increases the query execution time and induces significant waste of computing resources. To allow a more efficient use of resources and reduce the query execution time, we propose a method that decides the number of tasks based on the data being read. To this end, we first propose a cost-based model for estimating the size of data read in hybrid layouts. Next, we use the estimated reading size in a multi-objective optimization method to decide the number of tasks and computational resources to be used. We prototyped our solution for Apache Parquet and Spark and found that our estimations are highly correlated (0.96) with the real executions. Furthermore, using TPC-H we show that our recommended configurations are only 5.6% away from the Pareto front and provide a 2.1x speedup compared to default solutions.
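
The contrast drawn in this abstract, sizing tasks by total file size versus by the data actually read, can be illustrated with a minimal Python sketch. It is not the cost model from the cited paper: the function names, the fixed 128 MiB split size, and the single `fraction_read` parameter (standing in for the effect of projection and selection) are all assumptions made for illustration.

```python
import math

# Assumed split size per task (hypothetical default, 128 MiB).
PARTITION_BYTES = 128 * 1024 * 1024


def tasks_by_file_size(total_file_bytes, partition_bytes=PARTITION_BYTES):
    """Task count when splits are derived from the total file size."""
    return max(1, math.ceil(total_file_bytes / partition_bytes))


def tasks_by_data_read(total_file_bytes, fraction_read,
                       partition_bytes=PARTITION_BYTES):
    """Task count when splits are derived from the estimated bytes read.

    fraction_read is a hypothetical stand-in for a real estimate of how
    much of the file survives projection and selection (0 < fraction <= 1).
    """
    if not 0 < fraction_read <= 1:
        raise ValueError("fraction_read must be in (0, 1]")
    estimated_bytes = total_file_bytes * fraction_read
    return max(1, math.ceil(estimated_bytes / partition_bytes))


if __name__ == "__main__":
    ten_gib = 10 * 1024 ** 3
    print(tasks_by_file_size(ten_gib))        # 80 tasks from file size alone
    print(tasks_by_data_read(ten_gib, 0.25))  # 20 tasks if a quarter is read
```

Under these assumptions, a 10 GiB file yields 80 tasks when sized by file size but only 20 when a query reads a quarter of the data, which is the over-provisioning gap the abstract describes.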

Configuring Parallelism for Hybrid Layouts Using Multi-Objective Optimization

et al. 2020

“…The I/O cost depends on the amount of data read within a task and the disk bandwidth. We do not consider CPU cost due to its negligible impact compared to I/O cost (existing works [2,11] already proved that this is enough to capture the execution trend). Finally, we focus on the first operation loading data, thus networking cost for shuffling is also considered to be zero [2].…”

Section: Task's Cost Estimation (mentioning)

confidence: 99%

“…We do not consider CPU cost due to its negligible impact compared to I/O cost (existing works [2,11] already proved that this is enough to capture the execution trend). Finally, we focus on the first operation loading data, thus networking cost for shuffling is also considered to be zero [2]. However, there is still a networking cost for metadata, because current solutions require to sequentially transfer metadata to all other executors before start processing the data.…”

Section: Task's Cost Estimation (mentioning)

confidence: 99%

“…These frameworks provide distributed storage (e.g., HDFS 5 ) and distributed processing [6]. In addition, for more efficient analysis, very wide tables [3,10] are being used to store non-normalized data in hybrid layouts [2,11]. Through their built-in operations (e.g., projection, selection), these layouts read data more efficiently from the disk.…”

Section: Introduction (mentioning)

confidence: 99%

Automatically Configuring Parallelism for Hybrid Layouts

Munir, Abelló, Romero, et al. 2019

Communications in Computer and Information Science

Distributed processing frameworks process data in parallel by dividing it into multiple partitions, each of which is processed in a separate task. The number of tasks is always determined by the total file size. However, this can lead to launching more tasks than needed in the case of hybrid layouts, because they help to read less data for certain operations (i.e., projection, selection). The over-provisioning of tasks may increase the job execution time and induce significant waste of computing resources. The latter is due to the fact that each task introduces extra overhead (e.g., initialization, garbage collection). To allow a more efficient use of resources and reduce the job execution time, we propose a cost-based approach that decides the number of tasks based on the data being read. The proposed cost model can be utilized in a multi-objective approach to decide both the number of tasks and the number of machines for execution.

Pixels: An Efficient Column Store for Cloud Data Lakes

Bian, Ailamaki 2022

2022 IEEE 38th International Conference on Data Engineering (ICDE)
