Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond 2017
DOI: 10.1145/3070607.3070613

A containerized analytics framework for data and compute-intensive pipeline applications

Abstract: The joint effort of scientific collaborations and the expanding data market creates demand for high-performance and data-intensive analytics infrastructures that can exploit the potential of heterogeneous multi-core architectures with dynamic and scalable execution environments. Contemporary approaches focus on developing efficient parallel application models, but lack the flexibility to efficiently integrate and utilize native or accelerator-based code. In this work, we illustrate a novel approach on mendi…


Cited by 3 publications (4 citation statements, published in 2017 and 2021); references 21 publications.
“…However, Data Civilizer does not benefit from the systematic use of data context that is described here; it could be extended to do so. On the contrary, Big Data analytics platforms such as [7] and [6] focus on optimising the execution of composable data analytics workflows according to data locality and data flow. Our platform focuses on the design and implementation of a scalable and modularised data wrangling workflow in a domain-independent manner.…”
Section: Related Work (mentioning)
confidence: 99%
“…Such steps can be carried out using traditional Extract-Transform-Load (ETL) or Big Data analytics platforms [5], [6], [7], both requiring significant manual involvement in specifying, configuring, programming or tuning many of the steps [8], [9]. It is widely reported that intense manual involvement in such processes is expensive (e.g., [10]), often representing more than half the time of data scientists.…”
Section: Introduction (mentioning)
confidence: 99%
“…Such steps can be carried out using Extract-Transform-Load (ETL) [3] or Big Data analytics platforms [4], both necessitating significant manual involvement in specifying, configuring, programming or tuning many of the steps. It is widely reported that intense manual involvement in such processes is expensive…”
Section: Introduction (mentioning)
confidence: 99%
“…Transformed, integrated and repaired records' schematic correspondences. We utilize the Coma 3.0 community edition, specifically the Coma workflow (configuration 7001), which combines different metadata-based match heuristics. When data context is provided in D, each such data set is used as a partial extensional representation of the target to carry out instance-based matching with the source (line 5).…”
confidence: 99%
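
The excerpt above contrasts metadata-based match heuristics (as in the cited Coma workflow) with instance-based matching, where overlap between actual column values, drawn here from a data-context data set, drives the correspondence. As a rough illustration of the general technique only, and not of Coma 3.0's actual API, the following hypothetical Python sketch scores source columns against a data-context data set by Jaccard similarity of their normalized value sets; all names (`jaccard`, `match_instances`, the sample tables) are invented for this example.

```python
# Hypothetical sketch of instance-based schema matching: a data-context
# data set serves as a partial extensional representation of the target,
# and source columns are matched to target columns by value overlap.
# Illustrative only; this is not the Coma 3.0 API.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two value sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def match_instances(source: dict, context: dict, threshold: float = 0.3):
    """Return (source_col, target_col, score) triples above `threshold`.

    `source` and `context` map column names to lists of cell values;
    `context` plays the role of a data-context data set D.
    """
    matches = []
    for s_col, s_vals in source.items():
        # Normalize values so trivial formatting differences do not
        # mask genuine instance overlap.
        s_set = {str(v).strip().lower() for v in s_vals}
        for t_col, t_vals in context.items():
            t_set = {str(v).strip().lower() for v in t_vals}
            score = jaccard(s_set, t_set)
            if score >= threshold:
                matches.append((s_col, t_col, score))
    # Strongest correspondences first.
    return sorted(matches, key=lambda m: -m[2])

# Invented sample data for demonstration.
source = {"town": ["Manchester", "Leeds", "York"],
          "pop": [547000, 789000, 208000]}
context = {"city": ["leeds", "york", "durham"],
           "population": [789000, 208000, 48000]}
print(match_instances(source, context))
# -> [('town', 'city', 0.5), ('pop', 'population', 0.5)]
```

In this toy run, the value overlap alone pairs "town" with "city" and "pop" with "population" even though the column names share no metadata, which is precisely the case where instance-based matching complements metadata-based heuristics.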