Adding data provenance support to Apache Spark

Interlandi, Matteo; Ekmekji, Ari; Shah, Kshitij; Gulzar, Muhammad Ali; Tetali, Sai Deep; Kim, Miryung; Millstein, Todd; Condie, Tyson

doi:10.1007/s00778-017-0474-5

Cited by 58 publications

(87 citation statements)

References 30 publications

(49 reference statements)


“…This framework integrates the different schema matching and ontology alignment techniques for the purpose of information profiling. Metadata annotation can be efficient and does not heavily affect processing times of datasets in the DL as shown in related experiments like [13], [14] and in our experiments in Section VI.…”

Section: A Framework For Content Metadata Management

mentioning

confidence: 81%

“…Currently, data profiling and annotation is of great importance for research in DL architectures and is currently a hot topic for research [3], [12], [13]. Some techniques and approaches were previously investigated, but are mainly focused on relational content metadata [7], [10], free-text metadata [13], or data provenance metadata [1], [14]. Most of the current research efforts are suggesting the need for a governed metadata management process for integrating different varieties of BD [8], [13], [15].…”

Section: Related Work

mentioning

confidence: 99%


Towards Information Profiling: Data Lake Content Metadata Management

Alserafi

Abelló

Romero

et al. 2016

2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW)


Abstract: There is currently a burst of Big Data (BD) processed and stored in huge raw data repositories, commonly called Data Lakes (DL). These BD require new techniques of data integration and schema alignment in order to make the data usable by its consumers and to discover the relationships linking their content. This can be provided by metadata services which discover and describe their content. However, there is currently a lack of a systematic approach for such kind of metadata discovery and management. Thus, we propose a framework for the profiling of informational content stored in the DL, which we call information profiling. The profiles are stored as metadata to support data analysis. We formally define a metadata management process which identifies the key activities required to effectively handle this. We demonstrate the alternative techniques and performance of our process using a prototype implementation handling a real-life case-study from the OpenML DL, which showcases the value and feasibility of our approach.

I. INTRODUCTION

There is currently a huge growth in the amount, variety, and velocity of data ingested in analytical data repositories. Such data are commonly called Big Data (BD). Data repositories storing such BD in their original raw-format are commonly called Data Lakes (DL) [1]. DL are characterised by having a large amount of data covering different subjects, which need to be analysed by non-experts in IT commonly called data enthusiasts [2]. To support the data enthusiast in analysing the data in the DL, there must be a data governance process which describes the content using metadata. Such process should describe the informational content of the data ingested using the least intrusive techniques. The metadata can then be exploited by the data enthusiast to discover relationships between datasets, duplicated data, and outliers which have no other datasets related to them.

In this paper, we investigate the appropriate process and techniques required to manage the metadata about the informational content of the DL. We specifically focus on addressing the challenges of variety and variability of BD ingested in the DL. The metadata discovered supports data consumers in finding the required data in the large amounts of information stored inside the DL for analytical purposes [3]. Currently, information discovery to identify, locate, integrate and reengineer data consumes 70% of time spent in data analytics project [1], which clearly needs to be decreased. To handle this challenge, this paper proposes (i) a systematic process for the schema annotation of data ingested in the DL and


“…A major drawback of RAMP and Newt is that they do not provide access to the intermediate data of the computation (in contrast to [1] that offers this functionality); consequently, these two systems cannot provide the How provenance of an output record. Based on this limitation, Titian [27] made some nice progress in extending Spark [45] with step-by-step provenance tracking. Titian materializes the dependencies between individual records in a Spark job (including the intermediate ones), and offers an API for interactive forward and backward tracing of dependencies.…”

Section: Related Work

mentioning

confidence: 99%
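The citing statement above describes Titian as materializing dependencies between individual records, including intermediate ones, and offering forward and backward tracing over them. The idea can be illustrated with a minimal Python sketch. This is not Titian's API or implementation: the `LineageTable` class, its methods, and the record identifiers are all hypothetical names chosen for illustration, and the toy word count stands in for a real Spark job.

```python
from collections import defaultdict


class LineageTable:
    """Toy record-level lineage store (hypothetical, not Titian's API)."""

    def __init__(self):
        self.parents = defaultdict(set)   # record id -> ids it was derived from
        self.children = defaultdict(set)  # record id -> ids derived from it

    def record(self, parent, child):
        """Remember that `child` was derived from `parent`."""
        self.parents[child].add(parent)
        self.children[parent].add(child)

    def trace(self, start, direction):
        """Return every record reachable from `start`, backward or forward."""
        edges = self.parents if direction == "backward" else self.children
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


# Toy word count: input lines -> intermediate word records -> output counts.
lineage = LineageTable()
lines = {"line0": "a b", "line1": "b c"}
counts = defaultdict(int)
for line_id, text in lines.items():
    for pos, word in enumerate(text.split()):
        word_id = f"{line_id}/word{pos}"               # intermediate record
        lineage.record(line_id, word_id)               # split step
        lineage.record(word_id, f"count/{word}")       # aggregation step
        counts[word] += 1

# Which inputs and intermediate records contributed to the count for "b"?
backward = lineage.trace("count/b", "backward")
# Which records were influenced by the first input line?
forward = lineage.trace("line0", "forward")
```

Keeping the intermediate word records in the table is what lets a backward trace report the steps between an output and its inputs, rather than only the input lines themselves.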

“…Datalog-based Native Operator Level According to DTaP* [46] (NDlog Engine on ns-3) (NDlog) Theorem 1
1. The current version of Titian does not support iteration through GraphX [44].
2. Newt has been applied to Hyracks and Hadoop [31], and to Spark [27], all of which support DAG dataflows.
* These systems are not general-purpose data processing systems but they offer interesting features regarding provenance management.…”

Section: Related Work

mentioning

confidence: 99%

“…Suppressing records ensures that each operator is correctly re-simulated, but does not solve the issue that the computation's input does not yield the desired output, which hinders the logical debugging of the program. In fact, all provenance-aware systems [26,27,4,31,46,29,1] provide explanations that guarantee the reproduction of the output only when the conditions of Theorem 1 are satisfied.…”

mentioning

confidence: 99%


Explaining outputs in modern data analytics

et al. 2016


We report on the design and implementation of a general framework for interactively explaining the outputs of modern data-parallel computations, including iterative data analytics. To produce explanations, existing works adopt a naive backward tracing approach which runs into known issues; naive backward tracing may identify: (i) too much information that is difficult to process, and (ii) not enough information to reproduce the output, which hinders the logical debugging of the program. The contribution of this work is twofold. First, we provide methods to effectively reduce the size of explanations based on the first occurrence of a record in an iterative computation. Second, we provide a general method for identifying explanations that are sufficient to reproduce the target output in arbitrary computations -- a problem for which no viable solution existed until now. We implement our approach on differential dataflow, a modern high-throughput, low-latency dataflow platform. We add a small (but extensible) set of rules to explain each of its data-parallel operators, and we implement these rules as differential dataflow operators themselves. This choice allows our implementation to inherit the performance characteristics of differential dataflow, and results in a system that efficiently computes and updates explanatory inputs even as the inputs of the reference computation change. We evaluate our system with various analytic tasks on real datasets, and we show that it produces concise explanations in tens of milliseconds, while remaining faster -- up to two orders of magnitude -- than even the best implementations that do not support explanations.


Roaring bitmaps: Implementation of an optimized software library

Lemire

Kaser

Kurz

et al. 2018

Softw Pract Exp


Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries. They represent sets and often support operations such as unions, intersections, differences, and symmetric differences. Several important systems such as Elasticsearch, Apache Spark, Netflix's Atlas, LinkedIn's Pivot, Metamarkets' Druid, Pilosa, Apache Hive, Apache Tez, Microsoft Visual Studio Team Services, and Apache Kylin rely on a specific type of compressed bitmap index called Roaring. We present an optimized software library written in C implementing Roaring bitmaps: CRoaring. It benefits from several algorithms designed for the single-instruction-multiple-data instructions available on commodity processors. In particular, we present vectorized algorithms to compute the intersection, union, difference, and symmetric difference between arrays. We benchmark the library against a wide range of competitive alternatives, identifying weaknesses and strengths in our software. Our work is available under a liberal open-source license.

• We present several nontrivial algorithmic optimizations (see Table 1). In particular, we show that a collection of algorithms exploiting SIMD instructions can enhance the performance of a data structure like Roaring in some cases, above and beyond what state-of-the-art optimizing compilers can achieve. To our knowledge, it is the first work to report on the benefits of advanced SIMD-based algorithms for compressed bitmaps.

Although the approach we use to compute array intersections using SIMD instructions in Section 4.2 is not new [22, 23], our work on the computation of the union (Section 4.3), difference (Section 4.4), and symmetric difference (Section 4.4) of arrays using SIMD instructions might be novel and of general interest.

• We benchmark our C library against a wide range of alternatives in C and C++. Our results provide guidance as to the strengths and weaknesses of our implementation.

We focus primarily on our novel implementation and the lessons we learned: we refer to earlier work for details regarding the high-level algorithmic design of Roaring bitmaps [18, 19]. Because our library is freely available under a liberal open-source license, we hope that our work will be used to accelerate information systems.
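The abstract above refers to computing the intersection, union, difference, and symmetric difference between arrays. As a rough illustration of what those operations compute, here is a plain scalar merge over sorted arrays of distinct integers, written in Python. It is only a sketch under stated assumptions: it is not the vectorized (SIMD) algorithm the abstract describes, it is not CRoaring's API, and `merge_sorted` with its flag parameters is a hypothetical helper introduced for this example.

```python
def merge_sorted(a, b, left_only, both, right_only):
    """Walk two sorted arrays of distinct integers in lockstep.

    The three boolean flags select which values are kept: those only in `a`,
    those in both arrays, and those only in `b`. (Hypothetical helper.)
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            if left_only:
                out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            if right_only:
                out.append(b[j])
            j += 1
        else:
            if both:
                out.append(a[i])
            i += 1
            j += 1
    # At most one of the two arrays still has unread values at this point.
    if left_only:
        out.extend(a[i:])
    if right_only:
        out.extend(b[j:])
    return out


def intersection(a, b):
    return merge_sorted(a, b, False, True, False)


def union(a, b):
    return merge_sorted(a, b, True, True, True)


def difference(a, b):
    return merge_sorted(a, b, True, False, False)


def symmetric_difference(a, b):
    return merge_sorted(a, b, True, False, True)


a = [1, 3, 5, 7]
b = [3, 4, 7, 9]
print(intersection(a, b))          # [3, 7]
print(union(a, b))                 # [1, 3, 4, 5, 7, 9]
print(difference(a, b))            # [1, 5]
print(symmetric_difference(a, b))  # [1, 4, 5, 9]
```

All four operations share one linear pass and differ only in which of the three comparison outcomes they keep. The branch-heavy inner loop is the part that a vectorized implementation would aim to replace.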
