MacroBase

Abuzaid, Firas; Bailis, Peter; Ding, Jialin; Gan, Edward; Madden, Samuel; Narayanan, Deepak; Rong, Kexin; Suri, Sahaana

doi:10.1145/3035918.3035928

Cited by 69 publications

(11 citation statements)

References 75 publications

“…Sometimes, when Data X-Ray uses the instances generated by MLDebugger, it does better, at least in recall. This is expected for the case where the root causes are conjunctions of property-comparator-value triples since Data X-Ray was designed to find relevant conjunctions. That is, Data X-Ray produces a conjunction of property-value combinations that lead to bad scenarios.…”

Footnote 2: https://github.com/raonilourenco/MLDebugger

Section: Results (mentioning)

confidence: 99%

“…For purposes of reproducibility and community use, we will make our code and experiments available. 2…”

Section: Ii1 Estimator Random Forest (mentioning)

confidence: 99%

“…Recently, the problem of explaining query results and interesting features in data has received substantial attention in the literature [2,10,13,21,24]. Some works have focused on explaining where and how errors occur in the data generation process [24] and which data items are more likely to be causes of relational query outputs [21,25].…”

Section: Related Work (mentioning)

confidence: 99%

“…Some works have focused on explaining where and how errors occur in the data generation process [24] and which data items are more likely to be causes of relational query outputs [21,25]. Others have attempted to use data to explain salient features in data (e.g., outliers) by discovering relationships between attribute values [2,10,13]. These approaches have either focused on using data, including provenance, to explain data or considered pipelines consisting of relational algebra operations.…”

Section: Related Work (mentioning)

confidence: 99%

Debugging Machine Learning Pipelines

Lourenço

Freire

Shasha

2019

Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning

Machine learning tasks entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous or uninformative outputs, the pipeline may fail or produce incorrect results. Inferring the root cause of failures and unexpected behavior is challenging, usually requiring much human thought, and is both time consuming and error prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our source code and experimental data will be available for reproducibility and enhancement.

“…Consequently, analyzing a big data set all at once may require more than the available resources in order to meet specific application requirements [8], [9]. Random sampling is a common strategy to alleviate these challenges [10], e.g., in approximate and incremental computing [8], [11]-[15]. However, drawing random samples from big data is itself an expensive operation [16] especially with the shared-nothing architectures in the mainstream distributed computing frameworks for big data analysis.…”

Section: Introduction (mentioning)

confidence: 99%

An Asymptotic Ensemble Learning Framework for Big Data Analysis

Salloum

Huang

et al. 2019

IEEE Access

In order to enable big data analysis when data volume goes beyond the available computing resources, we propose a new method for big data analysis. This method uses only a few random sample data blocks of a big data set to obtain approximate results for the entire data set. The random sample partition (RSP) distributed data model is used to represent a big data set as a set of non-overlapping random sample data blocks. Each block is saved as an RSP data block file that can be used directly to estimate the statistical properties of the entire data set. A subset of RSP data blocks is randomly selected and analyzed with existing sequential algorithms in parallel. Then, the results from these blocks are combined to obtain ensemble estimates and models which can be improved gradually by appending new results from the newly analyzed RSP data blocks. To this end, we propose a distributed data-parallel framework (Alpha framework) and develop a prototype of this framework using Microsoft R Server packages and Hadoop distributed file system. The experimental results of three real data sets show that a subset of RSP data blocks of a data set is sufficient to obtain estimates and models which are equivalent to those computed from the entire data set.

Index Terms: Big data analysis, cluster computing, random sample partition, block-level sampling, distributed and parallel computing, approximate computing, random sampling, ensemble methods.
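The block-level sampling idea in this abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Alpha framework, which the abstract says is built on Microsoft R Server packages and Hadoop. The function names `make_rsp_blocks` and `ensemble_mean`, the use of the mean as the estimated statistic, and the plain averaging of block estimates are all assumptions made for illustration.

```python
import random
from statistics import fmean


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records once, then cut them into non-overlapping blocks.

    Hypothetical helper: each block is a random sample of the whole data set.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding over the shuffled list yields disjoint blocks of near-equal size.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def ensemble_mean(blocks, num_selected, seed=0):
    """Estimate the overall mean from a randomly selected subset of blocks."""
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    # Each block is analyzed independently (here: a per-block mean).
    block_means = [fmean(block) for block in selected]
    # Combine the block-level estimates; a plain average is an assumed combiner.
    return fmean(block_means)


if __name__ == "__main__":
    data = [float(x) for x in range(10_000)]
    blocks = make_rsp_blocks(data, num_blocks=20)
    print("estimate from 5 of 20 blocks:", ensemble_mean(blocks, num_selected=5))
    print("mean of the full data set:   ", fmean(data))
```

Selecting more blocks and re-averaging corresponds to the gradual improvement of the ensemble estimate that the abstract describes.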
