Data-processing tasks are commonly managed using data-oriented workflows, in which input data sets are processed by a graph of transformations to produce output data. In data-oriented workflows, it can be useful to track data provenance (also sometimes called lineage), which describes where data came from and how it has been manipulated and combined. We begin by giving a new general definition of provenance, introducing the notions of correctness, precision, and minimality. We then: 1) describe a wrapper-based approach for capturing provenance in workflows in which all transformations are either map or reduce functions; 2) describe a provenance-based approach for selectively refreshing one or more elements in the output data, i.e., computing the latest values of particular output elements based on modified input data; 3) show how logical provenance, i.e., provenance information stored at the transformation level, can often capture precise provenance relationships in a compact fashion; and 4) describe our prototype system, called Panda (for Provenance And Data), that supports refresh in data-oriented workflows, as well as debugging and drill-down using logical provenance. Overall, our work provides a comprehensive foundation, set of algorithms, and prototype system for provenance in data-oriented workflows.
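As a rough illustration of the wrapper-based idea for map and reduce transformations, the following minimal Python sketch records, for every output element, the identifiers of the input elements it was derived from. It is not Panda's implementation: the names `map_with_provenance`, `reduce_with_provenance`, and `trace_back`, the string element identifiers, and the word-count example are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def map_with_provenance(map_fn, inputs):
    """Wrap a map function (hypothetical helper, not Panda's API).

    inputs: list of (input_id, value); map_fn(value) returns an iterable of outputs.
    Returns (outputs, provenance), where provenance maps output_id -> [input_id].
    """
    outputs, provenance = [], {}
    for input_id, value in inputs:
        for out in map_fn(value):
            output_id = f"m{len(outputs)}"
            outputs.append((output_id, out))
            # A map output depends on exactly one input element.
            provenance[output_id] = [input_id]
    return outputs, provenance

def reduce_with_provenance(reduce_fn, inputs):
    """Wrap a reduce function (hypothetical helper, not Panda's API).

    inputs: list of (input_id, (key, value)); reduce_fn(key, values) returns one value.
    Returns (outputs, provenance), where provenance maps output_id -> [input_id, ...].
    """
    groups = defaultdict(list)
    for input_id, (key, value) in inputs:
        groups[key].append((input_id, value))
    outputs, provenance = [], {}
    for key, members in groups.items():
        output_id = f"r{len(outputs)}"
        result = reduce_fn(key, [v for _, v in members])
        outputs.append((output_id, (key, result)))
        # A reduce output depends on every input element in its key group.
        provenance[output_id] = [i for i, _ in members]
    return outputs, provenance

def trace_back(output_id, *provenance_maps):
    """Follow provenance from the last stage back to the first, removing duplicates."""
    frontier = [output_id]
    for prov in reversed(provenance_maps):
        frontier = list(dict.fromkeys(src for oid in frontier for src in prov[oid]))
    return frontier

# Illustrative word-count workflow: one map stage followed by one reduce stage.
lines = [("line0", "a b a"), ("line1", "b c")]
mapped, map_prov = map_with_provenance(
    lambda line: [(word, 1) for word in line.split()], lines)
reduced, red_prov = reduce_with_provenance(
    lambda key, values: sum(values), mapped)

counts = dict(pair for _, pair in reduced)
sources_of_b = trace_back("r1", map_prov, red_prov)  # "r1" is the output for word "b"
```

In this sketch, tracing the count for `b` back through both stages yields the two input lines that contain it. The same stored mappings could in principle identify which outputs need recomputation when an input changes, which is the intuition behind selective refresh.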
CourseRank is a course-planning tool aimed at helping students at Stanford. Recommendations are an integral part of the system. However, implementing existing recommendation methods leads to fixed, pre-specified recommendations that cannot adapt to each student's changing requirements and do not help students exploit the full extent of the learning opportunities available at the university. In this paper, we describe the concept of a flexible recommendation workflow, i.e., a high-level description of a parameterized process for computing recommendations. The input parameters of a flexible recommendation process are the "knobs" that control the final output and hence support flexible recommendations. We describe how flexible recommendations can be expressed over a relational database, and we present our prototype system, which allows defining and executing different, fully parameterized recommendation workflows over relational data. Finally, we describe a user interface in CourseRank that allows students to make use of two flexible recommendation workflows.