2015
DOI: 10.1016/j.is.2015.04.002

SOFA: An extensible logical optimizer for UDF-heavy data flows

Abstract: Recent years have seen an increased interest in large-scale analytical data flows on nonrelational data. These data flows are compiled into execution graphs scheduled on large compute clusters. In many novel application areas the predominant building blocks of such data flows are user-defined predicates or functions (UDFs). However, the heavy use of UDFs is not well taken into account for data flow optimization in current systems. SOFA is a novel and extensible optimizer for UDF-heavy data flows.…

Cited by 30 publications
(28 citation statements)
References 25 publications
“…To this end, it examines only sub-flows in terms of meeting the dependency constraints and applies a set of recursive calls until generating all the promising data flow plans employing early pruning. Such an optimization technique has been applied in [21,40] for executing parallel scientific workflows efficiently, as part of a new optimization technique for the development of a logical optimizer, which is integrated into the Stratosphere system [62], the predecessor of Apache Flink. An interesting feature of this approach is that following common practice from database systems it performs static task analysis (i.e., task profiling) in order to yield statistics and fine-grained dependency constraints between tasks going further from the knowledge that can be derived from simply examining the task schemata.…”
Section: Techniques For Minimizing the Sum Of Costs (mentioning)
confidence: 99%
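
The enumeration strategy described in this statement (extend only sub-flows that satisfy the dependency constraints, recurse, and prune early) can be illustrated with a minimal sketch. It is not the cited optimizer's implementation: the function name `best_plan`, the per-record `cost` and `sel` (selectivity) dictionaries, and the sum-of-costs model are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


def best_plan(
    ops: List[str],
    deps: Dict[str, Set[str]],
    cost: Dict[str, float],
    sel: Dict[str, float],
) -> Tuple[Optional[List[str]], float]:
    """Find the cheapest linear ordering of `ops` (hypothetical cost model).

    deps[op] is the set of operators that must run before `op`.
    An operator's cost is cost[op] times the fraction of records reaching it;
    that fraction is the product of the selectivities of earlier operators.
    """
    best_cost = float("inf")
    best_order: Optional[List[str]] = None

    def extend(order: List[str], placed: Set[str], card: float, total: float) -> None:
        nonlocal best_cost, best_order
        # Early pruning: drop partial plans that cannot beat the best plan so far.
        if total >= best_cost:
            return
        if len(order) == len(ops):
            best_cost, best_order = total, list(order)
            return
        for op in ops:
            # Only extend with operators whose dependencies are already placed.
            if op in placed or not deps.get(op, set()) <= placed:
                continue
            order.append(op)
            placed.add(op)
            extend(order, placed, card * sel[op], total + card * cost[op])
            order.pop()
            placed.remove(op)

    extend([], set(), 1.0, 0.0)
    return best_order, best_cost


if __name__ == "__main__":
    order, total = best_plan(
        ops=["parse", "annotate", "filter"],
        deps={"annotate": {"parse"}, "filter": {"parse"}},
        cost={"parse": 1.0, "annotate": 10.0, "filter": 1.0},
        sel={"parse": 1.0, "annotate": 1.0, "filter": 0.1},
    )
    print(order, total)
```

In this toy flow the selective `filter` is scheduled before the expensive `annotate`, because both depend only on `parse` and the filter reduces the number of records the annotator has to process.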
“…Crotty et al. developed Tupleware, a cluster programming environment emphasizing code generation and stateful analytics, but the emphasis is on low-level programming idioms rather than marrying logical and physical abstractions [11]. Rheinlander et al. proposed a logical optimizer for UDF-centric dataflows called SOFA [29]. SOFA emphasizes properties of UDFs to facilitate optimizations.…”
Section: Related Work (mentioning)
confidence: 99%
“…In fact, similar principles to those introduced in query optimization (i.e., generating semantically equivalent execution plans for a query by reordering operations, and then finding a plan with a minimal cost) have been applied in [83] and extended to the context of ETL flows. Another work [44,76] has based operation reordering (i.e., plan rewrites) on automatically discovering a set of extensible operation properties rather than relying solely on algebraic specifications, in order to enable reordering of complex ("black-box") operators. While low data latency is desirable for ETL processes, due to limited time windows dedicated to the DW refreshment processes, in the next generation BI setting, having data-intensive flows with close to zero latency is a must.…”
Section: Optimization Input (mentioning)
confidence: 99%
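
One way to picture property-based reordering of "black-box" operators is a read/write-set conflict check. The sketch below is an assumption-laden illustration, not the property model of the cited works: the `Op` class, its `reads` and `writes` field sets, and the `can_swap` rule are hypothetical names chosen here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A black-box operator annotated with the record fields it touches."""
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def can_swap(a: Op, b: Op) -> bool:
    """Two adjacent operators may be reordered if neither writes a field
    that the other reads or writes (no read/write or write/write conflict)."""
    return not (
        a.writes & b.reads
        or b.writes & a.reads
        or a.writes & b.writes
    )


if __name__ == "__main__":
    lang_filter = Op("lang_filter", reads=frozenset({"lang"}))
    annotate = Op("annotate", reads=frozenset({"text"}), writes=frozenset({"entities"}))
    tokenize = Op("tokenize", reads=frozenset({"text"}), writes=frozenset({"tokens"}))
    tag = Op("tag", reads=frozenset({"tokens"}), writes=frozenset({"pos"}))

    print(can_swap(lang_filter, annotate))  # independent fields
    print(can_swap(tokenize, tag))          # tag reads what tokenize writes
```

The check is deliberately conservative: it only inspects declared field sets, so an operator with undeclared side effects would need further properties before it could be reordered safely.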