Sebastian Kruse scite author profile

Efficient discovery of approximate dependencies

Functional dependencies (FDs) and unique column combinations (UCCs) form a valuable ingredient for many data management tasks, such as data cleaning, schema recovery, and query optimization. Because these dependencies are unknown in most scenarios, their automatic discovery has been well researched. However, existing methods mostly discover only exact dependencies, i.e., those without violations. Real-world dependencies, in contrast, are frequently approximate due to data exceptions, ambiguities, or data errors. This relaxation to approximate dependencies renders their discovery an even harder task than the already challenging exact dependency discovery. To this end, we propose the novel and highly efficient algorithm Pyro to discover both approximate FDs and approximate UCCs. Pyro combines a separate-and-conquer search strategy with sampling-based guidance that quickly detects dependency candidates and verifies them. In our broad experimental evaluation, Pyro outperforms existing discovery algorithms by a factor of up to 33, scales to larger datasets, and at the same time requires the least main memory.
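To make the notion of an approximate FD concrete, the sketch below scores a single candidate by the fraction of rows that would have to be removed for it to hold exactly. This is a common textbook-style error measure chosen here for illustration; it is not claimed to be the measure or the search strategy used by Pyro, and the names `fd_error`, `holds_approximately`, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def fd_error(rows, lhs, rhs):
    """Fraction of rows to remove so that lhs -> rhs holds without violations.

    Illustrative measure only (an assumption, not taken from the abstract).
    """
    if not rows:
        return 0.0
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[col] for col in lhs)
        groups[key][row[rhs]] += 1
    # Within each group of equal lhs values, keep the most frequent rhs value.
    kept = sum(max(counts.values()) for counts in groups.values())
    return 1 - kept / len(rows)


def holds_approximately(rows, lhs, rhs, max_error):
    """True if the candidate FD stays within the given error threshold."""
    return fd_error(rows, lhs, rhs) <= max_error


# Hypothetical data: one misspelled city violates zip -> city.
rows = [
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Berlin"},
    {"zip": "10115", "city": "Brelin"},
    {"zip": "14482", "city": "Potsdam"},
]
error = fd_error(rows, ["zip"], "city")
```

With these rows, one of four tuples has to be dropped, so an exact discovery method would reject zip -> city while an approximate one with a threshold of 0.25 or more would accept it.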

Efficient denial constraint discovery with Hydra

Bleifuß, Kruse, Naumann. 2017. Proc. VLDB Endow.

Denial constraints (DCs) are a generalization of many other integrity constraints (ICs) widely used in databases, such as key constraints, functional dependencies, or order dependencies. Therefore, they can serve as a unified reasoning framework for all of these ICs and express business rules that cannot be expressed by the more restrictive IC types. The process of formulating DCs by hand is difficult, because it requires not only domain expertise but also database knowledge, and due to DCs' inherent complexity, this process is tedious and error-prone. Hence, automatic DC discovery is highly desirable: we search for all valid denial constraints in a given database instance. However, due to the large search space, the problem of DC discovery is computationally expensive. We propose a new algorithm, Hydra, which overcomes the quadratic runtime complexity in the number of tuples of state-of-the-art DC discovery methods. The new algorithm's experimentally determined runtime grows only linearly in the number of tuples. This results in a speedup by orders of magnitude, especially for datasets with a large number of tuples. Hydra can deliver results in a matter of seconds that to date took hours to compute.
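As a rough illustration of why the number of tuples matters, the sketch below validates one denial constraint by comparing every ordered pair of tuples, which is quadratic in the number of rows. It is a naive baseline written for this page, not Hydra's method; the constraint, the function `dc_violations`, and the sample rows are hypothetical.

```python
from itertools import permutations


def dc_violations(rows, forbidden):
    """Return every ordered tuple pair (t, s) for which the forbidden
    predicate holds. Naive check: inspects all n * (n - 1) pairs."""
    return [(t, s) for t, s in permutations(rows, 2) if forbidden(t, s)]


# Hypothetical rule: nobody with a higher salary may pay less tax.
def higher_salary_lower_tax(t, s):
    return t["salary"] > s["salary"] and t["tax"] < s["tax"]


rows = [
    {"salary": 3000, "tax": 300},
    {"salary": 4000, "tax": 250},
    {"salary": 5000, "tax": 500},
]
found = dc_violations(rows, higher_salary_lower_tax)
```

Here the constraint is violated by exactly one pair (the 4000 earner against the 3000 earner), so it would not be reported as a valid DC for this instance.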

Divide & conquer-based inclusion dependency discovery

et al. 2015

The discovery of all inclusion dependencies (INDs) in a dataset is an important part of any data profiling effort. Apart from the detection of foreign key relationships, INDs can help to perform data integration, query optimization, integrity checking, or schema (re-)design. However, the detection of INDs gets harder as datasets become larger in terms of number of tuples as well as attributes. To this end, we propose Binder, an IND detection system that is capable of detecting both unary and n-ary INDs. It is based on a divide & conquer approach, which allows it to handle very large datasets, an important property in the face of the ever-increasing size of today's data. In contrast to most related works, we neither rely on existing database functionality nor assume that inspected datasets fit into main memory. This renders Binder an efficient and scalable competitor. Our exhaustive experimental evaluation shows the clear superiority of Binder over the state of the art in both unary (Spider) and n-ary (Mind) IND discovery. Binder is up to 26x faster than Spider and more than 2500x faster than Mind.
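For readers unfamiliar with the term, a unary IND A ⊆ B simply states that every value of column A also occurs in column B. The sketch below checks this definition with in-memory sets; unlike Binder, it assumes the data fits into main memory and does no partitioning, and the function `unary_inds` and the column names are hypothetical.

```python
def unary_inds(columns):
    """Return all pairs (a, b) such that the distinct non-null values of
    column a are contained in those of column b.

    In-memory illustration of the IND definition only; not Binder's
    divide & conquer approach.
    """
    distinct = {
        name: {value for value in values if value is not None}
        for name, values in columns.items()
    }
    return sorted(
        (a, b)
        for a in distinct
        for b in distinct
        if a != b and distinct[a] <= distinct[b]
    )


# Hypothetical columns: a foreign key candidate and two unrelated columns.
columns = {
    "orders.customer_id": [1, 2, 2, None],
    "customers.id": [1, 2, 3],
    "customers.age": [30, 41, 25],
}
inds = unary_inds(columns)
```

On this input the only IND found is orders.customer_id ⊆ customers.id, which is the kind of result that hints at a foreign key relationship.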

Approximate Discovery of Functional Dependencies for Large Datasets

Bleifuß, Bülow, Frohnhofen, et al. 2016

RHEEM: enabling cross-platform data processing

et al. 2018

Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging: finding the most efficient platform for a given task requires deep expertise in all available platforms. We present Rheem, a general-purpose cross-platform data processing system that decouples applications from the underlying platforms. It not only determines the best platform to run an incoming task, but also splits the task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). It features (i) a robust interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Using different real-world applications with Rheem, we demonstrate how cross-platform data processing can accelerate performance by more than one order of magnitude compared to single-platform data processing.

RHEEMix in the data jungle: a cost-based optimizer for cross-platform systems

et al. 2020

Data analytics are moving beyond the limits of a single platform. In this paper, we present the cost-based optimizer of Rheem, an open-source cross-platform system that copes with these new requirements. The optimizer allocates the subtasks of data analytic tasks to the most suitable platforms. Our main contributions are: (i) a mechanism based on graph transformations to explore alternative execution strategies; (ii) a novel graph-based approach to determine efficient data movement plans among subtasks and platforms; and (iii) an efficient plan enumeration algorithm, based on a novel enumeration algebra. We extensively evaluate our optimizer under diverse real tasks. We show that our optimizer can perform tasks more than one order of magnitude faster when using multiple platforms than when using a single platform.

Keywords: Cross-platform, Polystore, Query optimization, Data processing

Automating Data Exchange in Process Choreographies

Meyer, Pufahl, Batoulis, et al. 2014

Establishment and Maintenance of Open Ribosomal RNA Gene Chromatin States in Eukaryotes

Schächner, Merkl, Pilsl, et al. 2022

In growing eukaryotic cells, nuclear ribosomal (r)RNA synthesis by RNA polymerase (RNAP) I accounts for the vast majority of cellular transcription. This high output is achieved by the presence of multiple copies of rRNA genes in eukaryotic genomes transcribed at a high rate. In contrast to most of the other transcribed genomic loci, actively transcribed rRNA genes are largely devoid of nucleosomes, adopting a characteristic “open” chromatin state, whereas a significant fraction of rRNA genes resides in a transcriptionally inactive nucleosomal “closed” chromatin state. Here, we review our current knowledge about the nature of open rRNA gene chromatin and discuss how this state may be established.
