2020
DOI: 10.14778/3377369.3377377

Pattern functional dependencies for data cleaning

Abstract: Patterns (or regex-based expressions) are widely used to constrain the format of a domain (or a column), e.g., a Year column should contain only four digits, and thus a value like "1980-" might be a typo. Moreover, integrity constraints (ICs) defined over multiple columns, such as (conditional) functional dependencies and denial constraints, e.g., a ZIP code uniquely determines a city in the UK, have been widely used in data cleaning. However, a promising, but not yet explored, direction is to combine regex-an…
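
The two ingredients named in the abstract, a per-column format pattern and a functional dependency across columns, can each be sketched in a few lines of Python. This is only an illustration of the two separate checks, not the pattern functional dependencies proposed in the paper; the function names, the four-digit year regex, and the sample rows are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed format rule: a year is exactly four digits.
YEAR_PATTERN = re.compile(r"\d{4}")


def pattern_violations(values, pattern=YEAR_PATTERN):
    """Return the values that do not fully match the column pattern."""
    return [v for v in values if not pattern.fullmatch(v)]


def fd_violations(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs.

    Returns a dict mapping each lhs value that is paired with more than
    one distinct rhs value to the sorted list of those rhs values.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


# Hypothetical sample data, made up for this sketch.
years = ["1980", "1980-", "2020"]
rows = [
    {"zip": "EH8 9AB", "city": "Edinburgh"},
    {"zip": "EH8 9AB", "city": "Edinburg"},
    {"zip": "M1 1AE", "city": "Manchester"},
]

print(pattern_violations(years))          # ['1980-']
print(fd_violations(rows, "zip", "city"))  # {'EH8 9AB': ['Edinburg', 'Edinburgh']}
```

The first check flags "1980-" because it is not exactly four digits; the second flags the ZIP code that maps to two different city spellings. Neither check says which value is the correct one; it only surfaces candidates for repair.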

citations
Cited by 19 publications
(9 citation statements)
references
References 34 publications
“…Alternatively, pattern mining approaches attempt to discover the syntactic and semantic characterizations of the data. One technique for pattern discovery is inducing functional dependencies from the data [30][31][32]. Functional dependencies are considered a special form of denial constraints [30] and are commonly used to specify business rules.…”
Section: Error Detection (mentioning)
confidence: 99%
“…Existing research [31] has studied repeated patterns in the data and formalize them into functional dependencies to suggest better repair solutions. Another study [32] focuses on deriving such dependencies with the presence of erroneous data; the method [32] introduces a new class of integrity constraints that can infer dependencies between data attributes even if a portion of the attributes violates these dependencies.…”
Section: Error Detection (mentioning)
confidence: 99%
“…(i) Only examine the data at hand. There are integrity constraints (FDs [6], its extensions CFDs [26] and PFDs [54], denial constraints [16], and rule-based methods [33,66]), and probabilistic based methods (e.g., HoloClean [57]). They need enough signals or data redundancy from D. Supervised ML based methods (e.g., GDR [70], SCAREd [69], Raha [49] and Baran [48]) learn only from the data at hand, which cannot be generalized to other datasets.…”
Section: Related Work (mentioning)
confidence: 99%
“…This type of techniques cleans 𝐷 by only examining 𝐷, where predefined domain knowledge is often coded as rules. There are integrity constraints (FDs [10], its extensions conditional FDs [26] and pattern FDs [53], and denial constraints [16]), and probabilistic based methods (e.g., HoloClean [55]). They need enough signals or data redundancy from 𝐷.…”
Section: Rpt-c: Data Cleaning 21 the Data Cleaning Problem And Prior Art (mentioning)
confidence: 99%