2016
DOI: 10.14778/2994509.2994518

Detecting Data Errors

Abstract: Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two …

Cited by 155 publications (19 citation statements); references 33 publications.
“…Similarly, the proportion of erroneous categorical values can be estimated in the same way as we estimate the proportions of inconsistent values in numerical data. Furthermore, RSP-Explore can be extended directly to support logical data cleaning tasks such as those discussed in [15,54]. A block-level sample can be used to estimate the proportion of records that don't satisfy a certain constraint or the proportion of values that are slightly different from the correct value.…”
Section: Discussion (mentioning)
confidence: 99%
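To make the quoted idea concrete, here is a minimal sketch of estimating the proportion of constraint-violating records from a block-level sample. The block format, the zip -> city dependency, and all names are invented for illustration; they do not come from RSP-Explore or the cited papers.

import pandas as pd

# Hypothetical block-level sample: a few randomly chosen blocks (record
# batches) standing in for blocks drawn from the full data set.
blocks = [
    pd.DataFrame({"zip": ["10001", "10001", "94105"],
                  "city": ["NYC", "Boston", "SF"]}),
    pd.DataFrame({"zip": ["60601", "60601"],
                  "city": ["Chicago", "Chicago"]}),
]

def violates_fd(block):
    # Flag records whose zip maps to more than one city within the block,
    # i.e. records breaking an assumed functional dependency zip -> city.
    cities_per_zip = block.groupby("zip")["city"].transform("nunique")
    return cities_per_zip > 1

# Average the per-block violation rates to estimate the overall
# proportion of records that do not satisfy the constraint.
rates = [violates_fd(b).mean() for b in blocks]
print(sum(rates) / len(rates))  # ~0.33 for the toy blocks above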
“…Sampling-based approaches have been adopted to alleviate the burden of big data volume not only when approximate results are as useful as exact ones [1][2][3][4][5], but also when the results from a small clean sample can be more accurate than those from the entire dirty data [6][7][8][9]. It is a common practice to iteratively generate small random samples of a big data set to explore the statistical properties of the entire data and define cleaning rules [10][11][12][13][14][15][16][17][18][19]. This iterative process becomes impractical or impossible on small computing clusters due to the communication, I/O, and memory costs of cluster computing frameworks that implement a shared-nothing architecture [20][21][22].…”
Section: Introduction, Motivation (mentioning)
confidence: 99%
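As a rough illustration of the iterative sample-then-inspect practice the passage describes: a sketch with invented data and an invented candidate rule, not the cited systems' actual workflow.

import numpy as np
import pandas as pd

# Invented stand-in for a big dirty table; a real pipeline would read
# from distributed storage rather than generating data in memory.
rng = np.random.default_rng(0)
data = pd.DataFrame({"amount": rng.normal(100.0, 15.0, 1_000_000)})
data.loc[rng.choice(len(data), size=500, replace=False), "amount"] = -1.0

# Draw a few small random samples, inspect a summary statistic, and use
# it to motivate a candidate cleaning rule (here: amount must be >= 0).
for i in range(3):
    sample = data.sample(n=10_000, random_state=i)
    share = (sample["amount"] < 0).mean()
    print(f"sample {i}: estimated share of negative amounts = {share:.4%}")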
“… Using the right tools has a direct impact on the performance of the adopted CRM (Alshawi et al., 2011). As different types of errors can exist in the same data set, we often need to implement more than one error detection tool (Abedjan et al., 2016). Missi et al. (2005) cite a variety of tools that can be used to achieve data quality and integration: tools that provide relational access to data; tools that transform non-relational into relational data; tools that develop, test, and perform transformations in databases and automatically generate code that makes it easy to manage even the most complex transformations of all types of data and applications; tools for converting data among hundreds of formats and applications; tools for consolidation, verification, standardization, and real-time data profiling; and a tool that records, models, and maintains metadata from various sources and stores numerous models and versions.…”
Section: Solutions (mentioning)
confidence: 99%
“…Entity consolidation (4) is the process of merging all data about the same entity coherently (e.g., Hogan et al., 2012). An orthogonal but crucial component of the DI process is data cleansing (5), which can be applied to both the original data and the merged dataset (Abedjan et al., 2016).…”
Section: Data Integration (mentioning)
confidence: 99%