2016
DOI: 10.14778/2994509.2994518

Detecting Data Errors

Abstract: Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two …

Cited by 155 publications (19 citation statements); references 33 publications.
“…Similarly, the proportion of erroneous categorical values can be estimated in the same way as we estimate the proportions of inconsistent values in numerical data. Furthermore, RSP-Explore can be extended directly to support logical data cleaning tasks such as those discussed in [15,54]. A block-level sample can be used to estimate the proportion of records that don't satisfy a certain constraint or the proportion of values that are slightly different from the correct value.…”
Section: Discussion (mentioning)
confidence: 99%
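To make the quoted idea concrete, here is a minimal sketch of estimating the proportion of constraint-violating records from a block-level sample. The block format, the zip -> city dependency, and all names are invented for illustration; they do not come from RSP-Explore or the cited papers.

import pandas as pd

# Hypothetical block-level sample: a few randomly chosen blocks (record
# batches) standing in for blocks drawn from the full data set.
blocks = [
    pd.DataFrame({"zip": ["10001", "10001", "94105"],
                  "city": ["NYC", "Boston", "SF"]}),
    pd.DataFrame({"zip": ["60601", "60601"],
                  "city": ["Chicago", "Chicago"]}),
]

def violates_fd(block):
    # Flag records whose zip maps to more than one city within the block,
    # i.e. records breaking an assumed functional dependency zip -> city.
    cities_per_zip = block.groupby("zip")["city"].transform("nunique")
    return cities_per_zip > 1

# Average the per-block violation rates to estimate the overall
# proportion of records that do not satisfy the constraint.
rates = [violates_fd(b).mean() for b in blocks]
print(sum(rates) / len(rates))  # ~0.33 for the toy blocks above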
“…Sampling-based approaches have been adopted to alleviate the burden of big data volume not only when approximate results are as useful as exact ones [1][2][3][4][5], but also when the results from a small clean sample can be more accurate than those from the entire dirty data [6][7][8][9]. It is a common practice to iteratively generate small random samples of a big data set to explore the statistical properties of the entire data and define cleaning rules [10][11][12][13][14][15][16][17][18][19]. This iterative process becomes impractical or impossible on small computing clusters due to the communication, I/O, and memory costs of cluster computing frameworks that implement a shared-nothing architecture [20][21][22].…”
Section: Introduction, Motivation (mentioning)
confidence: 99%
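As a rough illustration of the iterative sample-then-inspect practice the passage describes: a sketch with invented data and an invented candidate rule, not the cited systems' actual workflow.

import numpy as np
import pandas as pd

# Invented stand-in for a big dirty table; a real pipeline would read
# from distributed storage rather than generating data in memory.
rng = np.random.default_rng(0)
data = pd.DataFrame({"amount": rng.normal(100.0, 15.0, 1_000_000)})
data.loc[rng.choice(len(data), size=500, replace=False), "amount"] = -1.0

# Draw a few small random samples, inspect a summary statistic, and use
# it to motivate a candidate cleaning rule (here: amount must be >= 0).
for i in range(3):
    sample = data.sample(n=10_000, random_state=i)
    share = (sample["amount"] < 0).mean()
    print(f"sample {i}: estimated share of negative amounts = {share:.4%}")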
“… Using the right tools has a direct impact on the performance of the adopted CRM (Alshawi et al., 2011). As different types of errors can exist in the same data set, we often need to implement more than one error detection tool (Abedjan et al., 2016). Missi et al. (2005) cite a variety of tools that can be used to achieve data quality and integration: tools that provide relational access to data; tools that transform non-relational into relational data; tools that develop, test, and perform transformations in databases and automatically generate code that makes it easy to manage even the most complex transformations of all types of data and applications; tools for converting data among hundreds of formats and applications; tools for consolidation, verification, standardization, and real-time data profiling; and a tool that records, models, and maintains metadata from various sources and stores numerous models and versions.…”
Section: Solutions (mentioning)
confidence: 99%
“…Entity consolidation (4) is the process of merging all data about the same entity coherently (e.g., Hogan et al., 2012). An orthogonal but crucial component of the DI process is data cleansing (5), which can be applied to both the original data and the merged dataset (Abedjan et al., 2016).…”
Section: Data Integration (mentioning)
confidence: 99%