Proceedings of the Workshop on Human-in-the-Loop Data Analytics 2016
DOI: 10.1145/2939502.2939511
Towards reliable interactive data cleaning

Cited by 55 publications (43 citation statements)
References 23 publications
“…additionally, each of these tasks encapsulates several specialized algorithms, such as machine learning, clustering, or rule-based procedures. This research proposes the idea of an easy and fully automated data cleaning process (Krishnan, Haas, Franklin and Wu, 2016).…”
Section: State of the Art
Mentioning confidence: 99%
“…For instance, scalability suffers because the code for data cleaning must be created in house according to the company's requirements, and automatic recoverability is limited by the need for human intervention to restore the data cleaning script. Hence, we need to avoid human intervention in the process (Krishnan et al, 2016).…”
Section: Intuitive Proposal for Data Cleaning
Mentioning confidence: 99%
“…This is referred to as editable shared representations between computers and humans [ 26 ]. Examples include natural language interfaces and form-based input [ 27 ]. Finally, domain experts are highly trained individuals, which allows systems to accelerate their input by using domain-specific assumptions and ontologies [ 28 , 29 ].…”
Section: Introduction
Mentioning confidence: 99%
“…In contrast, tools at earlier pipeline stages have been designed mainly for data scientists and not for experts. However, domain experts are involved at every stage of the pipeline [ 27 - 31 ], especially in clinical research settings where data sets contain specialized information. Thus, there is a need to amplify domain expertise throughout the pipeline.…”
Section: Introduction
Mentioning confidence: 99%
“…Motivation. Sampling-based approaches have been adopted to alleviate the burden of big data volume, not only when approximate results are as useful as exact ones [1-5], but also when the results from a small clean sample can be more accurate than those from the entire dirty data [6-9]. It is common practice to iteratively generate small random samples of a big data set to explore the statistical properties of the entire data and define cleaning rules [10-19]. This iterative process becomes impractical or impossible on small computing clusters due to the communication, I/O, and memory costs of cluster computing frameworks that implement a shared-nothing architecture [20-22].…”
Section: Introduction
Mentioning confidence: 99%
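The excerpt above describes an iterative, sample-based cleaning workflow: draw a small random sample of a large dirty data set, inspect its statistics, and derive a cleaning rule from the sample rather than from the full data. The following is a minimal illustrative sketch of that idea on synthetic data; the data set, the sentinel error values, and the range thresholds in `cleaning_rule` are all assumptions made for the example, not details from the cited work.

```python
import random

random.seed(0)

# Synthetic "dirty" data: mostly valid ages, plus injected sentinel errors.
data = [random.randint(18, 90) for _ in range(10_000)]
data += [-1] * 200 + [999] * 100  # assumed dirty values for illustration
random.shuffle(data)

def draw_sample(records, k=500):
    """Draw a small random sample to explore the data's statistics."""
    return random.sample(records, k)

sample = draw_sample(data)

# Inspecting the sample's extremes suggests out-of-range values exist;
# from that observation one might define a simple range rule.
print(min(sample), max(sample))

def cleaning_rule(value, low=0, high=120):
    """Hypothetical range rule derived from inspecting the sample."""
    return low <= value <= high

# The rule, defined on the sample, is then applied to the entire data set.
cleaned = [v for v in data if cleaning_rule(v)]
print(len(data) - len(cleaned))  # records removed by the rule
```

The point of the sketch is the division of labor the excerpt describes: the expensive exploratory step runs on a small sample, while only the cheap, already-defined rule touches the full data.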