Overview and Importance of Data Quality for Machine Learning Tasks

Jain, Abhinav; Patel, Hima; Nagalapatti, Lokesh; Gupta, Nitin; Mehta, Sameep; Guttula, Shanmukha; Mujumdar, Shashank; Afzal, Shazia; Mittal, Ruhi Sharma; Munigala, Vitobha

doi:10.1145/3394486.3406477

Cited by 129 publications

(60 citation statements)

References 8 publications

“…Such good results are due mainly to the good quality of entry data that was manually created and configured by expert linguists. Our methodology confirms the importance of data quality over quantity for ML applications [70].…”

Section: Input Processing (supporting)

confidence: 71%

Diabetes and conversational agents: the AIDA project case study

Alloatti, Bosca, et al. 2021

Discov Artif Intell

One of the key aspects in the process of caring for people with diabetes is Therapeutic Education (TE). TE is a teaching process for training patients so that they can self-manage their care plan. Alongside traditional methods of providing educational content, there are now alternative forms of delivery thanks to the implementation of advanced Information Technologies systems such as conversational agents (CAs). In this context, we present the AIDA project: an ensemble of two different CAs intended to provide a TE tool for people with diabetes. The Artificial Intelligence Diabetes Assistant (AIDA) consists of a text-based chatbot and a speech-based dialog system. Their content has been created and validated by a scientific board. AIDA Chatbot—the text-based agent—provides a broad spectrum of information about diabetes, while AIDA Cookbot—the voice-based agent—presents recipes compliant with a diabetic patient’s diet. We provide a thorough description of the development process for both agents, the technology employed and their usage by the general public. AIDA Chatbot and AIDA Cookbot are freely available and they represent the first example of conversational agents in Italian to support diabetes patients, clinicians and caregivers.

“…[5]. For example, it may include details of how a dataset fares across certain pre-defined quality metrics known to influence model building efforts [12]. The specific format of visualizing this information may vary depending on output requirements and constraints.…”

Section: Baseline Data Quality and Readiness Analysis (mentioning)

confidence: 99%

“…These mainly contain the key sections covered in the data readiness report template described in Figure 3. We made use of some of the machine learning related quality metrics mentioned in [12] for illustration. Due to space constraints, we include only limited features just to exemplify how key information from the quality analysis process can be represented in the report.…”

Section: References (mentioning)

confidence: 99%

Data Readiness Report

Afzal, Kesarwani, et al. 2020

Preprint

Self Cite

Data exploration and quality analysis is an important yet tedious process in the AI pipeline. Current practices of data cleaning and data readiness assessment for machine learning tasks are mostly conducted in an arbitrary manner which limits their reuse and results in loss of productivity. We introduce the concept of a Data Readiness Report as an accompanying documentation to a dataset that allows data consumers to get detailed insights into the quality of input data. Data characteristics and challenges on various quality dimensions are identified and documented keeping in mind the principles of transparency and explainability. The Data Readiness Report also serves as a record of all data assessment operations including applied transformations. This provides a detailed lineage for the purpose of data governance and management. In effect, the report captures and documents the actions taken by various personas in a data readiness and assessment workflow. Over time this becomes a repository of best practices and can potentially drive a recommendation system for building automated data readiness workflows on the lines of AutoML [8]. We anticipate that together with the Datasheets [9], Dataset Nutrition Label [11], FactSheets [1] and Model Cards [15], the Data Readiness Report makes significant progress towards Data and AI lifecycle documentation. CCS Concepts: • General and reference → Evaluation; • Software and its engineering → Documentation; • Human-centered computing → Walkthrough evaluations.

“…While other topics are actively researched in the ML literature, such as the improvement of data quality [6][7][8] or the development of better-performing models [9][10][11], the metrics used to evaluate these predictive pipelines have taken a relatively minimal place in this field. Caruana and Niculescu-Mizil [12] provided one of the earliest comprehensive works on the topic, presenting nine performance metrics for binary classification, which they divided into three groups: threshold metrics, ordering/rank metrics, and probability metrics.…”

Section: Introduction (mentioning)

confidence: 99%

Exploratory Analysis on Pixelwise Image Segmentation Metrics with an Application in Proximal Sensing

et al. 2022

A considerable number of metrics can be used to evaluate the performance of machine learning algorithms. While much work is dedicated to the study and improvement of data quality and models’ performance, much less research is focused on the study of these evaluation metrics, their intrinsic relationship, the interplay of the influence among the metrics, the models, the data, and the environments and conditions in which they are to be applied. While some works have been conducted on general machine learning tasks such as classification, fewer efforts have been dedicated to more complex problems such as object detection and image segmentation, in which the evaluation of performance can vary drastically depending on the objectives and domains of application. Working in an agricultural context, specifically on the problem of the automatic detection of plants in proximal sensing images, we studied twelve evaluation metrics that we used to evaluate three image segmentation models recently presented in the literature. After a unified presentation of these metrics, we carried out an exploratory analysis of their relationships using a correlation analysis, a clustering of variables, and two factorial analyses (namely principal component analysis and multiple factorial analysis). We distinguished three groups of highly linked metrics and, through visual inspection of the representative images of each group, identified the aspects of segmentation that each group evaluates. The aim of this exploratory analysis was to provide some clues to practitioners for understanding and choosing the metrics that are most relevant to their agricultural task.
