Data integration and machine learning

Dong, Xin Luna; Rekatsinas, Theodoros

doi:10.14778/3229863.3229876

Cited by 25 publications

(14 citation statements)

References 45 publications

“…Next to the elimination of existing errors in the input data, procedures for feature engineering, carried out by data scientists (in cooperation with domain experts), are necessary for other domain-specific ML operations. Data cleaning can be split into three parts [43], where error detection like duplicate data, violations of logical constraints, or incorrect value recognition is the first task. Moreover, solving every detected error is a second operation, and the data imputation supplements the missing and incomplete data as the last step.…”

Section: Data Cleaning and Labeling (mentioning)

confidence: 99%

Demystifying MLOps and Presenting a Recipe for the Selection of Open-Source Tools

et al. 2021

Nowadays, machine learning projects have become more and more relevant to various real-world use cases. The success of complex Neural Network models depends upon many factors, as the requirement for structured and machine learning-centric project development management arises. Due to the multitude of tools available for different operational phases, responsibilities and requirements become more and more unclear. In this work, Machine Learning Operations (MLOps) technologies and tools for every part of the overall project pipeline, as well as involved roles, are examined and clearly defined. With the focus on the inter-connectivity of specific tools and comparison by well-selected requirements of MLOps, model performance, input data, and system quality metrics are briefly discussed. By identifying aspects of machine learning, which can be reused from project to project, open-source tools which help in specific parts of the pipeline, and possible combinations, an overview of support in MLOps is given. Deep learning has revolutionized the field of Image processing, and building an automated machine learning workflow for object detection is of great interest for many organizations. For this, a simple MLOps workflow for object detection with images is portrayed.

“…However, their increased usefulness depends on the widespread adoption of ontologies and metadata standards by data providers, a process that is still underway. A promising approach to overcome these limitations has been to use machine learning techniques to support open data integration activities (Dong & Rekatsinas, 2018;Miller, 2018), such as entity matching (Mudgal et al, 2018;Nargesian, Zhu, Pu, & Miller, 2018). These recently proposed techniques could be leveraged and extended for integrating biodiversity and other related datasets.…”

Section: Biodiversity Informatics Challenges and Concluding Remarks (mentioning)

confidence: 99%

A survey of biodiversity informatics: Concepts, practices, and challenges

Gadelha, Siracusa, Dalcin, et al. 2020

WIREs Data Min & Knowl

The unprecedented size of the human population, along with its associated economic activities, has an ever‐increasing impact on global environments. Across the world, countries are concerned about the growing resource consumption and the capacity of ecosystems to provide resources. To effectively conserve biodiversity, it is essential to make indicators and knowledge openly available to decision‐makers in ways that they can effectively use them. The development and deployment of tools and techniques to generate these indicators require having access to trustworthy data from biological collections, field surveys and automated sensors, molecular data, and historic academic literature. The transformation of these raw data into synthesized information that is fit for use requires going through many refinement steps. The methodologies and techniques applied to manage and analyze these data constitute an area usually called biodiversity informatics. Biodiversity data follow a life cycle consisting of planning, collection, certification, description, preservation, discovery, integration, and analysis. Researchers, whether producers or consumers of biodiversity data, will likely perform activities related to at least one of these steps. This article explores each stage of the life cycle of biodiversity data, discussing its methodologies, tools, and challenges. This article is categorized under: Algorithmic Development > Biological Data Mining

“…For biomedical data sets, integration can involve standardization by mapping to ontologies with controlled vocabularies [43-45]. Although current approaches use deep learning for integration [46-50], generating a training corpus and validating results require domain expert input. For example, Cui et al [35] require domain experts to validate data curation efforts for studying sudden death in epilepsy.…”

Section: Challenges In the Data Pipeline (mentioning)

confidence: 99%

Amplifying Domain Expertise in Clinical Data Pipelines

2020

Digitization of health records has allowed the health care domain to adopt data-driven algorithms for decision support. There are multiple people involved in this process: a data engineer who processes and restructures the data, a data scientist who develops statistical models, and a domain expert who informs the design of the data pipeline and consumes its results for decision support. Although there are multiple data interaction tools for data scientists, few exist to allow domain experts to interact with data meaningfully. Designing systems for domain experts requires careful thought because they have different needs and characteristics from other end users. There should be an increased emphasis on the system to optimize the experts’ interaction by directing them to high-impact data tasks and reducing the total task completion time. We refer to this optimization as amplifying domain expertise. Although there is active research in making machine learning models more explainable and usable, it focuses on the final outputs of the model. However, in the clinical domain, expert involvement is needed at every pipeline step: curation, cleaning, and analysis. To this end, we review literature from the database, human-computer information, and visualization communities to demonstrate the challenges and solutions at each of the data pipeline stages. Next, we present a taxonomy of expertise amplification, which can be applied when building systems for domain experts. This includes summarization, guidance, interaction, and acceleration. Finally, we demonstrate the use of our taxonomy with a case study.
