Dataset integration is common practice to overcome limitations in statistically underpowered omics datasets. Proteome datasets display high technical variability and frequent missing values. Sophisticated strategies for batch effect reduction are lacking or rely on error-prone data imputation. Here we introduce HarmonizR, a data harmonization tool with appropriate missing value handling. The method exploits the structure of the available data and uses matrix dissection to minimize data loss, without data imputation. The strategy implements two common batch effect reduction methods: ComBat and limma (removeBatchEffect()). Evaluated on four exemplary datasets with up to 23 batches, HarmonizR successfully harmonized data across different tissue preservation techniques, LC-MS/MS instrumentation setups, and quantification approaches. Compared to data imputation methods, HarmonizR was more efficient and performed better in detecting significant proteins. HarmonizR is an efficient tool for missing-data-tolerant reduction of experimental variance and is easily adjustable to individual dataset properties and user preferences.
Investigating the proteome can add a significant layer of information to the manifold existing methylation, mutation, and transcriptome data on brain tumors, as proteins represent the pharmacologically addressable phenotype of a disease. Small cohorts limit the usability and validity of statistical methods, and variable technical setups and high numbers of missing values make data integration from public sources challenging. Using a newly developed framework that reduces batch effects without the need for data reduction or missing value imputation, we show, based on in-house and publicly available datasets, successful integration of proteomic data across different tissue types, quantification platforms, and technical setups. As an example, data from a Sonic hedgehog (Shh) medulloblastoma mouse model were analyzed, showing efficient data integration independent of tissue preservation strategy or batch. We further integrated batches of publicly available data on human brain tumors, confirming proposed proteomic cancer subtypes that correlate with clinical features. We show that missing-value-tolerant reduction of technical variance may help to identify biomarkers, proteomic signatures, and altered pathways characteristic of molecular brain cancer subtypes.
Dataset integration is common practice to overcome limitations, e.g., in statistically underpowered omics datasets. This is of particular importance when analyzing rare tumor entities. However, combining datasets introduces biases, so-called 'batch effects', which arise from differences in quantification techniques, laboratory equipment, or the tissue type used. A common problem is missing quantification for features such as gene transcripts or proteins within a dataset. These missing values can appear at random in a given dataset and are also introduced by combining multiple datasets. Currently, strategies for batch effect reduction beyond common normalization are either missing entirely or cannot handle absent data points and therefore rely on error-prone data imputation. We introduce a framework that enables batch effect adjustment for combined datasets while avoiding data loss by appropriately handling missing values without imputation. The underlying idea is a matrix dissection approach that adjusts the information shared across the integrated dataset while guaranteeing sufficient data presence. The strategy is implemented within the R environment and linked with popular software stacks built on top of R. Successful data adjustment is shown by way of example for proteomic data generated by different quantification approaches and LC-MS/MS instrumentation setups.
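The matrix dissection idea can be illustrated with a minimal sketch. It is not the published R implementation, and everything in it is an assumption made for illustration: the function names, the threshold of two observed values per batch, the choice to pass through features usable in fewer than two batches, and the choice to blank values in batches below the threshold. Per-batch mean centering stands in for ComBat or limma's removeBatchEffect(), which are far more elaborate.

```python
from collections import defaultdict
from statistics import mean

# Assumed threshold: a batch is "usable" for a feature only if it holds
# at least this many observed (non-missing) values.
MIN_VALUES_PER_BATCH = 2


def usable_batches(row, batch_of, min_values=MIN_VALUES_PER_BATCH):
    """Return the sorted tuple of batches with enough observed values."""
    counts = defaultdict(int)
    for sample, value in row.items():
        if value is not None:
            counts[batch_of[sample]] += 1
    return tuple(sorted(b for b, n in counts.items() if n >= min_values))


def dissect(matrix, batch_of):
    """Group feature names by their pattern of usable batches."""
    groups = defaultdict(list)
    for feature, row in matrix.items():
        groups[usable_batches(row, batch_of)].append(feature)
    return dict(groups)


def center_batches(row, batch_of, batches):
    """Stand-in adjustment (NOT ComBat/limma): shift every usable batch
    to the feature's mean over all usable batches. Values outside the
    usable batches become None (a choice made for this sketch only)."""
    observed = {s: v for s, v in row.items()
                if v is not None and batch_of[s] in batches}
    grand_mean = mean(observed.values())
    batch_mean = {b: mean(v for s, v in observed.items() if batch_of[s] == b)
                  for b in batches}
    return {s: observed[s] - batch_mean[batch_of[s]] + grand_mean
            if s in observed else None
            for s in row}


def harmonize(matrix, batch_of):
    """Dissect, adjust each sub-matrix separately, and reassemble."""
    result = {}
    for batches, features in dissect(matrix, batch_of).items():
        for feature in features:
            if len(batches) >= 2:
                result[feature] = center_batches(matrix[feature], batch_of, batches)
            else:
                # Fewer than two usable batches: nothing to adjust against,
                # so the row is passed through unchanged (assumption).
                result[feature] = dict(matrix[feature])
    return result


# Hypothetical toy data: three features, six samples in three batches.
batch_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
matrix = {
    "P1": {"a1": 10.0, "a2": 12.0, "b1": 20.0, "b2": 22.0, "c1": 30.0, "c2": 32.0},
    "P2": {"a1": 5.0, "a2": 7.0, "b1": None, "b2": None, "c1": 15.0, "c2": 17.0},
    "P3": {"a1": 1.0, "a2": None, "b1": 3.0, "b2": 5.0, "c1": None, "c2": None},
}
harmonized = harmonize(matrix, batch_of)
```

Grouping features by their batch-presence pattern means each sub-matrix is complete enough for the adjustment step, so no value has to be imputed and no feature has to be discarded just because it is absent from some batches.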