“…Therefore, instead of using QCs, well-known methods are introduced from other omics areas, particularly genomics, that can remove batch effects from subject samples directly. 11 They can be classified into two main approaches: location-scale methods and matrix factorization methods. Location-scale methods assume a model for data distribution within a batch, and adjust the data within each batch to fit this model.…”
mentioning
confidence: 99%
“…Overfitting results from the small number of QCs for the training models and cannot be avoided. Therefore, instead of using QCs, well-known methods are introduced from other omics areas, particularly genomics, that can remove batch effects from subject samples directly. They can be classified into two main approaches: location-scale methods and matrix factorization methods.…”
Untargeted metabolomics based on liquid chromatography−mass spectrometry is affected by nonlinear batch effects, which cover up biological effects, result in nonreproducibility, and are difficult to calibrate. In this study, we propose a novel deep learning model, called Normalization Autoencoder (NormAE), which is based on nonlinear autoencoders (AEs) and adversarial learning. An additional classifier and ranker are trained to provide adversarial regularization during the training of the AE model, latent representations are extracted by the encoder, and then the decoder reconstructs the data without batch effects. The NormAE method was tested on two real metabolomics data sets. After calibration by NormAE, the quality control samples (QCs) for both data sets gathered most closely in a PCA score plot (average distances decreased from 56.550 and 52.476 to 7.383 and 14.075, respectively) and obtained the highest average correlation coefficients (from 0.873 and 0.907 to 0.997 for both). Additionally, NormAE significantly improved biomarker discovery (median number of differential peaks increased from 322 and 466 to 1140 and 1622, respectively). NormAE was compared with four commonly used batch effect removal methods. The results demonstrated that using NormAE produces the best calibration results.
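The location-scale idea described in the citation statement above (assume a distribution model within each batch, then adjust each batch to fit it) can be illustrated with a minimal sketch. This is not NormAE and not any specific published method: it assumes the simplest possible model, in which each feature in each batch has its own mean and standard deviation, and it rescales every batch to the pooled mean and standard deviation. The function name `location_scale_correct` and the simulated data are hypothetical.

```python
import numpy as np

def location_scale_correct(X, batches):
    """Per-batch standardization (an assumed, simplified location-scale model).

    X       : (n_samples, n_features) intensity matrix
    batches : (n_samples,) batch label of each sample
    Each feature is centered and scaled within its batch, then mapped back
    to the pooled mean and standard deviation of that feature.
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    global_mean = X.mean(axis=0)
    global_std = X.std(axis=0, ddof=1)
    corrected = np.empty_like(X)
    for b in np.unique(batches):
        rows = batches == b
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0, ddof=1)
        sd[sd == 0] = 1.0  # guard against features that are constant in a batch
        corrected[rows] = (X[rows] - mu) / sd * global_std + global_mean
    return corrected

# Simulated example: batch 1 has an artificial shift and scale change.
rng = np.random.default_rng(0)
batches = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 3))
X[batches == 1] = X[batches == 1] * 2.0 + 5.0
Xc = location_scale_correct(X, batches)
```

After correction, every batch shares the same per-feature mean and standard deviation. A linear adjustment of this kind cannot remove the nonlinear batch effects that the abstract describes, which is the gap the autoencoder-based approach is meant to address.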
As a powerful phenotyping technology, metabolomics provides new opportunities in biomarker discovery through metabolome-wide association studies (MWAS) and the identification of metabolites having a regulatory effect in various biological processes. While mass spectrometry-based (MS) metabolomics assays are endowed with high throughput and sensitivity, MWAS require long-term data acquisition, which generates an analytical signal drift over time that can hinder the uncovering of real biologically relevant changes. We developed “dbnorm”, a package in the R environment, which allows for an easy comparison of the model performance of advanced statistical tools commonly used in metabolomics to remove batch effects from large metabolomics datasets. “dbnorm” integrates advanced statistical tools to inspect the dataset structure not only at the macroscopic (sample batches) scale, but also at the microscopic (metabolic features) level. To compare the model performance on data correction, “dbnorm” assigns a score that helps users identify the best fitting model for each dataset. In this study, we applied “dbnorm” to two large-scale metabolomics datasets as a proof of concept. We demonstrate that “dbnorm” allows for the accurate selection of the most appropriate statistical tool to efficiently remove the signal drift over time and to focus on the relevant biological components of complex datasets.
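The abstract above does not say how the score is computed, so the following is only a hypothetical illustration of what scoring a correction could look like, not the dbnorm scoring rule. It assumes the score is the largest fraction of any feature's variance that is explained by batch membership (a one-way ANOVA R²), so a lower score means less residual batch effect. The names `batch_r2` and `batch_score` are invented for this sketch.

```python
import numpy as np

def batch_r2(X, batches):
    """Fraction of each feature's variance explained by batch (one-way ANOVA R^2)."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    grand_mean = X.mean(axis=0)
    ss_total = ((X - grand_mean) ** 2).sum(axis=0)
    ss_between = np.zeros(X.shape[1])
    for b in np.unique(batches):
        rows = batches == b
        ss_between += rows.sum() * (X[rows].mean(axis=0) - grand_mean) ** 2
    return ss_between / ss_total

def batch_score(X, batches):
    """Assumed summary score: the worst (largest) per-feature batch R^2."""
    return float(batch_r2(X, batches).max())

# Simulated example: compare raw data with a simple per-batch centering.
rng = np.random.default_rng(1)
batch_labels = np.repeat([0, 1], 40)
raw = rng.normal(size=(80, 4))
raw[batch_labels == 1] += 5.0  # artificial batch shift

centered = raw.copy()
for b in np.unique(batch_labels):
    rows = batch_labels == b
    centered[rows] -= centered[rows].mean(axis=0)

score_raw = batch_score(raw, batch_labels)
score_centered = batch_score(centered, batch_labels)
```

Under this assumed score, the candidate correction with the lowest value would be preferred. In this simulation the centered data score essentially zero, while the raw data score close to one.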
Background
Each mother–child dyad represents a unique combination of genetic and environmental factors. This constellation of variables impacts the expression of countless genes. Numerous studies have uncovered changes in DNA methylation (DNAm), a form of epigenetic regulation, in offspring related to maternal risk factors. How these changes work together to link maternal-child risks to childhood cardiometabolic and neurocognitive traits remains unknown. This question is a key research priority as such traits predispose to future non-communicable diseases (NCDs). We propose viewing risk and the genome through a multidimensional lens to identify common DNAm patterns shared among diverse risk profiles.
Methods
We identified multifactorial Maternal Risk Profiles (MRPs) generated from population-based data (n = 15,454, Avon Longitudinal Study of Parents and Children (ALSPAC)). Using cord blood HumanMethylation450 BeadChip data, we identified genome-wide patterns of DNAm that co-vary with these MRPs. We tested the prospective relation of these DNAm patterns (n = 914) to future outcomes using decision tree analysis. We then tested the reproducibility of these patterns in (1) DNAm data at age 7 and 17 years within the same cohort (n = 973 and 974, respectively) and (2) cord DNAm in an independent cohort, the Generation R Study (n = 686).
Results
We identified twenty MRP-related DNAm patterns at birth in ALSPAC. Four were prospectively related to cardiometabolic and/or neurocognitive childhood outcomes. These patterns were replicated in DNAm data from blood collected at later ages. Three of these patterns were externally validated in cord DNAm data in Generation R. Compared to previous literature, DNAm patterns exhibited novel spatial distribution across the genome that intersects with chromatin functional and tissue-specific signatures.
Conclusions
To our knowledge, we are the first to leverage multifactorial population-wide data to detect patterns of variability in DNAm. This context-based approach decreases biases stemming from overreliance on specific samples or variables. We discovered molecular patterns demonstrating prospective and replicable relations to complex traits. Moreover, results suggest that patterns harbour a genome-wide organisation specific to chromatin regulation and target tissues. These preliminary findings warrant further investigation to better reflect the reality of human context in molecular studies of NCDs.