Background: Untargeted metabolomics datasets contain large proportions of uninformative features that can impede subsequent statistical analysis such as biomarker discovery and metabolic pathway analysis. Thus, there is a need for versatile and data-adaptive methods for filtering data prior to investigating the underlying biological phenomena. Here, we propose a data-adaptive pipeline for filtering metabolomics data that are generated by liquid chromatography-mass spectrometry (LC-MS) platforms. Our data-adaptive pipeline includes novel methods for filtering features based on blank samples, proportions of missing values, and estimated intra-class correlation coefficients. Results: Using metabolomics datasets that were generated in our laboratory from samples of human blood, as well as two public LC-MS datasets, we compared our data-adaptive filtering method with traditional methods that rely on non-method specific thresholds. The data-adaptive approach outperformed traditional approaches in terms of removing noisy features and retaining high quality, biologically informative ones. The R code for running the data-adaptive filtering method is provided at https://github.com/courtneyschiffman/Metabolomics-Filtering . Conclusions: Our proposed data-adaptive filtering pipeline is intuitive and effectively removes uninformative features from untargeted metabolomics datasets. It is particularly relevant for interrogation of biological phenomena in data derived from complex matrices associated with biospecimens. Electronic supplementary material: The online version of this article (10.1186/s12859-019-2871-9) contains supplementary material, which is available to authorized users.
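As a rough illustration of the three filtering criteria named in the abstract above (blank samples, proportion of missing values, and intra-class correlation), the following Python sketch applies them to a single feature. It is not the authors' R implementation. The function names, the input layout, and the fixed cutoffs `max_blank_ratio`, `max_missing`, and `min_icc` are assumptions made for this example; the paper's contribution is precisely that such cutoffs are chosen from the data rather than fixed in advance. The ICC here is the simple one-way ANOVA estimator for balanced replicates, which may differ from the estimator used in the paper.

```python
from statistics import mean


def missing_fraction(values):
    """Fraction of entries that are None (feature not detected)."""
    return sum(v is None for v in values) / len(values)


def one_way_icc(groups):
    """ICC(1) from a balanced one-way ANOVA.

    `groups` holds one list of replicate abundances per subject;
    every list must have the same length k >= 2.
    """
    n = len(groups)
    k = len(groups[0])
    grand = mean(x for g in groups for x in g)
    # Between-subject and within-subject mean squares.
    msb = k * sum((mean(g) - grand) ** 2 for g in groups) / (n - 1)
    msw = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    # No variation at all: treat the feature as uninformative.
    return 0.0 if denom == 0 else (msb - msw) / denom


def keep_feature(blank, samples, replicates,
                 max_blank_ratio=0.5, max_missing=0.5, min_icc=0.4):
    """Apply the three filters with fixed, purely illustrative cutoffs."""
    detected = [v for v in samples if v is not None]
    if not detected:
        return False
    if missing_fraction(samples) > max_missing:
        return False
    # Features about as abundant in blanks as in samples are likely background.
    if mean(blank) / mean(detected) > max_blank_ratio:
        return False
    return one_way_icc(replicates) >= min_icc


# Hypothetical abundances for one feature.
good_reps = [[1.0, 1.1], [5.0, 5.1], [9.0, 9.2]]
print(keep_feature([0.1, 0.1], [1.0, 5.0, None, 9.0], good_reps))
```

In a data-adaptive setting, the three cutoffs would be derived from the observed distributions of blank ratios, missingness, and ICC values across all features rather than passed in as constants.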
Introduction: For pediatric diseases like childhood leukemia, a short latency period points to in-utero exposures as potentially important risk factors. Untargeted metabolomics of small molecules in archived newborn dried blood spots (DBS) offers an avenue for discovering early-life exposures that contribute to disease risks. Objectives: The purpose of this study was to develop a quantitative method for untargeted analysis of archived newborn DBS for use in an epidemiological study (California Childhood Leukemia Study, CCLS). Methods: Using experimental DBS from the blood of an adult volunteer, we optimized extraction of small molecules and integrated measurement of potassium as a proxy for blood hematocrit. We then applied this extraction method to 4.7-mm punches from 106 control DBS samples from the CCLS. Sample extracts were analyzed with liquid chromatography high resolution mass spectrometry (LC-HRMS) and an untargeted workflow was used to screen for metabolites that discriminate population characteristics such as sex, ethnicity, and birth weight. Results: Thousands of small molecules were measured in extracts of archived DBS. Normalizing for potassium levels removed variability related to varying hematocrit across DBS punches. Of the roughly 1,000 prevalent small molecules that were tested, multivariate linear regression detected significant associations with ethnicity (3 metabolites) and birth weight (15 metabolites) after adjusting for multiple testing. Conclusions: This untargeted workflow can be used for analysis of small molecules in archived DBS to discover novel biomarkers, to provide insights into the initiation and progression of diseases, and to provide guidance for disease prevention.
Background: Previously, using microarrays and mRNA-Sequencing (mRNA-Seq), we found that occupational exposure to a range of benzene levels perturbed gene expression in peripheral blood mononuclear cells. Objectives: In the current study, we sought to identify gene expression biomarkers predictive of benzene exposure below 1 part per million (ppm), the occupational standard in the U.S. Methods: First, we used the nCounter platform to validate altered expression of 30 genes in 33 unexposed controls and 57 subjects exposed to benzene (<1 to ≥5 ppm). Second, we used SuperLearner (SL) to identify a minimal number of genes for which altered expression could predict <1 ppm benzene exposure, in 44 subjects with a mean air benzene level of 0.55±0.248 ppm (minimum 0.203 ppm). Results: nCounter and microarray expression levels were highly correlated (coefficients >0.7, p<0.05) for 26 microarray-selected genes. nCounter and mRNA-Seq levels were poorly correlated for 4 mRNA-Seq-selected genes. Using negative binomial regression with adjustment for covariates and multiple testing, we confirmed differential expression of 23 microarray-selected genes in the entire benzene-exposed group, and 27 genes in the <1 ppm-exposed subgroup, compared with the control group. Using SL, we identified 3 pairs of genes that could predict <1 ppm benzene exposure with cross-validated AUC estimates >0.9 (p<0.0001) and were not predictive of other exposures (nickel, arsenic, smoking, stress). The predictive gene pairs are PRG2/CLEC5A, NFKBI/CLEC5A, and ACSL1/CLEC5A. They play roles in innate immunity and inflammatory responses. Conclusions: Using nCounter and SL, we validated the altered expression of multiple mRNAs by benzene and identified gene pairs predictive of exposure to benzene at levels below the US occupational standard of 1 ppm.
Metabolism of chemicals from the diet, exposures to xenobiotics, the microbiome, and lifestyle factors (e.g., smoking, alcohol intake) produce electrophiles that react with nucleophilic sites in circulating proteins, notably Cys34 of human serum albumin (HSA). To discover potential risk factors resulting from in utero exposures, we are investigating HSA-Cys34 adducts in archived
Early-life exposures are believed to influence the incidence of pediatric acute lymphoblastic leukemia (ALL). Archived neonatal blood spots (NBS), collected within the first days of life, offer a means to investigate small molecules that reflect early-life exposures. Using untargeted metabolomics, we compared abundances of small-molecule features in extracts of NBS punches from 332 children that later developed ALL and 324 healthy controls. Subjects were stratified by early (1-5 y) and late (6-14 y) diagnosis. Mutually exclusive sets of metabolic features, representing putative lipids and fatty acids, were associated with ALL, including 9 and 19 metabolites in the early- and late-diagnosis groups, respectively. In the late-diagnosis group, a prominent cluster of features with apparent 18:2 fatty-acid chains suggested that newborn exposure to the essential nutrient, linoleic acid, increased ALL risk. Interestingly, abundances of these putative 18:2 lipids were greater in infants who were fed formula rather than breast milk (colostrum) and increased with the mother's pre-pregnancy body mass index. These results suggest possible etiologic roles of newborn nutrition in late-diagnosis ALL.