The DrugMatrix Database contains systematically generated toxicogenomics data from short-term in vivo studies for over 600 chemicals. However, most of the potential endpoints in the database are missing due to a lack of experimental measurements. We present our study on leveraging matrix factorization and machine learning methods to predict the missing values in the DrugMatrix, which includes gene expression across eight tissues on two expression platforms along with paired clinical chemistry, hematology, and histopathology measurements. One major challenge we encounter is the skewed distribution of the available measured data, in terms of both tissue sources and values. We propose a method, ToxiCompl, that applies systematic hybrid sampling guided by Bayesian optimization in conjunction with low-rank matrix factorization to recover the missing values. ToxiCompl achieves good training and validation performance from a machine learning perspective. We further conduct an in-depth validation of the predicted data from biological and toxicological perspectives with a series of analyses. These include examining the connectivity pattern of predicted gene expression responses, characterizing molecular pathway-level responses from sets of differentially expressed genes, evaluating known transcriptional biomarkers of tissue toxicity, and characterizing predicted apical endpoints. Our analysis shows that the predicted expression, broadly speaking, aligns with what would be anticipated. For example, in most instances, our prediction offers a connectivity level comparable to that of measured data in connectivity analysis. Using Havcr1, a known transcriptional biomarker of kidney injury, we identify treatments that, based on the predicted expression data, manifest kidney toxicity in a manner that is mechanistically plausible and supported by the literature. 
Characterization of the predicted clinical chemistry data suggests that strong effects are relatively reliably predicted, while more subtle effects pose a greater challenge. In the case of histopathological prediction, we find a significant overprediction due to positivity bias in the measured data. Developing methods to deal with this bias is one of the areas we plan to target for future improvement. The main advantage of the ToxiCompl approach is that, in the absence of additional experimental data, it drastically extends the toxicogenomic landscape into a number of data-poor tissues, thereby allowing researchers to formulate mechanistic hypotheses about effects in tissues that have been underrepresented in the literature. All predicted DrugMatrix data, along with the measured data, is available to the public through an intuitive graphical interface that allows for retrieval and gene set analysis (https://rstudio.niehs.nih.gov/complete_drugmatrix/). Importantly, a clear distinction is made between the measured and predicted data, so researchers are aware of the data source they are working with. All data streams, including gene expression, clinical chemistry, hematology, and histopathology, are made available. We anticipate that this data will facilitate mechanistic and toxicological hypothesis generation regarding chemical effects in a variety of understudied tissues.
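To illustrate the general idea of recovering missing values with low-rank matrix factorization, the following is a minimal sketch using regularized alternating least squares on a small synthetic matrix. It is not the ToxiCompl implementation: the function names `complete_matrix` and `demo`, the rank, the regularization strength, and the synthetic data are all assumptions made for illustration, and the hybrid sampling and Bayesian optimization steps described above are omitted.

```python
import numpy as np

def complete_matrix(X, mask, rank=2, reg=0.1, n_iters=50, seed=0):
    """Estimate a full matrix from observed entries (mask == True).

    Fits X ~ U @ V.T on observed entries only, using regularized
    alternating least squares. Hypothetical helper, not from the paper.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    U = rng.normal(scale=0.1, size=(n_rows, rank))
    V = rng.normal(scale=0.1, size=(n_cols, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Update each row factor using only that row's observed columns.
        for i in range(n_rows):
            obs = mask[i]
            if obs.any():
                Vo = V[obs]
                U[i] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ X[i, obs])
        # Update each column factor using only that column's observed rows.
        for j in range(n_cols):
            obs = mask[:, j]
            if obs.any():
                Uo = U[obs]
                V[j] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ X[obs, j])
    return U @ V.T

def demo():
    """Build a synthetic rank-2 matrix, hide about 40% of it, and complete it."""
    rng = np.random.default_rng(1)
    truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
    mask = rng.random(truth.shape) < 0.6   # True where a value was "measured"
    X = np.where(mask, truth, 0.0)         # unobserved entries are placeholders
    X_hat = complete_matrix(X, mask, rank=2, reg=1e-3, n_iters=100)
    return truth, mask, X_hat

if __name__ == "__main__":
    truth, mask, X_hat = demo()
    err = np.sqrt(np.mean((X_hat - truth)[~mask] ** 2))
    print(f"RMSE on held-out entries: {err:.4f}")
```

In this sketch the entries where `mask` is False play the role of unmeasured endpoints; comparing `X_hat` against `truth` on those entries is only possible here because the data are synthetic.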
Multi-source data-fusion approaches have been developed for estimating regional precipitation. However, few studies have examined the upper limits of accuracy improvement that different fusion approaches can achieve for gridded rainfall data. Here, the potential ranges of accuracy improvement for satellite and reanalysis rainfall products were assessed using various machine learning fusion approaches, including multivariate linear regression (MLR), feedforward neural network (FNN), random forest (RF), and long short-term memory (LSTM), over the Chinese mainland. All four fusion methods reduce errors in the original precipitation products. The upper limits of accuracy improvement in terms of correlation coefficient (CC) and root mean square error (RMSE) were 30.65% and 15.27%, respectively. M-RF showed the best average CC (0.828) and RMSE (4.62 mm/day) across the four seasons. LSTM performed the best under light rainfall events, whereas MLR and RF exhibited better performance under moderate and heavy rainfall events, respectively. Overall, these results serve as a basis for selecting fusion approaches and techniques, based on comprehensive validation across different climate zones, altitudes, and seasons over the Chinese mainland.
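The two evaluation metrics named above can be computed as in the following minimal sketch. The abstract does not state how the percentage improvement was defined, so the `relative_improvement` helper is an assumption (change relative to the original product's score), and all function names are hypothetical.

```python
import numpy as np

def correlation_coefficient(pred, obs):
    """Pearson correlation coefficient (CC) between estimates and observations."""
    return float(np.corrcoef(pred, obs)[0, 1])

def rmse(pred, obs):
    """Root mean square error, in the units of the inputs (e.g. mm/day)."""
    diff = np.asarray(pred, dtype=float) - np.asarray(obs, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def relative_improvement(before, after, higher_is_better):
    """Percentage improvement of a score relative to its original value.

    Assumed definition: for CC a larger value is better, for RMSE a
    smaller value is better.
    """
    change = (after - before) if higher_is_better else (before - after)
    return 100.0 * change / abs(before)

if __name__ == "__main__":
    gauge = [0.0, 2.0, 5.0, 1.0, 8.0]    # made-up reference rainfall, mm/day
    product = [0.5, 1.0, 6.5, 0.0, 6.0]  # made-up original product
    fused = [0.2, 1.8, 5.5, 0.8, 7.5]    # made-up fused estimate
    print(relative_improvement(rmse(product, gauge), rmse(fused, gauge), False))
```

The rainfall values in the usage example are invented purely to exercise the functions and do not come from the study.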