In this paper we propose a Bayesian multi-output regressor stacking (BMORS) model that generalizes the multi-trait regressor stacking method. The proposed BMORS model consists of two stages: in the first stage, a univariate genomic best linear unbiased prediction (GBLUP) model that includes the genotype × environment (G×E) interaction is fitted for each of the L traits under study; in the second stage, the predictions for all traits are included as covariates in a Ridge regression model. The main objectives of this research were to study alternatives to the existing Bayesian multi-trait multi-environment (BMTME) model with respect to (1) genomic-enabled prediction accuracy, and (2) potential advantages in terms of computing resources and implementation. We compared the predictions of the BMORS model to those of the univariate GBLUP model using 7 maize and wheat datasets. We found that the proposed BMORS model produced predictions similar in accuracy to those of the univariate GBLUP model and the BMTME model; however, the best predictions were obtained under the BMTME model. In terms of computing resources, we found that BMORS is at least 9 times faster than the BMTME method. Based on our empirical findings, the proposed BMORS model is an alternative for predicting multi-trait and multi-environment data, which are very common in genomic-enabled prediction in plant and animal breeding programs.
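The two-stage idea described in this abstract can be illustrated with a small, non-Bayesian sketch. Several things here are assumptions made for illustration and are not taken from the paper: stage 1 uses plain ridge regression on markers as a stand-in for the univariate GBLUP model, the G×E term is omitted, the penalties `lam1` and `lam2` are fixed by hand, and all function names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, y, lam):
    """Ridge coefficients (X'X + lam*I)^(-1) X'y, without an intercept."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


def predict(X, beta):
    """Linear predictions X @ beta for a list of rows X."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


def stacking_fit_predict(X_train, Y_train, X_test, lam1=1.0, lam2=1.0):
    """Two-stage regressor stacking; returns one row of trait predictions per test line."""
    n_traits = len(Y_train[0])
    cols = [[row[t] for row in Y_train] for t in range(n_traits)]
    # Stage 1: one univariate model per trait (ridge on markers as a stand-in).
    stage1 = [ridge_fit(X_train, y, lam1) for y in cols]
    Z_train = [list(r) for r in zip(*[predict(X_train, b) for b in stage1])]
    Z_test = [list(r) for r in zip(*[predict(X_test, b) for b in stage1])]
    # Stage 2: ridge regression of each trait on the stage-1 predictions of all traits.
    stage2 = [ridge_fit(Z_train, y, lam2) for y in cols]
    return [list(r) for r in zip(*[predict(Z_test, b) for b in stage2])]
```

In practice the two penalties would be tuned or, as in a Bayesian treatment, estimated from the data; the sketch only shows how stage-1 predictions become stage-2 covariates.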
The paradigm called genomic selection (GS) is a revolutionary way of developing new plants and animals. This is a predictive methodology, since it uses learning methods to perform its task. Unfortunately, there is no universal model that can be used for all types of predictions; for this reason, specific methodologies are required for each type of output (response variable). Since there is a lack of efficient methodologies for multivariate count data outcomes, in this paper a multivariate Poisson deep neural network (MPDN) model is proposed for the genomic prediction of several count outcomes simultaneously. The MPDN model uses the negative log-likelihood of a Poisson distribution as its loss function, the rectified linear unit (ReLU) activation function in the hidden layers to capture nonlinear patterns, and the exponential activation function in the output layer to produce outputs on the same scale as the counts. The proposed MPDN model was compared to conventional generalized Poisson regression models and univariate Poisson deep learning models in two experimental data sets of count data. We found that the proposed MPDN model outperformed univariate Poisson deep neural network models but, in terms of prediction, did not outperform the univariate generalized Poisson regression models. All deep learning models were implemented with TensorFlow as the back-end and Keras as the front-end, which allows these models to be fitted to moderate and large data sets, a significant advantage over previous GS models for multivariate count data.
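The three ingredients named in this abstract (ReLU hidden units, an exponential output activation and a Poisson negative log-likelihood loss) can be sketched in plain Python. This is only an illustration with one hidden layer and hand-supplied weights; the function names and the network shape are assumptions, not the authors' Keras implementation.

```python
import math


def relu(v):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, x) for x in v]


def dense(x, W, b):
    """Affine layer; W holds one row of weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(x, W1, b1, W2, b2):
    """One hidden ReLU layer, then an exponential output per count trait."""
    h = relu(dense(x, W1, b1))
    eta = dense(h, W2, b2)
    # exp() keeps every predicted mean strictly positive, as a Poisson mean must be.
    return [math.exp(e) for e in eta]


def poisson_nll(y, mu):
    """Mean Poisson negative log-likelihood, including the log(y!) constant."""
    total = sum(m - yi * math.log(m) + math.lgamma(yi + 1) for yi, m in zip(y, mu))
    return total / len(y)
```

The `log(y!)` term does not depend on the network weights, so a loss that omits it (as the built-in Keras Poisson loss does) leads to the same fitted model.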
In genomic-enabled prediction, improving the accuracy of the prediction of lines in environments is difficult because the available information is generally sparse and the correlations between traits are usually low. In current genomic selection, although researchers have a large amount of information and appropriate statistical models to process it, computing efficiency is still limited. Although some statistical models are mathematically elegant, many of them are also computationally inefficient, and they are impractical for many traits, lines, environments, and years because they need to sample from huge multivariate normal distributions. For these reasons, this study explores two recommender systems, item-based collaborative filtering (IBCF) and the matrix factorization (MF) algorithm, in the context of multiple traits and multiple environments. The IBCF and MF methods were compared with two conventional methods on simulated and real data. Results for the simulated and real data sets show that the IBCF technique was slightly better in terms of prediction accuracy than the two conventional methods and the MF method when the correlation was moderately high. The IBCF technique is very attractive because it produces good predictions when there is high correlation between items (environment–trait combinations) and its implementation is computationally feasible, which can be useful for plant breeders who deal with very large data sets.
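A minimal sketch of item-based collaborative filtering, assuming rows are lines, columns are items (environment–trait combinations) and missing phenotypes are stored as `None`. Cosine similarity over co-observed rows and the similarity-weighted average are one common formulation; the function names are hypothetical, and any scaling or neighbour selection used in the study is not reproduced here.

```python
import math


def cosine_sim(R, j, k):
    """Cosine similarity between item columns j and k over co-observed rows."""
    rows = [r for r in R if r[j] is not None and r[k] is not None]
    num = sum(r[j] * r[k] for r in rows)
    den = (math.sqrt(sum(r[j] ** 2 for r in rows))
           * math.sqrt(sum(r[k] ** 2 for r in rows)))
    return num / den if den > 0 else 0.0


def predict_ibcf(R, i, j):
    """Predict the missing entry R[i][j] from the other items observed for line i."""
    num = den = 0.0
    for k in range(len(R[0])):
        if k == j or R[i][k] is None:
            continue
        s = cosine_sim(R, j, k)
        num += s * R[i][k]   # similar items pull the prediction toward their value
        den += abs(s)
    return num / den if den > 0 else None
```

Because each prediction only needs item-to-item similarities, no sampling from large multivariate distributions is involved, which is the computational appeal noted in the abstract.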
When a plant scientist wishes to make genomic-enabled predictions of multiple traits measured in multiple individuals in multiple environments, the most common strategy for performing the analysis is to use a single trait at a time taking into account genotype × environment interaction (G × E), because there is a lack of comprehensive models that simultaneously take into account the correlated counting traits and G × E. For this reason, in this study we propose a multiple-trait and multiple-environment model for count data. The proposed model was developed under the Bayesian paradigm for which we developed a Markov Chain Monte Carlo (MCMC) with noninformative priors. This allows obtaining all required full conditional distributions of the parameters leading to an exact Gibbs sampler for the posterior distribution. Our model was tested with simulated data and a real data set. Results show that the proposed multi-trait, multi-environment model is an attractive alternative for modeling multiple count traits measured in multiple environments.
KEYWORDS: count phenotype; multi-trait; multi-environment; Bayesian; genomic-enabled prediction; GenPred; shared data resource; genomic selection
Plant breeders need more efficient models for performing genomic selection for multiple-traits and multiple-environments for count data. Count data are those dependent variables that take values 0, 1, 2, ... without an explicit upper limit. These types of dependent variables are common in genomic selection, for example: panicle number per plant, seed number per plant, number of infected spikelets per plant, etc. Due to its simplicity and its ability to generate samples from high-dimensional probability distributions, the Gibbs sampler is one of the most popular computationally intensive methods for fitting complex multilevel models (Park and van Dyk 2009). This method is also very popular for modeling normal and binary responses when efficient closed-form Gibbs samplers have been developed.
However, obtaining a closed-form Gibbs sampler for count data is not straightforward. For this reason, Montesinos-López et al. (2015, 2016a), in the context of genomic-enabled prediction and genomic selection, proposed closed-form Gibbs samplers for multilevel models for univariate count responses with and without the genotype × environment interaction (G × E) term, which help fill the lack of closed-form Gibbs samplers for count data. Although these models are helpful for modeling univariate count responses, breeders often record phenotypic data for multiple counts. Scientists must take advantage of correlated traits to improve the prediction of unobserved genotypes and to increase the prediction accuracy of other count traits that are difficult to measure but that are associated with traits that are easy to measure. The available univariate count models are not appropriate for dealing with these situations. Since prediction problems are ubiquitous and of great interest and importance in statistical science, more attention has been given to param...
The primary objective of this paper is to provide a guide on implementing Bayesian generalized kernel regression methods for genomic prediction in the statistical software R. Such methods are quite efficient for capturing complex non-linear patterns that conventional linear regression models cannot. Furthermore, these methods are also powerful for leveraging environmental covariates, such as genotype × environment (G×E) prediction, among others. In this study we provide the building process of seven kernel methods: linear, polynomial, sigmoid, Gaussian, Exponential, Arc-cosine 1 and Arc-cosine L. Additionally, we highlight illustrative examples for implementing exact kernel methods for genomic prediction under a single-environment, a multi-environment and multi-trait framework, as well as for the implementation of sparse kernel methods under a multi-environment framework. These examples are followed by a discussion on the strengths and limitations of kernel methods and, subsequently by conclusions about the main contributions of this paper.
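Two of the kernels listed in this abstract can be built from a marker matrix in a few lines. The guide itself targets R; the sketch below uses Python only for illustration, treats the Gaussian bandwidth `gamma` as a free parameter (any scaling by the number of markers is left to the caller), and uses the standard order-1 arc-cosine formula. Function names are hypothetical.

```python
import math


def dot(x, z):
    """Inner product of two marker vectors."""
    return sum(a * b for a, b in zip(x, z))


def gaussian_kernel(X, gamma=1.0):
    """K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    n = len(X)
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
             for j in range(n)] for i in range(n)]


def arc_cosine1_kernel(X):
    """Order-1 arc-cosine kernel:
    K[i][j] = (1/pi) * ||x_i|| * ||x_j|| * (sin(t) + (pi - t) * cos(t)),
    where t is the angle between x_i and x_j.
    """
    n = len(X)
    norms = [math.sqrt(dot(x, x)) for x in X]
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # a zero vector gives a zero kernel value
            c = dot(X[i], X[j]) / (norms[i] * norms[j])
            t = math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error
            K[i][j] = norms[i] * norms[j] * (math.sin(t) + (math.pi - t) * math.cos(t)) / math.pi
    return K
```

Either matrix can then replace the linear genomic relationship matrix in a GBLUP-type model, which is how a kernel method captures non-linear marker effects.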
When multi-trait data are available, the preferred models are those that are able to account for correlations between phenotypic traits because when the degree of correlation is moderate or large, this increases the genomic prediction accuracy. For this reason, in this paper we explore Bayesian multi-trait kernel methods for genomic prediction and we illustrate the power of these models with three real datasets. The kernels under study were the linear, Gaussian, polynomial and sigmoid kernels; they were compared with the conventional Ridge regression and GBLUP multi-trait models. The results show that, in general, the Gaussian kernel method outperformed conventional Bayesian Ridge and GBLUP multi-trait linear models by 2.2 to 17.45% (datasets 1 to 3) in terms of prediction performance based on the mean square error of prediction. This improvement in terms of prediction performance of the Bayesian multi-trait kernel method can be attributed to the fact that the proposed model is able to capture non-linear patterns more efficiently than linear multi-trait models. However, not all kernels perform well in the datasets used for evaluation, which is why more than one kernel should be evaluated to be able to choose the best kernel.
Introduction: Over a third of the communities (39%) in the Central Valley of California, a richly diverse and important agricultural region, are classified as disadvantaged, with inadequate access to healthcare, lower socio-economic status, and higher exposure to air and water pollution. The majority of racial and ethnic minorities are also at higher risk of COVID-19 infection, hospitalization, and death according to the Centers for Disease Control and Prevention. Healthy Central Valley Together established a wastewater-based disease surveillance (WDS) program that aims to achieve greater health equity in the region through partnership with Central Valley communities and the Sewer Coronavirus Alert Network. WDS offers a cost-effective strategy to monitor trends in SARS-CoV-2 community infection rates. Methods: In this study, we evaluated correlations between public health and wastewater data (represented as SARS-CoV-2 target gene copies normalized by pepper mild mottle virus target gene copies) collected for three Central Valley communities over two periods of COVID-19 infection waves between October 2021 and September 2022. Public health data included clinical case counts at county and sewershed scales as well as COVID-19 hospitalization and intensive care unit admissions. Lag-adjusted hospitalization:wastewater ratios were also evaluated as a retrospective metric of disease severity and corollary to hospitalization:case ratios. Results: Consistent with other studies, strong correlations were found between wastewater and public health data.
However, a significant reduction in case:wastewater ratios was observed for all three communities from the first to the second wave of infections, decreasing from an average of 4.7 ± 1.4 over the first infection wave to 0.8 ± 0.4 over the second. Discussion: The decline in case:wastewater ratios was likely due to reduced clinical testing availability and test seeking behavior, highlighting how WDS can fill data gaps associated with under-reporting of cases. Overall, the hospitalization:wastewater ratios remained more stable through the two waves of infections, averaging 0.5 ± 0.3 and 0.3 ± 0.4 over the first and second waves, respectively.