Graphs and networks provide a natural way of representing the dependency structure among variables. Network data are increasingly being generated in many scientific areas, including social networks, internet networks, and biological networks. In biology, large-scale protein-protein interaction and regulatory networks are now available. New statistical methods are required to analyze these data effectively and to incorporate prior network information into the analysis of data observed on the nodes of the graphs. Beyond simple summary statistics for these networks, there is a need for formal probability models of how networks are generated. Block models and models with network motifs have been developed. In biology, effectively incorporating prior network information into the analysis of high-dimensional genomics data raises new methodological challenges. There is also a need for methods that can integrate various types of data in order to infer network structures, especially in high-dimensional settings. New methods for studying how diseases spread through networks and for designing intervention strategies using network information are also needed. A special issue of Statistics in Biosciences has been planned that will highlight statistical methods for the analysis of networks and graphs and novel applications of network methods. Submissions should preferably provide methodological advances in modeling graphs and networks, in integrating network/graph information into the analysis of numerical data measured on the nodes, and in learning network/graph structures from various types of data.
In this paper, we develop a graphical modeling framework for the inference of networks across multiple sample groups and data types. In medical studies, this setting arises whenever a set of subjects, which may be heterogeneous due to differing disease stage or subtype, is profiled across multiple platforms, such as metabolomics, proteomics, or transcriptomics. Our proposed Bayesian hierarchical model first links the network structures within each platform using a Markov random field prior to relate edge selection across sample groups, and then links the network similarity parameters across platforms. Our model formulation allows the number of variables and number of subjects to differ across the data types, and only requires that we have data for the same set of groups. We illustrate the proposed approach through both simulation studies and an application to gene expression levels and metabolite abundances on subjects with varying severity levels of Chronic Obstructive Pulmonary Disease (COPD).
Keywords: Markov random field prior; spike and slab prior; chronic obstructive pulmonary disease (COPD)
INTRODUCTION
Gaussian graphical models, which describe the dependence relations among a set of random variables, have been widely applied to estimate biological networks on the basis of high-throughput data. When all samples are collected under similar conditions or reflect a single type of disease, methods such as the graphical lasso (Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Friedman and others, 2008) or Bayesian network inference approaches (Roverato, 2002; Wang and Li, 2012) can be applied to infer a sparse network. In many studies, however, such as the COPDGene study (Regan and others, 2010) of this paper, described below, samples are obtained for different subtypes of disease, varying experimental settings, or other heterogeneous conditions.
In this setting, applying standard graphical model inference approaches to the pooled data across conditions will lead to spurious findings, while separate estimation for each subgroup reduces statistical power. The challenge becomes even more formidable when multiple data types (or platforms) are under consideration, specifically gene expression levels and metabolite abundances in the COPDGene study, measured on multiple subjects. In this case, pooling the data is not appropriate, as it ignores the fact that direct connections between variables of different data types may not be sensible. Nonetheless, analyzing data from each platform separately ignores potential commonalities, for example, that subjects with more advanced disease may have more extensive disruption of functional mechanisms across data types. The need for statistical methods to address these questions is particularly pressing given the increasing number of studies investing in comprehensive profiling of subjects across multiple data platforms. Our proposed statistical method enables joint inference of networks across sample groups and dat...
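For reference, the standard single-group setting that the introduction above builds on can be written compactly. This is a textbook sketch, not the joint multi-platform model proposed by the authors; the symbols ($p$, $\Omega$, $S$, $\lambda$) are notation chosen here and do not come from the paper.

```latex
% Gaussian graphical model: p variables with precision matrix \Omega
X = (X_1, \dots, X_p)^\top \sim \mathcal{N}_p(\mu, \Sigma), \qquad \Omega = \Sigma^{-1} = (\omega_{ij})

% No edge between i and j  <=>  conditional independence given all other variables
\omega_{ij} = 0 \iff X_i \perp X_j \mid \{X_k : k \neq i, j\}

% Graphical lasso: sparse estimate of \Omega from the sample covariance S,
% with penalty parameter \lambda \ge 0 (larger \lambda gives a sparser graph)
\hat{\Omega} = \arg\max_{\Omega \succ 0} \left\{ \log\det\Omega - \operatorname{tr}(S\Omega) - \lambda \lVert \Omega \rVert_1 \right\}
```

Applied to pooled data, this estimator assumes one common $\Omega$ for all subjects. Applied separately per subgroup, it estimates each group's $\Omega$ from that group's samples alone, which is the loss of power described above.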
When analyzing large datasets from high-throughput technologies, researchers often encounter missing quantitative measurements, which are particularly frequent in metabolomics datasets. Metabolomics, the comprehensive profiling of metabolite abundances, typically relies on mass spectrometry technologies that often introduce missingness via multiple mechanisms: (1) the metabolite signal may be smaller than the instrument limit of detection; (2) the conditions under which the data are collected and processed may lead to missing values; (3) missing values can be introduced randomly. Missingness resulting from mechanism (1) would be classified as Missing Not At Random (MNAR), that from mechanism (2) as Missing At Random (MAR), and that from mechanism (3) as Missing Completely At Random (MCAR). Two common approaches for handling missing data are to (1) omit missing data from the analysis or (2) impute the missing values. Both approaches may introduce bias and reduce statistical power in downstream analyses such as testing metabolite associations with clinical variables. Further, standard imputation methods in metabolomics often ignore the mechanisms causing missingness and inaccurately estimate missing values within a data set. We propose a mechanism-aware imputation algorithm that uses a two-step approach to impute missing values. First, we use a random forest classifier to classify the missing mechanism for each missing value in the data set. Second, we impute each missing value using imputation algorithms that are specific to the predicted missingness mechanism (i.e., MAR/MCAR or MNAR). Using complete data, we conducted simulations in which we imposed different missingness patterns on the data and tested the performance of combinations of imputation algorithms. Our proposed algorithm provided imputations closer to the original data than those obtained using only one imputation algorithm for all the missing values. Consequently, our two-step approach was able to reduce bias for improved downstream analyses.
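The two-step structure described above (classify the mechanism, then impute with a mechanism-specific method) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract specifies a random forest classifier; here a simple median-based rule stands in for it so the example stays dependency-free, and it labels all missing cells in a column the same way rather than one by one. The imputers (half of the observed minimum for MNAR, column mean for MAR/MCAR) are common simple choices used for illustration and are not taken from the source. All function names are hypothetical.

```python
from statistics import mean, median

MNAR, MAR_MCAR = "MNAR", "MAR/MCAR"


def observed(data, j):
    """Observed (non-missing) values of metabolite column j."""
    return [row[j] for row in data if row[j] is not None]


def classify_missing(data):
    """Step 1 (stand-in for the random forest classifier).

    Assumption: a metabolite whose observed median lies below the median
    of all observed values is treated as low-abundance, so its missing
    cells are labeled MNAR (likely below the limit of detection).
    All other missing cells are labeled MAR/MCAR.
    Requires every column to have at least one observed value.
    """
    cutoff = median(v for row in data for v in row if v is not None)
    labels = {}
    for j in range(len(data[0])):
        label = MNAR if median(observed(data, j)) < cutoff else MAR_MCAR
        for i, row in enumerate(data):
            if row[j] is None:
                labels[(i, j)] = label
    return labels


def impute(data, labels):
    """Step 2: impute each missing cell with a mechanism-specific rule.

    MNAR     -> half of the column's observed minimum (illustrative choice)
    MAR/MCAR -> the column's observed mean (illustrative choice)
    Returns a new table; the input is left unchanged.
    """
    filled = [list(row) for row in data]
    for (i, j), label in labels.items():
        obs = observed(data, j)
        filled[i][j] = min(obs) / 2 if label == MNAR else mean(obs)
    return filled


# Rows are samples, columns are metabolites; None marks a missing value.
example = [
    [1.0, 10.0],
    [None, 12.0],
    [2.0, None],
    [3.0, 14.0],
]
example_labels = classify_missing(example)
example_filled = impute(example, example_labels)
print(example_labels)
print(example_filled)
```

In a fuller version, step 1 would be replaced by a trained classifier that predicts a label for each missing cell, and each branch of step 2 by whichever imputation algorithm performs best for that mechanism.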