Yu-Jyun Huang scite author profile

et al. 2021

Motivation: Facilitated by technological advances and decreasing costs, it is now feasible to gather subject data from several omics platforms. Each platform assesses different molecular events, and the challenge lies in efficiently analyzing these data to discover novel disease genes or mechanisms. A common strategy is to regress the outcomes on all omics variables in a gene set. However, this approach suffers from problems associated with high-dimensional inference. Results: We introduce a tensor-based framework for variable-wise inference in multi-omics analysis. By accounting for the matrix structure of an individual’s multi-omics data, the proposed tensor methods incorporate the relationship among omics effects, reduce the number of parameters, and boost the modeling efficiency. We derive the variable-specific tensor test and enhance the computational efficiency of tensor modeling. Using simulations and data applications on the Cancer Cell Line Encyclopedia (CCLE), we demonstrate that our method performs favorably compared with baseline methods and will be useful for gaining biological insights in multi-omics analysis. Availability and Implementation: An R function and instructions are available from the authors’ website: https://www4.stat.ncsu.edu/∼jytzeng/Software/TR.omics/TRinstruction.pdf Supplementary information: Supplementary materials are available at Bioinformatics online.

Application of graphical lasso in estimating network structure in gene set

Hsiao

2020

Ann Transl Med

Probabilistic edge inference of gene networks with markov random field-based bayesian learning

Mukherjee

Hsiao

2022

Front. Genet.

Current algorithms for gene regulatory network construction based on Gaussian graphical models focus on the deterministic decision of whether an edge exists. Both the probabilistic inference of edge existence and the relative strength of edges are often overlooked, either because the computational algorithms cannot account for this uncertainty or because it is not straightforward to implement. In this study, we combine the Bayesian Markov random field and the conditional autoregressive (CAR) model to tackle these two tasks simultaneously. The uncertainty of edge existence and the relative strength of edges can be measured and quantified based on a Bayesian model such as the CAR model and the spike-and-slab lasso prior. In addition, the strength of the edges can be used to prioritize the edges in a network graph. Simulations and a glioblastoma cancer study were carried out to assess the proposed model’s performance and to compare it with existing methods when a binary decision is of interest. The proposed approach shows stable performance and may provide novel structures with biological insights.

The misuse of distributional assumptions in functional class scoring gene-set and pathway analysis

Lai

et al. 2021

Gene-set analysis (GSA) is a standard procedure for exploring potential biological functions of a group of genes. The development of its methodology has been an active research topic in recent decades. Many GSA methods, when newly proposed, rely on simulation studies to evaluate their performance with an implicit assumption that the multivariate expression values are normally distributed. This assumption is commonly adopted in GSAs, particularly those in the group of functional class scoring (FCS) methods. The validity of the normality assumption, however, has been disputed in several studies, yet no systematic analysis has been carried out to assess the effect of this distributional assumption. Our goal in this study is not to propose a new GSA method but to first examine whether the multi-dimensional gene expression data in gene sets follow a multivariate normal (MVN) distribution. Six statistical methods in three categories of MVN tests were considered and applied to a total of twenty-four RNA data sets. These RNA values were collected from cancer patients as well as normal subjects, and the values were derived from microarray experiments, RNA sequencing, and single-cell RNA sequencing. Our first finding is that the MVN assumption is not always satisfied: it does not hold in many of the applications tested here. In the second part of this research, we evaluated the influence of non-normality on the statistical power of current FCS methods, both parametric and non-parametric ones. Specifically, the scenario of mixture distributions representing more than one population for the RNA values was considered. This second investigation demonstrates that a non-normal distribution of the RNA values causes a loss in the statistical power of these GSA tests, especially when subtypes exist. Among the FCS GSA tools examined here and among the scenarios studied in this research, the N-statistics outperform the others.
Based on the results from these two investigations, we conclude that the assumption of MVN should be used with caution when evaluating new GSA tools, since this assumption cannot be guaranteed and violation may lead to spurious results, loss of power, and incorrect comparison between methods. If a newly proposed GSA tool is to be evaluated, we recommend the incorporation of a wide range of multivariate non-normal distributions or sampling from large databases if available.
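To make the idea of checking multivariate normality concrete, here is a minimal sketch. The abstract does not name the six MVN tests it used, so the choice of Mardia's multivariate kurtosis here is an assumption made purely for illustration, as are the simulated two-dimensional data, the sample size, and the function name `mardia_kurtosis_2d`. Under an MVN distribution in p dimensions, this statistic is expected to be close to p(p + 2); a two-population mixture, like the subtype scenario described above, moves it away from that value.

```python
import random


def mardia_kurtosis_2d(points):
    """Mardia's multivariate kurtosis for 2-D data (hypothetical helper).

    It is the average of the squared Mahalanobis distances, squared again.
    Under bivariate normality the expected value is about p(p + 2) = 8.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Maximum-likelihood covariance estimates (divide by n).
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    det = sxx * syy - sxy ** 2
    # Closed-form inverse of the 2x2 covariance matrix.
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    total = 0.0
    for x, y in points:
        dx, dy = x - mx, y - my
        d2 = ixx * dx * dx + 2 * ixy * dx * dy + iyy * dy * dy
        total += d2 ** 2
    return total / n


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 2000

# Simulated "expression" of two genes from a single normal population.
single = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]

# Simulated mixture of two subtypes with shifted means (assumed scenario).
mixture = []
for _ in range(n):
    shift = 3.0 if rng.random() < 0.5 else -3.0
    mixture.append((rng.gauss(shift, 1), rng.gauss(shift, 1)))

expected = 2 * (2 + 2)  # p(p + 2) with p = 2
k_single = mardia_kurtosis_2d(single)
k_mixture = mardia_kurtosis_2d(mixture)
print(f"expected under MVN: {expected}")
print(f"single population:  {k_single:.2f}")
print(f"two-subtype mixture: {k_mixture:.2f}")
```

A formal test would turn the statistic into a p-value; the sketch only compares it with its expected value, which is enough to show that well-separated subtypes pull the data away from multivariate normality.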

Probabilistic Edge Inference of Gene Networks with Bayesian Markov Random Field Modelling

Mukherjee

Hsiao

2022

Preprint

Gaussian graphical models (GGMs), also known as Gaussian Markov random field (MRF) models, are commonly used for gene regulatory network construction. Most current approaches to estimating network structure via GGMs fall into two categories: a binary decision that determines whether an edge exists through penalized optimization, and a probabilistic approach that incorporates graph uncertainty. Analyses in the first category usually adopt the perspective of variable (edge) selection without consideration of probabilistic interpretation. Methods in the second group, particularly the Bayesian approach, often quantify the uncertainty in the network structure with a stochastic measure on the precision matrix. Nevertheless, these methods overlook the existence probability of an edge and its strength related to the dependence between nodes. This study simultaneously investigates the existence and intensity of edges for network structure learning. We propose a method that combines the Bayesian MRF model and the conditional autoregressive model for the relationship between gene nodes. This analysis can evaluate the relative strength of the edges and further prioritize the edges of interest. Simulations and a glioblastoma cancer study were carried out to assess the performance of our proposed models and compare them with existing methods. The proposed approach shows stable performance and may identify novel structures with biological insights.
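As background for the binary edge decision described above, the sketch below shows how a GGM reads edges off the precision matrix (the inverse covariance). It is not the authors' Bayesian method: it simply inverts a small covariance matrix, converts the result to partial correlations, and applies a threshold. The three-gene covariance matrix, the threshold value, and the helper names `invert` and `partial_correlations` are all assumptions chosen for illustration.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination (pure Python)."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I] -> [I | A^-1].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(cov):
    """Partial correlation of i and j given the rest: -w_ij / sqrt(w_ii * w_jj)."""
    omega = invert(cov)  # precision matrix
    n = len(cov)
    return [[1.0 if i == j
             else -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
             for j in range(n)] for i in range(n)]


# Hypothetical covariance of three genes in which gene 0 and gene 2
# are related only through gene 1.
cov = [[1.0, 0.5, 0.25],
       [0.5, 1.25, 0.625],
       [0.25, 0.625, 1.3125]]

pcor = partial_correlations(cov)
marginal_02 = cov[0][2] / math.sqrt(cov[0][0] * cov[2][2])

threshold = 0.05  # arbitrary cut-off for the binary edge decision
edges = [(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(pcor[i][j]) > threshold]

print("marginal correlation of genes 0 and 2:", round(marginal_02, 3))
print("partial correlation of genes 0 and 2: ", round(abs(pcor[0][2]), 3))
print("edges kept:", edges)
```

Genes 0 and 2 are marginally correlated, yet their partial correlation is zero, so the graph has no edge between them. A hard threshold like this gives only a yes-or-no answer per edge, which is the limitation the abstract raises when it argues for quantifying edge probability and strength.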