Motivation: One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous datasets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback–Leibler information divergence and the Yang–Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R² measures for the PH and PO models that can be interpreted in terms of explained randomness. Lastly, we propose a generalized pseudo-R² index that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature measurements.
Results: We evaluate the performance of our measures using extensive simulation studies and publicly available datasets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods.
Availability and implementation: R code for the proposed methods is available at github.com/lburns27/Feature-Selection.
Contact: karthik.devarajan@fccc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
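As a rough illustration of how a single model can contain PH and PO as special cases, the sketch below evaluates the hazard ratio implied by the Yang–Prentice short-term and long-term hazard ratio model in its usual two-parameter form. The function name, argument names and parameter values are illustrative assumptions; this is not the authors' R implementation, and it does not compute their divergence or R² measures.

```python
def yang_prentice_hazard_ratio(s0, theta1, theta2):
    """Hazard ratio at a time point where the baseline survival equals s0.

    Uses the usual statement of the Yang-Prentice model:
        HR = theta1 * theta2 / (theta1 + (theta2 - theta1) * s0)
    theta1 is the short-term hazard ratio (s0 -> 1, i.e. t -> 0) and
    theta2 is the long-term hazard ratio (s0 -> 0, i.e. t -> infinity).
    All names here are hypothetical, chosen for this sketch only.
    """
    if not 0.0 <= s0 <= 1.0:
        raise ValueError("s0 must be a survival probability in [0, 1]")
    if theta1 <= 0 or theta2 <= 0:
        raise ValueError("theta1 and theta2 must be positive")
    return theta1 * theta2 / (theta1 + (theta2 - theta1) * s0)


# Illustrative parameter choices (assumed values, not estimates from data).
scenarios = {
    "PH (theta1 == theta2)": (2.0, 2.0),    # constant hazard ratio
    "PO (theta2 == 1)": (2.0, 1.0),         # ratio decays towards 1
    "crossing hazards": (2.0, 0.5),         # ratio moves from above 1 to below 1
}

for label, (theta1, theta2) in scenarios.items():
    ratios = [
        round(yang_prentice_hazard_ratio(s0, theta1, theta2), 3)
        for s0 in (1.0, 0.75, 0.5, 0.25, 0.0)
    ]
    print(label, ratios)
```

Setting the two parameters equal gives a constant hazard ratio, which is the PH case; fixing the long-term ratio at 1 gives the PO case; and placing the two parameters on opposite sides of 1 gives hazards that cross, which is one of the NPH patterns a univariate Cox screen can misjudge.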
The past two decades have witnessed significant advances in high-throughput "omics" technologies such as genomics, proteomics, metabolomics, transcriptomics and radiomics. These technologies have enabled simultaneous measurement of the expression levels of tens of thousands of features from individual patient samples and have generated enormous amounts of data that require analysis and interpretation. One specific area of interest has been in studying the relationship between these features and patient outcomes, such as overall and recurrence-free survival, with the goal of developing a predictive "omics" profile. Large-scale studies often suffer from the presence of a large fraction of censored observations and potential time-varying effects of features, and methods for handling them have been lacking. In this paper, we propose supervised methods for feature selection and survival prediction that simultaneously deal with both issues. Our approach utilizes continuum power regression (CPR), a framework that includes a variety of regression methods, in conjunction with the parametric or semi-parametric accelerated failure time (AFT) model. Both CPR and AFT fall within the linear models framework and, unlike black-box models, the proposed prognostic index has a simple yet useful interpretation. We demonstrate the utility of our methods using simulated and publicly available cancer genomics data.
Supervised dimension reduction for large-scale "omics" data with censored survival outcomes under possible non-proportional hazards
The past two decades have witnessed significant advances in high-throughput "omics" technologies such as genomics, proteomics, metabolomics, transcriptomics and radiomics. These technologies have enabled the simultaneous measurement of the expression levels of tens of thousands of "omic" features from individual patient samples and have generated enormous amounts of data that require analysis and interpretation. One specific area of interest has been in studying the relationship between these features and patient outcomes, such as overall and recurrence-free survival, with the goal of developing a predictive "omics" profile. In this paper, we propose a supervised dimension reduction method for feature selection and survival prediction. Our approach utilizes continuum power regression (a framework that includes ordinary least squares, principal components regression and partial least squares) in conjunction with the parametric or semi-parametric accelerated failure time model and enables feature selection under possible non-proportional hazards. The proposed approach can handle censored observations using robust Buckley-James estimation in this high-dimensional setting, and the parametric version employs the flexible generalized F model that encompasses a wide spectrum of well-known survival models. We evaluate the predictive performance of our methods via extensive simulation studies and compare it to existing methods using publicly available data sets in cancer genomics.
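For orientation, the accelerated failure time model and the Buckley-James treatment of censoring referred to above can be written in one common textbook notation as follows. The symbols are assumptions introduced here and are not taken from the paper: T_i is the survival time, C_i the censoring time, z_i the feature vector, δ_i the event indicator (1 if the event is observed), and ε_i an error term whose distribution is specified in the parametric version and left unspecified in the semi-parametric version.

```latex
% Accelerated failure time model: linear on the log-time scale
\log T_i = \beta_0 + \boldsymbol{\beta}^{\top} \mathbf{z}_i + \sigma \varepsilon_i

% Buckley-James idea: keep observed log-times, replace censored ones
% by their conditional expectation beyond the censoring time
Y_i^{*} = \delta_i \log T_i
        + (1 - \delta_i)\,
          \mathrm{E}\!\left[ \log T_i \,\middle|\, \log T_i > \log C_i,\ \mathbf{z}_i \right]
```

Because the model is linear in the log of the survival time, the imputed responses Y_i* can in principle be passed to a linear dimension reduction method, which is the kind of role the abstract assigns to continuum power regression.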
One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous data sets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback-Leibler information divergence and the Yang-Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R² indices for the PH and PO models that can be interpreted in terms of explained randomness. Lastly, we propose a generalized pseudo-R² measure that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature expression. We evaluate the performance of our measures using extensive simulation studies and publicly available data sets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods.
The proposed information divergence, R² and pseudo-R² measures were implemented in R (www.R-project.org) and code is available upon request.
Summary: Various statistical methodologies embed a probability distribution in a more flexible family of distributions. The latter is called an elaboration model, which is constructed by choice or by a formal procedure and evaluated by asymmetric measures such as the likelihood ratio and Kullback–Leibler information. The use of asymmetric measures can be problematic for this purpose. This paper introduces two formal procedures, referred to as link functions, that embed any baseline distribution with a continuous density on the real line into model elaborations. Conditions are given under which the link functions render the Kullback–Leibler divergence, the Rényi divergence and the phi-divergence family symmetric. The first link function elaborates quantiles of the baseline probability distribution. This approach produces continuous counterparts of the binary probability models. Examples include the Cauchy, probit, logit, Laplace and Student's t links. The second link function elaborates the baseline survival function. Examples include the proportional odds and change point links. The logistic distribution is characterised as the one that satisfies the conditions for both links. An application demonstrates advantages of symmetric divergence measures for assessing the efficacy of covariates.
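To make the asymmetry mentioned above concrete, the following minimal sketch computes the Kullback–Leibler divergence between two discrete distributions in both directions. It only illustrates that the measure depends on the order of its arguments; it does not implement the link functions of the paper, and the function name and example probabilities are assumptions chosen for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats.

    p and q are sequences of probabilities on the same finite support.
    Terms with p_i == 0 contribute nothing; if q_i == 0 while p_i > 0,
    the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


# Two illustrative distributions on a two-point support (assumed values).
p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)   # divergence of q from p
backward = kl_divergence(q, p)  # divergence of p from q

print("KL(p || q) =", forward)
print("KL(q || p) =", backward)
print("directions differ:", not math.isclose(forward, backward))
```

When the two directions disagree, a ranking of covariates based on the divergence can depend on which distribution is treated as the reference, which is the kind of difficulty a symmetric divergence avoids.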