A better understanding of disease progression is beneficial for early diagnosis and appropriate individual therapy. Many different approaches to statistical modelling of cumulative disease progression have been proposed in the literature, ranging from simple path models to complex restricted Bayesian networks. Important fields of application are diseases such as cancer and HIV. Tumour progression is measured by means of chromosome aberrations, whereas people infected with HIV develop drug resistance because of genetic changes in the virus. These two very different diseases have typical courses of disease progression, which can be modelled partly by consecutive and partly by independent steps. This paper gives an overview of the different progression models and points out their advantages and drawbacks. In a simulation study, we compare the models to analyse how they behave when some of their assumptions are violated, and we evaluate how well they fit the induced multivariate probability distributions and topological relationships. We often find that the true model class used for generating data is outperformed by either a less or a more complex model class. The more flexible conjunctive Bayesian networks can be used to fit oncogenetic trees, whereas mixtures of oncogenetic trees with three tree components can be fitted well by mixture models with only two tree components.
Background: Disease progression models are important for understanding the critical steps during the development of diseases. The models are embedded in a statistical framework to deal with random variations due to biology and the sampling process when observing only a finite population. Conditional probabilities are used to describe dependencies between events that characterise the critical steps in the disease process. Many different model classes have been proposed in the literature, from simple path models to complex Bayesian networks. Oncogenetic trees are a popular model class that is easy to understand yet flexible. These have been applied to describe the accumulation of genetic aberrations in cancer and HIV data. However, the number of potentially relevant aberrations is often far larger than the maximal number of events that can be used for reliably estimating the progression models. Still, there are only a few approaches to variable selection, and these have not yet been investigated in detail.

Results: We fill this gap and propose ten variable selection methods specifically for oncogenetic trees, some of them completely new. We compare them in an extensive simulation study and on real data from cancer and HIV. It turns out that the preselection of events by clique identification algorithms performs best. Here, events are selected if they belong to the largest or the maximum weight subgraph in which all pairs of vertices are connected (a clique).

Conclusions: The variable selection method of identifying cliques finds both the important frequent events and those related to disease pathways.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1762-1) contains supplementary material, which is available to authorized users.
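The clique-based preselection described in the Results can be illustrated with a minimal sketch. It assumes the event graph is already given: the abstract does not say how edges between events or vertex weights are derived, so the `edges` list, the `weights` dictionary, and the function names below are hypothetical. The sketch enumerates maximal cliques with the Bron-Kerbosch algorithm with pivoting (a standard technique, not necessarily the one used in the paper) and returns both the largest clique and the maximum weight clique, assuming non-negative weights.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph.

    adj maps each vertex to the set of its neighbours.
    Uses Bron-Kerbosch with pivoting.
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def select_events(edges, weights):
    """Return (largest clique, maximum weight clique) as frozensets.

    edges   : iterable of (event, event) pairs (hypothetical input)
    weights : dict event -> non-negative weight (hypothetical input)
    """
    adj = {v: set() for v in weights}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = list(maximal_cliques(adj))
    largest = max(cliques, key=len)
    heaviest = max(cliques, key=lambda c: sum(weights[v] for v in c))
    return largest, heaviest


# Toy example with made-up events and weights.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
weights = {"A": 0.1, "B": 0.1, "C": 0.1, "D": 0.5, "E": 0.6}
largest, heaviest = select_events(edges, weights)
print(sorted(largest))   # ['A', 'B', 'C']
print(sorted(heaviest))  # ['D', 'E']
```

In this toy graph the two criteria select different event sets: the triangle A, B, C is the largest clique, while the pair D, E carries the highest total weight. Restricting the search to maximal cliques is sufficient for the maximum weight clique only because the weights are assumed to be non-negative.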