Model-based clustering consists of fitting a mixture model to data and identifying each cluster with one of its components. Multivariate normal distributions are typically used. The number of clusters is usually determined from the data, often using BIC. In practice, however, individual clusters can be poorly fitted by Gaussian distributions, and in that case model-based clustering tends to represent one non-Gaussian cluster by a mixture of two or more Gaussian distributions. If the number of mixture components is interpreted as the number of clusters, this can lead to overestimation of the number of clusters. This is because BIC selects the number of mixture components needed to provide a good approximation to the density, rather than the number of clusters as such. We propose first selecting the total number of Gaussian mixture components, K, using BIC and then combining them hierarchically according to an entropy criterion. This yields a unique soft clustering for each number of clusters less than or equal to K. These clusterings can be compared on substantive grounds, and we also describe an automatic way of selecting the number of clusters via a piecewise linear regression fit to the rescaled entropy plot. We illustrate the method with simulated data and a flow cytometry dataset. Supplemental materials are available on the journal web site and described at the end of the article.
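The merging step described above can be illustrated with a minimal sketch. It assumes the fitted mixture is available only as an n-by-K matrix of posterior membership probabilities (called `tau` here, a hypothetical name), and that at each step the pair of components whose union gives the lowest total entropy is combined. The function names are illustrative and are not taken from the article or its software.

```python
import numpy as np

def entropy(tau, eps=1e-12):
    """Total classification entropy of an (n, K) matrix of posterior probabilities."""
    # Clip only inside the log so that 0 * log(0) is treated as 0.
    return float(-(tau * np.log(np.clip(tau, eps, 1.0))).sum())

def merge_once(tau):
    """Combine the two columns whose union yields the lowest entropy."""
    k = tau.shape[1]
    best = None
    for a in range(k):
        for b in range(a + 1, k):
            merged = np.delete(tau, b, axis=1)   # drop column b (b > a, so a keeps its index)
            merged[:, a] = tau[:, a] + tau[:, b]  # soft membership of the combined cluster
            e = entropy(merged)
            if best is None or e < best[0]:
                best = (e, merged, (a, b))
    return best[1], best[2]

def merge_hierarchy(tau):
    """Return one soft clustering for every number of clusters from K down to 1."""
    solutions = {tau.shape[1]: tau}
    while tau.shape[1] > 1:
        tau, _ = merge_once(tau)
        solutions[tau.shape[1]] = tau
    return solutions

# Toy posterior matrix (assumed values): components 0 and 1 overlap, component 2 is separate.
tau = np.array([[0.5, 0.5, 0.0],
                [0.6, 0.4, 0.0],
                [0.0, 0.0, 1.0],
                [0.0, 0.0, 1.0]])
solutions = merge_hierarchy(tau)
entropies = {k: entropy(t) for k, t in solutions.items()}
```

Plotting `entropies` against the number of clusters gives the kind of entropy plot the abstract refers to; the rescaling and the piecewise linear fit used for automatic selection are not reproduced here.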
The capability of flow cytometry to offer rapid quantification of multidimensional characteristics for millions of cells has made this technology indispensable for health research, medical diagnosis, and treatment. However, the lack of statistical and bioinformatics tools to parallel recent high-throughput technological advancements has hindered this technology from reaching its full potential. We propose a flexible statistical model-based clustering approach for identifying cell populations in flow cytometry data based on t-mixture models with a Box-Cox transformation. This approach generalizes the popular Gaussian mixture models to account for outliers and allow for nonelliptical clusters. We describe an Expectation-Maximization (EM) algorithm to simultaneously handle parameter estimation and transformation selection. Using two publicly available datasets, we demonstrate that our proposed methodology provides enough flexibility and robustness to mimic manual gating results performed by an expert researcher. In addition, we present results from a simulation study, which show that this new clustering framework gives better results in terms of robustness to model misspecification and estimation of the number of clusters, compared to the popular mixture models. The proposed clustering methodology is well adapted to automated analysis of flow cytometry data. It tends to give more reproducible results, and helps reduce the significant subjectivity and human time cost encountered in manual gating analysis.
Background: As a high-throughput technology that offers rapid quantification of multidimensional characteristics for millions of cells, flow cytometry (FCM) is widely used in health research, medical diagnosis and treatment, and vaccine development. Nevertheless, there is an increasing concern about the lack of appropriate software tools to provide an automated analysis platform to parallelize the high-throughput data-generation platform. Currently, to a large extent, FCM data analysis relies on the manual selection of sequential regions in 2-D graphical projections to extract the cell populations of interest. This is a time-consuming task that ignores the high-dimensionality of FCM data.
The inference of regulatory and biochemical networks from large-scale genomics data is a basic problem in molecular biology. The goal is to generate testable hypotheses of gene-to-gene influences and subsequently to design bench experiments to confirm these network predictions. Coexpression of genes in large-scale gene-expression data implies coregulation and potential gene-gene interactions, but provides little information about the direction of influences. Here, we use both time-series data and genetics data to infer directionality of edges in regulatory networks: time-series data contain information about the chronological order of regulatory events and genetics data allow us to map DNA variations to variations at the RNA level. We generate microarray data measuring time-dependent gene-expression levels in 95 genotyped yeast segregants subjected to a drug perturbation. We develop a Bayesian model averaging regression algorithm that incorporates external information from diverse data types to infer regulatory networks from the time-series and genetics data. Our algorithm is capable of generating feedback loops. We show that our inferred network recovers existing and novel regulatory relationships. Following network construction, we generate independent microarray data on selected deletion mutants to prospectively test network predictions. We demonstrate the potential of our network to discover de novo transcription-factor binding sites. Applying our construction method to previously published data demonstrates that our method is competitive with leading network construction algorithms in the literature.
Large-scale sequencing has provided a wealth of data on the presence, absence, and variation of genes within and between species. However, functional annotation is unavailable for many genes and the majority of genes within most species are not placed within regulatory or biochemical pathways.
Classic biochemical methods for placing genes in pathways cannot keep pace with the rapidly increasing amount of genomic information. To address this problem, we and others have been developing methods to infer networks from large-scale functional genomics data (1-5). The overall goals of such methods are to generate predictions of systems behavior and testable hypotheses of gene-to-gene influences. Predictions of systems behavior can be useful even in the absence of detailed mechanistic understanding. For example, the predicted response to the inhibition of a given gene can guide the selection of drug targets (6). The generation of testable hypotheses provides a path to more rapidly gain mechanistic understanding as it focuses bench experiments on subsets of potential gene-to-gene influences. Moreover, network construction and experimental work can be used in an iterative process to converge on underlying mechanisms (7,8). At present, the data most used in network construction methods are from large-scale gene-expression studies. Coexpression of genes across a wide variety of experimental conditions implies coregulation (9, ...
Background: The triglyceride-glucose (TyG) index could serve as a convenient substitute for insulin resistance (IR), but epidemiological evidence on its relationship with the long-term risk of mortality is limited. Methods: Participants from the National Health and Nutrition Examination Survey during 1999–2014 were grouped according to TyG index (<8, 8–9, 9–10, >10). Cox regression was conducted to compute the hazard ratios (HRs) and 95% confidence intervals (CIs). Restricted cubic spline and piecewise linear regression were performed to detect the shape of the relationship between TyG index and mortality. Results: A total of 19,420 participants (48.9% men) were included. On average, participants were followed up for 98.2 months, and 2,238 (11.5%) and 445 (2.3%) deaths due to all causes and cardiovascular disease, respectively, were observed. After adjusting for confounders, TyG index was independently associated with an elevated risk of all-cause (HR, 1.10; 95% CI, 1.00–1.20) and cardiovascular death (HR, 1.29; 95% CI, 1.05–1.57). Spline analyses showed that the relationship of TyG index with mortality was non-linear (all non-linear P < 0.001), and the threshold values were 9.36 for all-cause and 9.52 for cardiovascular death, respectively. The HRs above the threshold point were 1.50 (95% CI, 1.29–1.75) and 2.35 (95% CI, 1.73–3.19) for all-cause and cardiovascular death. No significant association was found below the threshold points (all P > 0.05). Conclusion: Elevated TyG index reflected a more severe IR and was associated with mortality due to all-cause and cardiovascular disease in a non-linear manner.
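For readers unfamiliar with the index, the short sketch below shows how a TyG value and its group could be computed. The abstract does not state the formula; the code assumes the commonly used definition, ln(fasting triglycerides [mg/dL] × fasting glucose [mg/dL] / 2). How the study assigned values falling exactly on a group boundary is not given, so the boundary handling here is also an assumption, and the function names are hypothetical.

```python
import math

def tyg_index(triglycerides_mg_dl: float, glucose_mg_dl: float) -> float:
    """TyG index under the commonly used definition (assumed, not stated in the abstract)."""
    return math.log(triglycerides_mg_dl * glucose_mg_dl / 2.0)

def tyg_group(tyg: float) -> str:
    """Assign one of the four groups named in the abstract.

    Boundary handling (which side 8, 9, and 10 fall on) is an assumption.
    """
    if tyg < 8:
        return "<8"
    if tyg < 9:
        return "8-9"
    if tyg <= 10:
        return "9-10"
    return ">10"

# Illustrative inputs only, not data from the study.
example = tyg_index(150.0, 100.0)
example_group = tyg_group(example)
```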
The supplementary material is available at http://www.stat.ubc.ca/~c.lo/FEBarrays/supp.pdf.
Background: Inference about regulatory networks from high-throughput genomics data is of great interest in systems biology. We present a Bayesian approach to infer gene regulatory networks from time series expression data by integrating various types of biological knowledge. Results: We formulate network construction as a series of variable selection problems and use linear regression to model the data. Our method summarizes additional data sources with an informative prior probability distribution over candidate regression models. We extend the Bayesian model averaging (BMA) variable selection method to select regulators in the regression framework. We summarize the external biological knowledge by an informative prior probability distribution over the candidate regression models. Conclusions: We demonstrate our method on simulated data and a set of time-series microarray experiments measuring the effect of a drug perturbation on gene expression levels, and show that it outperforms leading regression-based methods in the literature.
Context: Although the role of iron in the development of type 2 diabetes (T2D) has long been a concern, prospective studies directly linking body iron stores to T2D risk in a sex-dependent context have been inconsistent. Objective: A systematic meta-analysis was conducted to explore the sex-specific association of circulating ferritin with T2D risk. Data Sources: We searched PubMed, Web of Science, and EMBASE databases to identify available prospective studies through 1 August 2018. Results: Fifteen prospective studies comprising 77,352 participants and 18,404 patients with T2D, aged 20 to 80 years, and with ∼3 to 17 years of follow-up were identified. For each 100-μg/L increment in ferritin levels of overall participants, T2D risk increased by 22% (RR, 1.22; 95% CI, 1.14 to 1.31). Of note, major heterogeneities by sex were identified, with increased ferritin level having an apparently greater effect on T2D risk in women (RR, 1.53; 95% CI, 1.29 to 1.82) than in men (RR, 1.21; 95% CI, 1.15 to 1.27) after exclusion of a study with high heterogeneity (41,512 men and 6974 women for sex-specific analyses; P = 0.020 for sex difference). Further nonlinear analysis between circulating ferritin and T2D risk also showed sex-dimorphic association in that the T2D risk of women was twice as strong in magnitude as that of men at the same ferritin level. Conclusions: Greater circulating ferritin levels were independently associated with increased T2D risk, which appeared stronger among women than men. Our findings provide prospective evidence for further testing of the utility of ferritin levels in predicting T2D risk in a sex-specific manner.