Clustering is a challenging problem in unsupervised learning. In lieu of a gold standard, stability has become a valuable surrogate for performance and robustness. In this work, we propose a non-parametric bootstrapping approach to estimating the stability of a clustering method, which also captures the stability of the individual clusters and observations. This flexible framework enables different types of comparisons between clusterings and can be used in connection with two possible bootstrap approaches for stability. The first approach, scheme 1, can be used to assess confidence (stability) around the clustering from the original dataset based on bootstrap replications. A second approach, scheme 2, searches over the bootstrap clusterings for an optimally stable partitioning of the data. The two schemes accommodate different model assumptions that can be motivated by an investigator's trust (or lack thereof) in the original data and additional computational considerations. We propose a hierarchical visualization extrapolated from the stability profiles that gives insights into the separation of groups, and projected visualizations for the inspection of the stability of individual observations. Our approaches show good performance in simulation and on real data. These approaches can be implemented using the R package bootcluster that is available on the Comprehensive R Archive Network (CRAN).
Background: The metabolome is a collection of exogenous chemicals and metabolites from cellular processes that may reflect the body’s response to environmental exposures. Studies of air pollution and metabolomics are limited. Objectives: To explore changes in the human metabolome before, during, and after the 2008 Beijing Olympic Games, when air pollution was high, low, and high, respectively. Methods: Serum samples were collected before, during, and after the Olympics from 26 participants in an existing panel study. Gas and ultra-high performance liquid chromatography/mass spectrometry were used in metabolomics analysis. Repeated measures ANOVA, network analysis, and enrichment analysis methods were employed to identify metabolites and classes associated with air pollution changes. Results: A total of 886 molecules were measured in our metabolomics analysis. Network partitioning identified four modules with 65 known metabolites that significantly changed across the three time points. All known molecules in the first module (n=33) were lipids (e.g., eicosapentaenoic acid, stearic acid). The second module consisted primarily of dipeptides (n=24, e.g., isoleucylglycine) plus 8 metabolites from four other classes (e.g., hypoxanthine, 12-hydroxyeicosatetraenoic acid). Most of the metabolites in Modules 3 (19 of 23) and 4 (5 of 5) were unknown. Enrichment analysis of module-identified metabolites indicated significantly overrepresented pathways, including long- and medium-chain fatty acids, polyunsaturated fatty acids (n3 and n6), eicosanoids, lysolipid, dipeptides, fatty acid metabolism, and purine metabolism [(hypo)xanthine/inosine–containing pathways]. Conclusions: We identified two major metabolic signatures: one consisting of lipids, and a second that included dipeptides, polyunsaturated fatty acids, taurine, and xanthine. Metabolites in both groups decreased during the 2008 Beijing Olympics, when air pollution was low, and increased after the Olympics, when air pollution returned to normal (high) levels. https://doi.org/10.1289/EHP3705
The microbiome influences health and disease through complex networks of host genetics, genomics, microbes, and environment. Identifying the mechanisms of these interactions has remained challenging. Systems genetics in laboratory mice (Mus musculus) enables data-driven discovery of biological network components and mechanisms of host–microbial interactions underlying disease phenotypes. To examine the interplay among the whole host genome, transcriptome, and microbiome, we mapped QTL and correlated the abundance of cecal messenger RNA, luminal microflora, physiology, and behavior in a highly diverse Collaborative Cross breeding population. One such relationship, regulated by a variant on chromosome 7, was the association of Odoribacter (Bacteroidales) abundance and sleep phenotypes. In a test of this association in the BKS.Cg-Dock7m +/+ Leprdb/J mouse model of obesity and diabetes, known to have abnormal sleep and colonization by Odoribacter, treatment with antibiotics altered sleep in a genotype-dependent fashion. The many other relationships extracted from this study can be used to interrogate other diseases, microbes, and mechanisms.
Telomere length is a heritable marker of cellular age that is associated with morbidity and mortality. Poor sleep behaviors, which are also associated with adverse health events, may be related to leukocyte telomere length (LTL). We studied a subpopulation of 3,145 postmenopausal women (1,796 European-American (EA) and 1,349 African-American (AA)) enrolled in the Women’s Health Initiative in 1993–1998 with data on Southern blot-measured LTL and self-reported usual sleep duration and sleep disturbance. LTL-sleep associations were analyzed separately for duration and disturbance using weighted and confounder-adjusted linear regression models in the entire sample (AAs + EAs; adjusted for race/ethnicity) and in racial/ethnic strata, since LTL differs by ancestry. After adjustment for covariates, each additional daily hour of sleep beyond 5 hours, approximately, was associated with a 27-base-pair (95% confidence interval (CI): 6, 48) longer LTL in the entire sample. Associations between sleep duration and LTL were strongest among AAs (adjusted β = 37, 95% CI: 4, 70); a similar, nonsignificant association was observed for EAs (adjusted β = 20, 95% CI: −7, 48). Sleep disturbance was not associated with LTL in our study. Our models did not show departure from linearity (quadratic sleep terms: P ≥ 0.55). Our results suggest that longer sleep duration is associated with longer LTL in postmenopausal women.
Graphs can be used to represent the direct and indirect relationships between variables, and elucidate complex relationships and interdependencies. Detecting structure within a graph is a challenging problem. This problem is studied over a range of fields and is sometimes termed community detection, module detection, or graph partitioning. A popular class of algorithms for module detection relies on optimizing a function of modularity to identify the structure. In practice, graphs are often learned from the data, and thus prone to uncertainty. In these settings, the uncertainty in the network structure can be amplified, yielding unreliable estimates of the module structure. In this work, we begin to address this challenge through the use of a nonparametric bootstrap approach to assessing the stability of module detection in a graph. Estimates of stability are presented at the level of the individual node, the inferred modules, and as an overall measure of performance for module detection in a given graph. Furthermore, bootstrap stability estimates are derived for complexity parameter selection that ultimately defines a graph from data in a way that optimizes stability. This approach is utilized in connection with correlation graphs but is generalizable to other graphs that are defined through the use of dissimilarity measures. We demonstrate our approach using a broad range of simulations and on a metabolomics dataset from the Beijing Olympics Air Pollution study. These approaches are implemented using the bootcluster package that is available in the R programming language.
Background: The CBR3 V244M single nucleotide polymorphism has been linked to the risk of anthracycline-related cardiomyopathy in survivors of childhood cancer. There have been limited prospective studies examining the impact of CBR3 V244M on the risk for anthracycline-related cardiotoxicity in adult cohorts. Objectives: This study evaluated the presence of associations between CBR3 V244M genotype status and changes in echocardiographic parameters in breast cancer patients undergoing doxorubicin treatment. Methods: We recruited 155 patients with breast cancer receiving treatment with doxorubicin (DOX) at Roswell Park Comprehensive Cancer Center (Buffalo, NY) to a prospective single arm observational pharmacogenetic study. Patients were genotyped for the CBR3 V244M variant. Ninety-two patients received an echocardiogram at baseline (t0 month) and at 6 months (t6 months) of follow up after DOX treatment. Apical two-chamber and four-chamber echocardiographic images were used to calculate volumes and left ventricular ejection fraction (LVEF) using Simpson’s biplane rule by investigators blinded to all patient data. Volumetric indices were evaluated by normalizing the cardiac volumes to the body surface area (BSA). Results: Breast cancer patients with CBR3 GG and AG genotypes both experienced a statistically significant reduction in LVEF at 6 months following initiation of DOX treatment for breast cancer compared with their pre-DOX baseline study. Patients homozygous for the CBR3 V244M G allele (CBR3 V244) exhibited a further statistically significant decrease in LVEF at 6 months following DOX therapy in comparison with patients with heterozygous AG genotype. We found no differences in age, pre-existing cardiac diseases associated with myocardial injury, cumulative DOX dose, or concurrent use of cardioprotective medication between CBR3 genotype groups. Conclusions: CBR3 V244M genotype status is associated with changes in echocardiographic parameters suggestive of early anthracycline-related cardiomyopathy in subjects undergoing chemotherapy for breast cancer.
Cluster analysis remains one of the most challenging yet fundamental tasks in unsupervised learning. This is due in part to the fact that there are no labels or gold standards by which performance can be measured. Moreover, the wide range of clustering methods available is governed by different objective functions, different parameters, and dissimilarity measures. The purpose of clustering is versatile, often playing critical roles in the early stages of exploratory data analysis and as an endpoint for knowledge and discovery. Thus, understanding the quality of a clustering is of critical importance. The concept of stability has emerged as a strategy for assessing the performance and reproducibility of data clustering. The key idea is to produce perturbed data sets that are very close to the original, and cluster them. If the clustering is stable, then the clusters from the original data will be preserved in the perturbed data clustering. The nature of the perturbation, and the methods for quantifying similarity between clusterings, are nontrivial, and ultimately what sets many of the stability estimation methods apart. In this review, we provide an overview of the very active research area of cluster stability estimation and discuss some of the open questions and challenges that remain in the field. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification
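The perturb-and-recluster idea described in the review abstract above can be illustrated with a small sketch. This is only a generic illustration under stated assumptions: it is not the method of any paper listed here and it does not use the bootcluster API. The choices of a toy 1-D k-means, bootstrap resampling as the perturbation, and the Rand index as the similarity measure are all assumptions made for the example, and every function name is hypothetical.

```python
import random

def kmeans_1d(xs, k, iters=50):
    """Toy 1-D Lloyd's k-means with deterministic quantile initialisation."""
    s = sorted(xs)
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def assign(xs, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: abs(x - centers[c]))
            for x in xs]

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree."""
    n = len(a)
    agree = 0
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            agree += (a[i] == a[j]) == (b[i] == b[j])
    return agree / total

def bootstrap_stability(xs, k, n_boot=20, seed=0):
    """Average agreement between the original clustering and clusterings
    fitted on bootstrap resamples (assumed perturbation scheme)."""
    rng = random.Random(seed)
    reference = assign(xs, kmeans_1d(xs, k))
    scores = []
    for _ in range(n_boot):
        sample = [rng.choice(xs) for _ in xs]   # resample with replacement
        centers = kmeans_1d(sample, k)          # cluster the perturbed data
        # Map the original points onto the bootstrap clusters and compare.
        scores.append(rand_index(reference, assign(xs, centers)))
    return sum(scores) / len(scores)

# Two well-separated groups: k = 2 should be highly stable.
data = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]
print(bootstrap_stability(data, k=2))
```

A score near 1 suggests that the partition of the original data is largely reproduced under resampling, while lower scores suggest the clustering is sensitive to the perturbation. Other perturbations (subsampling, added noise) and other similarity measures (for example, Jaccard-based ones) could be substituted without changing the overall structure.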