Algorithm 1: scPCA

Result: Produces a sparse low-dimensional representation of the target data, $X_{n \times p}$, by contrasting the variation of $X_{n \times p}$ and some background data, $Y_{m \times p}$, while applying an $\ell_1$ penalty to the loadings generated by cPCA.

Input:
    target dataset: $X$
    background dataset: $Y$
    binary variable indicating whether to column-scale the data: scale
    vector of possible contrastive parameters: $\gamma = (\gamma_1, \ldots, \gamma_s)$
    vector of possible $\ell_1$ penalty parameters: $\lambda_1 = (\lambda_{1,1}, \ldots, \lambda_{1,d})$
    number of sparse contrastive principal components to compute: $k$
    clustering method: cluster_meth
    number of clusters: ncluster

Center (and scale if so desired) the columns of $X$, $Y$
Calculate the empirical covariance matrices of $X$ and $Y$, $C_X$ and $C_Y$
For each $\gamma_i$:
    Form the contrastive covariance matrix $C_{\gamma_i} = C_X - \gamma_i C_Y$
    Compute the positive-semidefinite approximation of $C_{\gamma_i}$, $\tilde{C}_{\gamma_i}$
    For each $\lambda_{1,j}$:
        Apply SPCA to $\tilde{C}_{\gamma_i}$ for $k$ components with $\ell_1$ penalty $\lambda_{1,j}$
        Generate a low-dimensional representation by projecting $X_{n \times p}$ on the sparse loadings of SPCA
        Normalize the low-dimensional representation produced to be on the unit hypercube
        Cluster the normalized low-dimensional representation using cluster_meth with ncluster
        Compute and record the clustering strength criterion associated with $(\gamma_i, \lambda_{1,j})$
Identify the combination of hyperparameters maximizing the clustering strength criterion: $\gamma^*$, $\lambda_1^*$

Output: The low-dimensional representation of the target data given by $(\gamma^*, \lambda_1^*)$, an $n \times k$ matrix; the $p \times k$ matrix of loadings given by $(\gamma^*, \lambda_1^*)$; contrastive parameter $\gamma^*$; $\ell_1$ penalty parameter $\lambda_1^*$
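The inner steps of Algorithm 1 for a single pair of hyperparameters can be illustrated with a minimal NumPy sketch. This is not the scPCA R package's implementation: the function names are hypothetical, the contrastive covariance is assumed to take the usual cPCA form (target covariance minus γ times background covariance), and soft-thresholding of the leading eigenvectors is swapped in as a simple stand-in for SPCA. The clustering step and the search over the hyperparameter grid are left out.

```python
import numpy as np


def contrastive_cov(X, Y, gamma, scale=False):
    """Center (optionally scale) the columns; return (centered X, C_X - gamma * C_Y)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    if scale:
        Xc = Xc / Xc.std(axis=0, ddof=1)
        Yc = Yc / Yc.std(axis=0, ddof=1)
    C_X = Xc.T @ Xc / X.shape[0]
    C_Y = Yc.T @ Yc / Y.shape[0]
    return Xc, C_X - gamma * C_Y


def psd_approx(C):
    """Positive-semidefinite approximation: set negative eigenvalues to zero."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * np.clip(vals, 0.0, None)) @ vecs.T


def sparse_loadings(C, k, lam):
    """Stand-in for SPCA (assumption): soft-threshold the top-k eigenvectors."""
    vals, vecs = np.linalg.eigh(C)
    V = vecs[:, np.argsort(vals)[::-1][:k]]
    V = np.sign(V) * np.maximum(np.abs(V) - lam, 0.0)
    norms = np.linalg.norm(V, axis=0)
    norms[norms == 0] = 1.0  # leave all-zero columns untouched
    return V / norms


def unit_hypercube(Z):
    """Rescale each column of Z to the interval [0, 1]."""
    span = Z.max(axis=0) - Z.min(axis=0)
    span[span == 0] = 1.0
    return (Z - Z.min(axis=0)) / span


def scpca_embed(X, Y, gamma, lam, k, scale=False):
    """One (gamma, lam) evaluation: normalized representation and loadings."""
    Xc, C = contrastive_cov(X, Y, gamma, scale)
    V = sparse_loadings(psd_approx(C), k, lam)
    return unit_hypercube(Xc @ V), V


# Synthetic data purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))  # target data
Y = rng.normal(size=(40, 8))  # background data
Z, V = scpca_embed(X, Y, gamma=1.0, lam=0.1, k=2)
print(Z.shape, V.shape)
```

In the full algorithm, `scpca_embed` would be called once per pair of candidate parameters, and the normalized representation `Z` would be clustered and scored to pick the best pair.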
Motivation: Statistical analyses of high-throughput sequencing data have re-shaped the biological sciences. In spite of myriad advances, recovering interpretable biological signal from data corrupted by technical noise remains a prevalent open problem. Several classes of procedures, among them classical dimensionality reduction techniques and others incorporating subject-matter knowledge, have provided effective advances; however, no procedure currently satisfies the dual objectives of recovering stable and relevant features simultaneously. Results: Inspired by recent proposals for making use of control data in the removal of unwanted variation, we propose a variant of principal component analysis that extracts sparse, stable, interpretable, and relevant biological signal. The new methodology is compared to competing dimensionality reduction approaches through a simulation study as well as via analyses of several publicly available protein expression, microarray gene expression, and single-cell transcriptome sequencing datasets. Availability: A free and open-source software implementation of the methodology, the scPCA R package, is made available via the Bioconductor Project. Code for all analyses presented in the paper is also made available via GitHub.
Background: Exposure to arsenic affects millions of people globally. Changes in the epigenome may be involved in pathways linking arsenic to health or serve as biomarkers of exposure. This study investigated associations between prenatal and early-life arsenic exposure and epigenetic age acceleration (EAA) in adults, a biomarker of morbidity and mortality. Methods: DNA methylation was measured in PBMCs and buccal cells from 40 adults (median age = 49 years) in Chile with and without high prenatal and early-life arsenic exposure. EAA was calculated using the Horvath, Hannum, PhenoAge, skin and blood, GrimAge, and DNAm telomere length clocks. We evaluated associations between arsenic exposure and EAA using robust linear models. Results: Participants with and without arsenic exposure had median drinking water arsenic concentrations at birth of 555 and 2 μg/L, respectively. In PBMCs, adjusting for sex and smoking, exposure was associated with a six-year PhenoAge acceleration [B (95% CI) = 6.01 (2.60, 9.42)]. After adjusting for cell type composition, we found positive associations with Hannum EAA [B (95% CI) = 3.11 (0.13, 6.10)], skin and blood EAA [B (95% CI) = 1.77 (0.51, 3.03)], and extrinsic EAA [B (95% CI) = 4.90 (1.22, 8.57)]. The association with PhenoAge acceleration in buccal cells was positive but not statistically significant [B (95% CI) = 4.88 (-1.60, 11.36)]. Conclusions: Arsenic exposure limited to early-life stages may be associated with biological aging in adulthood. Future research may provide information on how EAA programmed in early life is related to health.
Background: Arsenic (As) exposure through drinking water is a global public health concern. Epigenetic dysregulation, including changes in DNA methylation (DNAm), may be involved in arsenic toxicity. Epigenome-wide association studies (EWAS) of arsenic exposure have been restricted to single populations, and comparison across EWAS has been limited by methodological differences. Leveraging data from epidemiological studies conducted in Chile and Bangladesh, we use a harmonized data processing and analysis pipeline and meta-analysis to combine results from four EWAS. Methods: DNAm was measured among adults in Chile with and without prenatal and early-life As exposure in PBMCs and buccal cells (N = 40, 850K array) and among men in Bangladesh with high and low As exposure in PBMCs (N = 32, 850K array; N = 48, 450K array). Linear models were used to identify differentially methylated positions (DMPs) and differentially variable positions (DVPs) adjusting for age, smoking, cell type, and sex in the Chile cohort. Probes common across EWAS were meta-analyzed using METAL, and differentially methylated and variable regions (DMRs and DVRs, respectively) were identified using comb-p. KEGG pathway analysis was used to understand biological functions of DMPs and DVPs. Results: In a meta-analysis restricted to PBMCs, we identified one DMP and 23 DVPs associated with arsenic exposure; including buccal cells, we identified 3 DMPs and 19 DVPs (FDR < 0.05). Using meta-analyzed results, we identified 11 DMRs and 11 DVRs in PBMC samples, and 16 DMRs and 19 DVRs in PBMC and buccal cell samples. One region annotated to LRRC27 was identified as both a DMR and a DVR. Arsenic-associated KEGG pathways included lysosome, autophagy, mTOR signaling, AMPK signaling, and one carbon pool by folate.
Conclusions: Using a two-step process of (1) harmonized data processing and analysis and (2) meta-analysis, we leverage four DNAm datasets from two continents of individuals exposed to high levels of As prenatally and during adulthood to identify DMPs and DVPs associated with arsenic exposure. Our approach suggests that standardizing analytical pipelines can aid in identifying biologically meaningful signals.
Adenoid cystic carcinoma (ACC) is the second most common cancer type arising from the salivary gland. The frequent occurrence of the chromosome t(6;9) translocation, leading to the fusion of the MYB and NFIB transcription factor genes, is considered a genetic hallmark of ACC. This inter-chromosomal rearrangement may encode multiple variants of functional MYB-NFIB fusion in ACC. However, the lack of an ACC model that harbors the t(6;9) translocation has limited studies on defining the potential function and implication of chimeric MYB-NFIB protein in ACC. This report aims to establish a MYB-NFIB fusion protein expression system in ACC cells for in vitro and in vivo studies. RNA-seq data from MYB-NFIB translocation-positive ACC patients’ tumors and MYB-NFIB fusion transcripts in ACC patient-derived xenografts (ACCX) were analyzed to identify MYB breakpoints and their frequency of occurrence. Based on the MYB breakpoints identified, variants of the MYB-NFIB fusion expression system were developed in MYB-NFIB-deficient ACC cell lines. Analysis confirmed MYB-NFIB fusion protein expression in ACC cells and ACCXs. Furthermore, recombinant MYB-NFIB fusion displayed sustained protein stability and impacted transcriptional activities of an interferon-associated gene set as compared to wild-type MYB. In vivo tumor formation analysis indicated the capacity of MYB-NFIB fusion cells to grow as implanted tumors, although there were no fusion-mediated growth advantages. This expression system may be useful not only in studies to determine the functional aspects of MYB-NFIB fusion but also in evaluating effective drug response in in vitro and in vivo settings.
Ionizing radiation is a well-appreciated health risk, precipitant of DNA damage, and contributor to DNA methylation variability. Nevertheless, relationships of ionizing radiation with DNA methylation-based markers of biological age (i.e. epigenetic clocks) remain poorly understood. Using existing data from human bronchial epithelial cells, we examined in vitro relationships of three epigenetic clock measures (Horvath DNAmAge, MiAge, and epiTOC2) with galactic cosmic radiation (GCR), which is particularly hazardous due to its high linear energy transfer (LET) heavy-ion components. High-LET 56Fe was significantly associated with accelerations in epiTOC2 (β = 192 cell divisions, 95% CI: 71, 313, p-value = .003). We also observed a significant, positive interaction of 56Fe ions and time-in-culture with epiTOC2 (95% CI: 42, 441, p-value = .019). However, only the direct 56Fe ion association remained statistically significant after adjusting for multiple hypothesis testing. Epigenetic clocks were not significantly associated with high-LET 28Si and low-LET X-rays. Our results demonstrate sensitivities of specific epigenetic clock measures to certain forms of GCR. These findings suggest that epigenetic clocks may have some utility for monitoring and better understanding the health impacts of GCR.
Covariance matrices play fundamental roles in myriad statistical procedures. When the observations in a dataset far outnumber the features, asymptotic theory and empirical evidence have demonstrated the sample covariance matrix to be the optimal estimator of this parameter. This assertion does not hold when the number of observations is commensurate with or smaller than the number of features. Consequently, statisticians have derived many novel covariance matrix estimators for the high-dimensional regime, often relying on additional assumptions about the parameter's structural characteristics (e.g., sparsity). While these estimators have greatly improved the ability to estimate covariance matrices in high-dimensional settings, objectively selecting the best estimator from among the many possible candidates remains a largely unaddressed challenge. The cvCovEst package addresses this methodological gap through its implementation of a cross-validated framework for covariance matrix estimator selection. This data-adaptive procedure's selections are asymptotically optimal under minimal assumptions; in fact, they are equivalent to the selections that would be made if given full knowledge of the true data-generating processes (i.e., an oracle selector) (van der Laan & Dudoit, 2003).
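The idea of cross-validated estimator selection can be sketched in a few lines of NumPy. This is an illustration only and does not use the cvCovEst API: the function names are hypothetical, the candidates are limited to the sample covariance and a fixed-weight linear shrinkage toward the identity, and an observation-level squared Frobenius loss is assumed as the selection criterion.

```python
import numpy as np


def sample_cov(X):
    """Sample covariance matrix (divisor n) of the rows of X."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]


def linear_shrink(alpha):
    """Candidate: convex combination of the sample covariance and the identity."""
    def estimator(X):
        S = sample_cov(X)
        return alpha * S + (1.0 - alpha) * np.eye(S.shape[0])
    return estimator


def cv_select(X, candidates, n_folds=5, seed=0):
    """Score each candidate by cross-validated loss; return (best name, risks)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    risks = {name: 0.0 for name in candidates}
    for fold in folds:
        train = np.delete(np.arange(n), fold)
        # Center validation rows with the training mean.
        V = X[fold] - X[train].mean(axis=0)
        for name, est in candidates.items():
            sigma = est(X[train])
            # Assumed loss: ||x x^T - sigma||_F^2, averaged over the fold.
            losses = [np.sum((np.outer(v, v) - sigma) ** 2) for v in V]
            risks[name] += np.mean(losses) / n_folds
    return min(risks, key=risks.get), risks


# Synthetic high-dimensional example: few observations relative to features.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 20))
candidates = {
    "sample": sample_cov,
    "shrink_0.5": linear_shrink(0.5),
    "shrink_0.1": linear_shrink(0.1),
}
best, risks = cv_select(X, candidates)
print(best, risks)
```

Each candidate is fit on the training folds only and scored on held-out observations, so the comparison does not reward estimators that merely overfit the data they were computed from.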
Background: Researchers need visualization methods (using statistical and interactive techniques) to efficiently perform quality assessments and glean insights from their data. Data on networks can particularly benefit from more advanced techniques since typical visualization methods, such as node-link diagrams, can be difficult to interpret. We use heatmaps and consensus clustering on network data and show they can be combined to easily and efficiently explore nonparametric relationships among the variables and networks that comprise an ego network data set. Methods: We used ego network data from the Québec Adipose and Lifestyle Investigation in Youth (QUALITY) cohort to evaluate this method. The data consist of 35 networks centered on individuals (egos), each containing a maximum of 10 nodes (alters). These networks are described through 41 variables: 11 describing the ego (e.g. fat mass percentage), 18 describing the alters (e.g. frequency of physical activity) and 12 describing the network structure (e.g. degree). Results: Four stable clusters were detected. Cluster one consisted of variables relating to the interconnectivity of the ego networks and the locations of interaction, cluster two consisted of the ego’s age, cluster three contained lifestyle variables and obesity outcomes, and cluster four consisted of variables measuring alter importance and diet. Conclusions: This exploratory method using heatmaps and consensus clustering on network data identified several important associations among variables describing the alters’ lifestyle habits and the egos’ obesity outcomes. Their relevance has been identified by studies on the effect of social networks on childhood obesity.