Soobok Joe scite author profile

DNA methylation patterns have been shown to change throughout the normal aging process. Several studies have found epigenetic aging markers using age predictors, but these studies only focused on blood-specific or tissue-common methylation patterns. Here, we constructed nine tissue-specific age prediction models using methylation array data from normal samples. The constructed models predict the chronological age with good performance (mean absolute error of 5.11 years on average) and show better performance in the independent test than previous multi-tissue age predictors. We also compared tissue-common and tissue-specific aging markers and found that they had different characteristics. Firstly, the tissue-common group tended to contain more positive aging markers with methylation values that increased during the aging process, whereas the tissue-specific group tended to contain more negative aging markers. Secondly, many of the tissue-common markers were located in Cytosine-phosphate-Guanine (CpG) island regions, whereas the tissue-specific markers were located in CpG shore regions. Lastly, the tissue-common CpG markers tended to be located in more evolutionarily conserved regions. In conclusion, our prediction models identified CpG markers that capture both tissue-common and tissue-specific characteristics during the aging process.
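The abstract above reports model quality as a mean absolute error (MAE) in years. As a rough illustration of what that metric measures, here is a minimal sketch. The function name `mean_absolute_error` and the example ages are hypothetical and are not taken from the study.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in years, between true and predicted ages."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must have the same length")
    if not true_ages:
        raise ValueError("inputs must be non-empty")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical chronological and predicted ages (years), not data from the study.
true_ages = [30.0, 45.0, 60.0, 72.0]
predicted_ages = [34.0, 41.0, 66.0, 70.0]

mae = mean_absolute_error(true_ages, predicted_ages)
print(f"MAE: {mae:.2f} years")
```

A lower MAE means predicted ages sit closer to chronological ages on average. The abstract's figure of 5.11 years is an average across the nine tissue-specific models.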

Microbiome Preterm Birth DREAM Challenge: Crowdsourcing Machine Learning Approaches to Advance Preterm Birth Research

Golob

Oskotsky

Tang

et al. 2023

Preprint

Globally, about 11% of infants are born preterm each year, defined as birth prior to 37 weeks of gestation, with significant and lingering health consequences. Multiple studies have related the vaginal microbiome to preterm birth. We present a crowdsourcing approach to predict (a) preterm or (b) early preterm birth from 9 publicly available vaginal microbiome studies representing 3,578 samples from 1,268 pregnant individuals, aggregated from raw sequences via an open-source tool, MaLiAmPi. We validated the crowdsourced models on novel datasets representing 331 samples from 148 pregnant individuals. From 318 DREAM challenge participants we received 148 and 121 submissions for our two separate prediction sub-challenges, with top-ranking submissions achieving bootstrapped AUROC scores of 0.69 and 0.87, respectively. Alpha diversity, VALENCIA community state types, and composition (via phylotype relative abundance) were important features in the top-performing models, most of which were tree-based methods. This work serves as the foundation for subsequent efforts to translate predictive tests into clinical practice, and to better understand and prevent preterm birth.
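The abstract above scores submissions by bootstrapped AUROC. The challenge's exact resampling procedure is not described here, so the following is only a sketch of one common approach: resample the samples with replacement, compute AUROC on each resample, and average. The function names, the resample count, the seed, and the labels and scores are all hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Labels are assumed to be 0 (negative) or 1 (positive).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_resamples=1000, seed=0):
    """Mean AUROC over bootstrap resamples drawn with replacement."""
    auroc(labels, scores)  # validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            values.append(auroc(ys, [scores[i] for i in idx]))
    return sum(values) / len(values)


# Hypothetical labels (1 = preterm) and model scores, not challenge data.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.5, 0.2, 0.1, 0.7]

point_estimate = auroc(labels, scores)
bootstrapped = bootstrap_auroc(labels, scores, n_resamples=200, seed=0)
print(f"AUROC: {point_estimate:.4f}, bootstrapped mean: {bootstrapped:.4f}")
```

The pairwise AUROC here is quadratic in the number of samples, which is fine for a sketch but not for large cohorts.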

Prognostic factor analysis for breast cancer using gene expression profiles

Joe

Nam

2016

BMC Med Inform Decis Mak

Background: The survival of patients with breast cancer is highly variable, ranging from a few months to more than 15 years. In recent studies, the gene expression profiling of tumors has been used as a promising means of predicting prognostic factors. Methods: In this study, we used gene expression datasets of tumors to identify prognostic factors in breast cancer. We conducted log-rank tests and used unsupervised clustering methods to find reciprocally expressed gene sets associated with worse survival rates. Prognosis prediction scores were determined as the ratio of gene expressions. Results: Four prognosis prediction gene set modules were constructed. The four prognostic gene sets predicted worse survival rates in three independent gene expression data sets. In addition, we found that patients with poor prognosis, i.e., those with triple-negative, HER2-enriched, TP53-mutated, or high-grade tumors, had higher prognosis prediction scores than those with other types of breast cancer. Conclusions: Based on a gene expression analysis, we suggest that our well-defined scoring method for predicting survival outcome may be useful for developing prognostic factors in breast cancer. Electronic supplementary material: The online version of this article (doi:10.1186/s12911-016-0292-5) contains supplementary material, which is available to authorized users.
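The abstract above states that prognosis prediction scores were determined as a ratio of gene expressions between reciprocally expressed gene sets, but it does not give the formula. The sketch below assumes one plausible form, the mean expression of one gene set divided by the mean expression of the other. The function name, gene names, and expression values are hypothetical.

```python
def prognosis_score(expression, numerator_genes, denominator_genes):
    """Ratio of mean expression between two reciprocally expressed gene sets.

    Assumption: the score is mean(numerator set) / mean(denominator set);
    the abstract does not specify how the ratio is formed.
    """
    if not numerator_genes or not denominator_genes:
        raise ValueError("both gene sets must be non-empty")
    num = sum(expression[g] for g in numerator_genes) / len(numerator_genes)
    den = sum(expression[g] for g in denominator_genes) / len(denominator_genes)
    if den == 0:
        raise ValueError("denominator gene set has zero mean expression")
    return num / den


# Hypothetical expression values for one tumor sample (placeholder gene names).
expression = {"GENE_A": 8.0, "GENE_B": 6.0, "GENE_C": 2.0, "GENE_D": 5.0}

score = prognosis_score(expression, ["GENE_A", "GENE_B"], ["GENE_C", "GENE_D"])
print(f"prognosis score: {score:.2f}")
```

Under this assumed form, a higher score means the first gene set is expressed more strongly relative to the second.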

Prediction model construction of mouse stem cell pluripotency using CpG and non-CpG DNA methylation markers

Joe

Nam

2020

BMC Bioinformatics

Background: Genome-wide studies of DNA methylation across the epigenetic landscape provide insights into the heterogeneity of pluripotent embryonic stem cells (ESCs). Differentiating into embryonic somatic and germ cells, ESCs exhibit varying degrees of pluripotency, and epigenetic changes occurring in this process have emerged as important factors explaining stem cell pluripotency. Results: Here, using paired scBS-seq and scRNA-seq data from mice, we constructed a machine learning model that predicts degrees of pluripotency for mouse ESCs. Since the biological activities of non-CpG markers have yet to be clarified, we tested the predictive power of CpG and non-CpG markers, as well as a combination thereof, in the model. Through rigorous performance evaluation with both internal and external validation, we found that a model using both CpG and non-CpG markers predicted the pluripotency of ESCs with the highest prediction performance (0.956 AUC, external test). The prediction model consisted of 16 CpG and 33 non-CpG markers. The CpG and most of the non-CpG markers targeted depletions of methylation and were indicative of cell pluripotency, whereas only a few non-CpG markers reflected accumulations of methylation. Additionally, we confirmed that pluripotency differs between individual developmental stages, such as E3.5 and E6.5, as well as between mouse induced pluripotent stem cells (iPSCs) and somatic cells. Conclusions: In this study, we investigated CpG and non-CpG methylation in relation to mouse stem cell pluripotency and developed a model that successfully predicts the pluripotency of mouse ESCs.

Identification of a Specific Base Sequence of Pathogenic E. Coli through a Genomic Analysis

Joe

Nam

2014

E. coli sequence type 131 (ST131) is one of the pathogens that cause resistant infections. Comparative genome analyses allow interpretations of the virulence factors of pathogens. Thus, in this study, we analyze the genomic differences between the pathogenic E. coli ST131 and the non-pathogenic E. coli K-12. Specifically, we identify the genomic differences between 96 E. coli ST131 strains and E. coli K-12 in gene elements and their non-coding regulatory elements. Using next-generation whole-genome sequencing data, we investigated genetic variations in protein-coding regions and their regulatory regions. After the alignment of the sequence reads, large numbers of single nucleotide variants (SNVs) were observed in the regulatory and protein-coding sequences. In the regulatory regions, we found strongly conserved regions, in this case ribosome binding sites. In the gene regions, we found conserved start and stop codons, with a specific position commonly varying in each codon. Except for these well-conserved regions, other variations were randomly distributed in the regulatory regions. Even regions with well-known conserved sequences, such as the -10 and -35 elements of the promoter, had a similar level of variation. In summary, we found genomic variations between the pathogenic E. coli ST131 strains and the non-pathogenic E. coli K-12. In addition, the numbers of sequence variations were determined in both the protein-coding regions and the regulatory regions. However, we found that the effects of variations on the protein-coding regions are less significant than those on the regulatory regions.

Soobok Joe

Development of Tissue-Specific Age Predictors Using DNA Methylation Data
