2023
DOI: 10.1101/2023.01.04.522704
Preprint
MuLan-Methyl - Multiple Transformer-based Language Models for Accurate DNA Methylation Prediction

Abstract: Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is bas…


Cited by 5 publications (11 citation statements)
References 61 publications (70 reference statements)
“…Accuracy 23 refers to the proportion of correct predictions with respect to the total predictions. Specificity or true negative rate (TNR) 20 is the model’s ability to correctly predict the negative class samples.…”
Section: Methods
Classification: mentioning (confidence: 99%)
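As a quick illustration of the two measures described in this excerpt, here is a minimal Python sketch computed from confusion-matrix counts; the function names and example counts are illustrative and not taken from MuLan-Methyl or the citing papers.

def accuracy(tp, tn, fp, fn):
    # Proportion of correct predictions among all predictions.
    return (tp + tn) / (tp + tn + fp + fn)

def specificity(tn, fp):
    # True negative rate: correct negative predictions over all actual negatives.
    return tn / (tn + fp)

# Hypothetical counts: accuracy(40, 45, 5, 10) == 0.85, specificity(45, 5) == 0.9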
“…It is determined by dividing the number of correct negative predictions by the total number of true negatives. Sensitivity (or recall) 23 measures the ability of the model to predict positive class samples by taking the ratio of correct positive predictions to the predictions on positive samples. MCC 78 calculates the correlation between the model predictions and the true class, by taking into consideration true positives, true negatives, false positives, and false negatives.…”
Section: Methods
Classification: mentioning (confidence: 99%)
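In the same spirit, a small sketch for sensitivity and MCC, again with illustrative confusion-matrix counts rather than values from the cited work:

import math

def sensitivity(tp, fn):
    # Recall / true positive rate: correct positive predictions over all actual positives.
    return tp / (tp + fn)

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient computed from the four confusion-matrix cells.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts: sensitivity(40, 10) == 0.8; mcc(40, 45, 5, 10) ≈ 0.70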
“…It is determined by dividing the number of correct negative predictions by the total number of true negatives. Sensitivity (or recall) 23 In the mathematical expression above, T + and T − denote the true predictions related to positive and negative classes, whereas F + and F − are the incorrect predictions related to the positive and negative classes respectively.…”
Section: Evaluation Measures
Classification: mentioning (confidence: 99%)
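The excerpt above refers to a mathematical expression that was cut off in the extract. Under the notation it defines (T+ and T− for correct positive and negative predictions, F+ and F− for incorrect ones), the standard forms of these measures, which may differ in detail from the citing paper's own expression, would read in LaTeX:

\mathrm{Specificity} = \frac{T^{-}}{T^{-} + F^{+}}, \qquad
\mathrm{Sensitivity} = \frac{T^{+}}{T^{+} + F^{-}}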
“…For example, self-supervised tasks such as masked language modelling (MLM) have recently been used to pretrain genomic sequence embeddings that are then fine-tuned for downstream tasks (e.g. Ji et al (2021); Mo et al (2021); Benegas et al (2022); Zeng et al (2023)). Pretraining using task-relevant data can improve the performance of fine-tuned models (Gururangan et al, 2020), while pretraining using irrelevant data can hurt performance (Liu et al, 2022).…”
Section: Introduction
Classification: mentioning (confidence: 99%)
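To make the masked language modelling (MLM) objective mentioned in this excerpt concrete, below is a minimal PyTorch-style sketch of the masking step. It is not MuLan-Methyl's actual training code; the toy vocabulary, mask-token id, and encoder in the usage comments are placeholders.

import torch
import torch.nn.functional as F

def mask_tokens(input_ids, mask_token_id, mlm_probability=0.15):
    # Randomly hide a fraction of tokens; labels are -100 everywhere except the
    # masked positions, so the loss is computed only on what the model must recover.
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mlm_probability
    labels[~mask] = -100
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = mask_token_id
    return masked_inputs, labels

# Hypothetical usage with a toy 5-token vocabulary (A, C, G, T, [MASK]):
# ids = torch.randint(0, 4, (2, 16))            # batch of tokenised DNA sequences
# inputs, labels = mask_tokens(ids, mask_token_id=4)
# logits = encoder(inputs)                      # any token-level encoder -> (batch, length, vocab)
# loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)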