A deep learning framework combined with word embedding to identify DNA replication origins

Wu, Feng; Zhang, Chengjin

doi:10.1038/s41598-020-80670-x

Cited by 12 publications

(10 citation statements)

References 56 publications

(62 reference statements)


“…In this study, we considered triplets as words given their role in protein coding genes as codons as well as their possible roles in primordial RNA synthesis as discussed in the introduction. In addition, the use of triplets as words during language-based modeling of genome sequences by previous studies has demonstrated that they capture important biological information that can be used for different prediction tasks such as the inference of DNA replication origins [93], identification of enhancers [94] or transcription factor binding sites [95]. Therefore, to apply word2vec to the metagenomes, each species level metagenome sequence was split into non-overlapping triplets, starting from the 5'-end.…”

Section: Training Of the Language (Embedding) Models

mentioning

confidence: 99%

“…• First each genome was split into non-overlapping triplets "TAT ACG GGG AAA ...". In previous biological sequence analysis tasks [93,94], overlapping K-mers were used during word embedding. However, in this study the use of overlapping K-mers would introduce a statistical artefact in examining the relationships between adjacent triplets since overlapping triplets obtained by a sliding through the sequence 1-nucleotide a time would be predictably different by 1 nucleotide.…”

Section: Training Of the Language (Embedding) Models

mentioning

confidence: 99%
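The two tokenization schemes contrasted in the statements above can be sketched as follows. This is a minimal illustration, not the citing authors' code: the function names, the toy sequence, and the choice to drop an incomplete trailing triplet are all assumptions made here. It shows why overlapping 3-mers taken one nucleotide at a time are trivially related to their neighbours, while non-overlapping triplets are not.

```python
def non_overlapping_triplets(seq):
    """Split seq into consecutive triplets starting at the 5'-end.

    Assumption: an incomplete trailing triplet is dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def overlapping_kmers(seq, k=3):
    """Slide a window of width k along seq one nucleotide at a time."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shares_overlap(a, b):
    """True when the last k-1 letters of a equal the first k-1 letters of b."""
    return a[1:] == b[:-1]


# Hypothetical toy sequence, chosen to reproduce the "TAT ACG GGG AAA" example.
seq = "TATACGGGGAAAC"

words = non_overlapping_triplets(seq)
kmers = overlapping_kmers(seq)

print(" ".join(words))
# Adjacent overlapping 3-mers always share two letters by construction.
print(all(shares_overlap(a, b) for a, b in zip(kmers, kmers[1:])))
# Adjacent non-overlapping triplets carry no such built-in relationship.
print(any(shares_overlap(a, b) for a, b in zip(words, words[1:])))
```

A list of such triplet "words" per genome is the kind of input a word2vec implementation would typically take as one sentence; the training step itself is not shown here.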


Genomes contain relics of a triplet code connecting the origins of primordial RNA synthesis to the origins of genetically coded protein synthesis

Siwo

2021

Preprint


Life on earth relies on three types of information polymers: DNA, RNA and proteins. In all organisms and viruses, these molecules are synthesized by the copying of pre-existing templates. A triplet-based code known as the genetic code guides the synthesis of proteins by complex enzymatic machines that decode genetic information in RNA sequences. The origin of the genetic code is one of the most fundamental questions in biology. In this study, computational analysis of about 5,000 species level metagenomes using techniques for the analysis of human language suggests that the genomes of extant organisms contain relics of a distinct triplet code that potentially predates the genetic code. This code defines the relationship between adjacent triplets in DNA/RNA sequences, whereby these triplets predominantly differ by a single base. Furthermore, adjacent triplets encode amino acids that are thought to have emerged around the same period in the earth’s early history. The results suggest that the order of triplets in primordial RNA sequences was associated with the availability of specific amino acids, perhaps due to a coupling of a triplet-based primordial RNA synthesis mechanism to a primitive mechanism of peptide bond formation. Together, this coupling could have given rise to early nucleic acid sequences and a system for encoding amino acid sequences in RNA, i.e. the genetic code. Thus, the central role of triplets in biology potentially extends to the primordial world, contributing to both the origins of genomes and the origins of genetically coded protein synthesis.

Significance

One of the most intriguing discoveries in biology is that the order of amino acids in each protein is determined by the order of nucleotides (commonly represented by the letters A, U, G, C) in a biological molecule known as RNA. The genetic code serves as a dictionary that maps each of the 64 triplets ‘words’ in RNA to the 20 amino acids, thereby specifying how information encoded in RNA is decoded into sequences of amino acids (i.e., proteins). The deciphering of the genetic code was one of the greatest discoveries of the 20th century (1968 Nobel Prize in Medicine and Physiology) and is central to modern molecular biology. Yet, how it came to be that the order of triplets in RNA encodes the sequence of the protein synthesized remains one of the most important enigmas of biology. Paradoxically, in all life forms proteins cannot be synthesized without RNA and RNA itself cannot also be synthesized without proteins, presenting a chicken and egg dilemma. By analyzing thousands of microbial genomes using approaches drawn from the field of natural language processing, this study finds that the order of triplets across genomes contains relics of an ancient triplet code, distinct from but closely connected to the genetic code. Unlike the genetic code which specifies the relationship between information in RNA and the sequence of proteins, this ancient code describes the relationship between adjacent triplets in extant genome sequences, whereby such triplets are often different from each other by a single letter. Triplets that are closely related by this ancient code encode amino acids that are thought to have emerged around the same period in the earth’s early history. In other words, a fossil record of the chronological order of appearance of amino acids on early earth appears written in genome sequences. This potentially demonstrates that the process by which RNA sequences were synthesized in the primordial world relied on triplets and was coupled to amino acids available at the time. Hence, the connections between primordial RNA synthesis and a primitive mechanism for linking amino acids to form peptides could have enabled one type of molecule (RNA) to code for the other (protein), facilitating the emergence of the genetic code.


“…Typically, in language models, word embeddings represent the latent space (dimensions) of a corpus of text (Mikolov et al, 2013) and can capture highly nonlinear and contextual relationships. Codons (tri-nucleotides, 3-mers) translations represent a natural basis for word representations and have been utilized in the past for learning embedding models for modeling various outcomes such as mutation susceptibility (Yilmaz, 2020) and gene sequence correlations (Wu et al, 2021). Recently, Hie et.…”

Section: Introduction

mentioning

confidence: 99%

“…However, there is a paucity of studies that explore the use of unsupervised embeddings for machine learning based prediction of surges in infections. In these models, codons (tri-nucleotides, 3-mers) translations represent a natural basis for word representations and have been utilized in the past for learning embedding models for modelling various outcomes such as mutation susceptibility and gene sequence correlations (Yilmaz, 2020) (Wu et al, 2021). Recently, Hie et.…”

Section: Introduction

mentioning

confidence: 99%

“…Codons with their tri-nucleotide translations represent a biological basis for word representations. They have been utilized to learn embeddings for modeling various outcomes such as mutation susceptibility [6] and gene sequence correlations [7]. Our empirical experiments with the learned embeddings uncovered explainable genomic signals and predicted new caseloads across nine countries.…”

mentioning

confidence: 99%


Genomic Surveillance of COVID-19 Variants with Language Models and Machine Learning

Nagpal

Pal

Ashima

et al.

2021

Preprint


The global efforts to control COVID-19 are threatened by the rapid emergence of novel variants that may display undesirable characteristics such as immune escape or increased pathogenicity. The current approaches to genomic surveillance do not allow early prediction of emerging variations. Here, we derive Dimensions of Concern (DoC) in the latent space of SARS-CoV-2 mutations and demonstrate their potential to provide a lead time for predicting the increase of new cases in 9 countries across the globe. We learned unsupervised word embeddings from 309,060 spike protein coding sequences deposited in the GISAID database until April 2021. We discovered that "blips" in the latent dimensions of embeddings are associated with mutations. We modeled the temporal occurrence of blips and their relationships with the number of new cases in the following months for these countries. Certain dimensions demonstrated a consistent leading relationship between the occurrence of blips and the number of new cases in the following months, and were thus labeled as potential Dimensions of Concern (DoCs). We validated the predictive importance of DoCs by performing Random Forest-based feature selection and modeling in a temporally split training, validation, testing regime. Twelve dimensions achieved statistical significance and yielded an R-squared of 37% for prediction of the number of new cases in the following month. Biological exploration of DoCs revealed that dimensions 3 and 12 capture the 3-mers CGG, ACG and CAC that are associated with known variants L452R, K417T and Q677H respectively. Learning and tracking DoCs is extensible to related challenges such as pandemic preparedness, immune escape, pathogenicity modeling and antimicrobial resistance.


Preliminary Results of Group Detection Technique Based on User to Vector Encoding

Biondi

Franzoni

Milani

2023

Computational Science and Its Applications – ICCSA 2023 Workshops

