Genetically modified genomes are often used today in many areas of fundamental and applied research. In many studies, coding or noncoding regions are modified in order to change protein sequences or gene expression levels. Modifying one or several nucleotides in a genome can also lead to unexpected changes in the epigenetic regulation of genes. When designing a synthetic genome with many mutations, it would thus be very informative to be able to predict the effect of these mutations on chromatin. We develop here a deep learning approach that quantifies the effect of every possible single mutation on nucleosome positions on the full Saccharomyces cerevisiae genome. This type of annotation track can be used when designing a modified S. cerevisiae genome. We further highlight how this track can provide new insights on the sequence-dependent mechanisms that drive nucleosomes’ positions in vivo.
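The idea of scoring every possible single mutation can be illustrated with a minimal in silico mutagenesis sketch. The abstract does not state how the effect of a mutation is measured, so the score used here (summed absolute change in a predicted per-base profile) is an assumption, and `toy_predict` is a hypothetical stand-in for a trained network, not the authors' model.

```python
from typing import Callable, Dict, Iterator, List, Tuple

BASES = "ACGT"


def single_mutants(seq: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, alternative base, mutated sequence) for every single substitution."""
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield pos, alt, seq[:pos] + alt + seq[pos + 1:]


def mutation_scores(
    seq: str, predict: Callable[[str], List[float]]
) -> Dict[Tuple[int, str], float]:
    """Score each substitution by the summed absolute change in the predicted profile.

    The scoring rule is an assumption made for illustration only.
    """
    wild_type = predict(seq)
    scores = {}
    for pos, alt, mutant in single_mutants(seq):
        profile = predict(mutant)
        scores[(pos, alt)] = sum(abs(m - w) for m, w in zip(profile, wild_type))
    return scores


def toy_predict(seq: str) -> List[float]:
    """Hypothetical predictor: returns 1.0 at A/T positions and 0.0 elsewhere."""
    return [1.0 if base in "AT" else 0.0 for base in seq]


scores = mutation_scores("ACGT", toy_predict)
```

A sequence of length n yields 3n scores, one per (position, alternative base) pair, which is the shape of track the abstract describes at genome scale.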
The application of deep neural networks is a rapidly expanding field now reaching many disciplines, including genomics. In particular, convolutional neural networks have been exploited for identifying the functional role of short genomic sequences. These approaches rely on gathering large sets of sequences with known functional roles, extracted from whole-genome annotations. These sets are then split into training, test and validation sets in order to train the networks. While the resulting networks perform well on validation sets, they often perform poorly when applied to whole genomes, in which the ratio of positive to negative examples can be very different from that in the training set. We here address this issue by assessing the genome-wide performance of networks trained with sets exhibiting different ratios of positive to negative examples. As a case study, we use sequences encompassing gene starts from the RefGene database as positive examples and random genomic sequences as negative examples. We then demonstrate that models trained using data from one organism can be used to predict gene-start sites in a related species, when using training sets that provide good genome-wide performance. This cross-species application of convolutional neural networks provides a new way to annotate any genome from existing high-quality annotations in a related reference species. It also provides a way to determine whether the sequence motifs recognised by chromatin-associated proteins in different species are conserved or not.
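Why a network that looks accurate on a balanced validation set can perform poorly genome-wide follows from simple arithmetic on the class ratio. The sketch below is illustrative only: the true-positive and false-positive rates are hypothetical values, not results from the paper.

```python
def precision_at_ratio(tpr: float, fpr: float, negatives_per_positive: float) -> float:
    """Expected precision for a classifier with the given true-positive rate (tpr)
    and false-positive rate (fpr) when each positive example is accompanied by
    `negatives_per_positive` negative examples."""
    true_positives = tpr                            # per positive example
    false_positives = fpr * negatives_per_positive  # per positive example
    total = true_positives + false_positives
    return true_positives / total if total > 0 else 0.0


# Hypothetical classifier: 90% sensitivity, 5% false-positive rate.
balanced = precision_at_ratio(0.9, 0.05, 1)        # balanced validation set
genome_wide = precision_at_ratio(0.9, 0.05, 1000)  # 1000 negatives per positive
print(f"balanced: {balanced:.3f}, genome-wide: {genome_wide:.3f}")
```

With the same error rates, precision falls from roughly 0.95 on the balanced set to under 0.02 when negatives outnumber positives a thousand to one, which is the situation the abstract describes for whole-genome scans.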
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
Genomic sequences co-evolve with DNA-associated proteins to ensure the orderly folding of long DNA molecules into functional chromosomes. In eukaryotes, this multiscale folding involves several molecular complexes and structures, ranging from nucleosomes to large cohesin-mediated DNA loops. To directly explore the causal relationships between the DNA sequence composition and the spontaneous loading and activity of these complexes in the absence of co-evolution, we used and characterized yeast strains carrying exogenous bacterial chromosomes that diverged from eukaryotic sequences over 1.5 billion years ago. By combining this synthetic approach with deep learning-based in silico analysis, we show that sequence composition drives chromatin assembly, transcriptional activity, folding, and compartmentalization in this cellular context. These results are also a step forward in understanding the molecular events at play following natural horizontal gene transfers, and could also be considered in synthetic genomic engineering projects.
Summary: Prediction of genomic annotations from DNA sequences using deep learning is becoming a flourishing field with many applications. Nevertheless, it remains difficult to handle data conveniently when building and training models dedicated to a specific end-user task. keras_dna is designed for easy implementation of Keras models (the TensorFlow high-level API) for genomics. It accepts standard bioinformatics file formats as inputs, such as bigwig, gff, bed, wig, bedGraph, or fasta, and returns standardized inputs for model training. keras_dna is designed to implement existing models but also to facilitate the development of new models that can have single or multiple targets or inputs. Availability: Freely available under an MIT License using pip install keras_dna or by cloning the GitHub repository at https://github.com/etirouthier/keras_dna.git Contact: julien.mozziconacci@mnhn.fr and etienne.routhier@upmc.fr Supplementary information: Extensive documentation can be found online at https://keras-dna.readthedocs.io/en/latest/
The application of deep neural networks is today a rapidly growing field in many disciplines. In genomics, the development of deep neural networks is expected to revolutionize current practice. Several approaches relying on convolutional neural networks have been developed to associate short genomic sequences with a functional role, such as promoters, enhancers or protein binding sites along genomes. These approaches rely on the generation of sequence batches with known annotations for training purposes. While they perform well when predicting annotations from a test subset of these batches, they usually perform poorly when applied genome-wide. In this study, we address this issue and propose an optimal strategy to train convolutional neural networks for this specific application. We use transcription start sites as a case study and show that a model trained on one organism can be used to predict transcription start sites in a different species. This cross-species application of convolutional neural networks trained with genomic sequence data provides a new technique to annotate any genome from previously existing annotations in related species. It also provides a way to determine whether the sequence patterns recognized by chromatin-associated proteins in different species are conserved or not.
The so-called 601 DNA sequence is often used to constrain the position of nucleosomes on a DNA molecule in vitro. Although the ability of this 147 base pair sequence to precisely position a nucleosome in vitro is well documented, in vivo application of this property has been explored in only a few studies, which yielded contradictory conclusions. Our goal in the present study was to test the ability of the 601 sequence to dictate nucleosome positioning in Saccharomyces cerevisiae in the context of a long tandem repeat array inserted in a yeast chromosome. We engineered such arrays with three different repeat sizes, namely 167, 197 and 237 base pairs. Although our arrays are able to position nucleosomes in vitro as expected, analysis of nucleosome occupancy on these arrays in vivo revealed that nucleosomes are not preferentially positioned on the 601-core sequence along the repeats and that the measured nucleosome repeat length does not correspond to the one expected by design. Altogether, our results demonstrate that the rules defining nucleosome positions on this DNA sequence in vitro are not valid in vivo, at least in this chromosomal context, questioning the relevance of using the 601 sequence in vivo to achieve precise nucleosome positioning on designer synthetic DNA sequences.
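The three repeat sizes can be related to the 147 base pair core with a small calculation. This sketch assumes each repeat contains exactly one 601 core and that the remainder is linker DNA; the abstract does not describe the linker explicitly, so that layout is an assumption.

```python
CORE_BP = 147                      # length of the 601 core sequence, from the abstract
REPEAT_SIZES_BP = (167, 197, 237)  # repeat sizes of the engineered arrays


def linker_length(repeat_bp: int, core_bp: int = CORE_BP) -> int:
    """Linker DNA per repeat, assuming one core per repeat (illustrative assumption)."""
    if repeat_bp < core_bp:
        raise ValueError("repeat must be at least as long as the core")
    return repeat_bp - core_bp


linkers = {repeat: linker_length(repeat) for repeat in REPEAT_SIZES_BP}
for repeat, linker in linkers.items():
    print(f"{repeat} bp repeat: {linker} bp outside the 601 core")
```

Under this assumption the designs leave 20, 50 and 90 base pairs outside the core, so a nucleosome repeat length that follows the design would match the repeat size itself.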