Pierre-Aurélien Gilliot scite author profile

Current Opinion in Chemical Biology

2020

The ability to read and quantify nucleic acids such as DNA and RNA using sequencing technologies has revolutionized our understanding of life. With the emergence of synthetic biology, these tools are now being put to work in new ways, enabling de novo biological design. Here, we show how sequencing is supporting the creation of a new wave of biological parts and systems, as well as providing the vast data sets needed for the machine learning of design rules for predictive bioengineering. However, we believe this is only the tip of the iceberg and end by providing an outlook on recent advances that will likely broaden the role of sequencing in synthetic biology and its deployment in real-world environments.

Highlights

• Sequencing can capture detailed information about diverse biological processes.
• Synthetic biology is beginning to exploit sequencing to aid design.
• Large sequencing datasets are powering new machine learning approaches in biology.
• Emerging trends will see the application of sequencing in synthetic biology grow.

Sequencing Enabling Design and Learning in Synthetic Biology

Gilliot¹, Gorochowski²

2020

Preprint

The ability to read and quantify nucleic acids such as DNA and RNA using sequencing technologies has revolutionized our understanding of life. With the emergence of synthetic biology, these tools are now being put to work in new ways, enabling de novo biological design. Here, we show how sequencing is supporting the creation of a new wave of biological parts and systems, as well as providing the vast data sets needed for the machine learning of design rules for predictive bioengineering. However, we believe this is only the tip of the iceberg and end by providing an outlook on recent advances that will likely broaden the role of sequencing in synthetic biology and its deployment in real-world environments.

Highlights

• Sequencing can capture detailed information about diverse biological processes.
• Synthetic biology is beginning to exploit sequencing to aid design.
• Large sequencing datasets are powering new machine learning approaches in biology.
• Emerging trends will see the application of sequencing in synthetic biology grow.

Preprints (www.preprints.org) | NOT PEER-REVIEWED

... few decades. This has resulted in DNA-seq becoming the go-to method for part discovery, allowing for genetic information to be extracted from virtually any environment and organism, including those not even culturable in the lab [17].

While DNA-seq is able to uncover sequences that encode biological parts, it does not capture any information about how they might perform. For parts controlling gene expression, such information is vital because precise levels of expression are often required for a device or system to function correctly. Computational models have been developed to try to bridge this gap [18,19], but their reliability is questionable when used outside of model organisms like Escherichia coli. For some key parts, such as transcriptional promoters and terminators, RNA sequencing (RNA-seq) can be used to measure part performance directly, providing a snapshot of RNA abundance at a point in time [2]. Furthermore, RNA-seq is able to characterize all promoters and terminators present in a cell simultaneously, if the transcripts produced are unique [15,20]. More recently, a system for DNA Regulatory Element Analysis by Cell-Free Transcription and Sequencing (DRAFTS) was developed to enable rapid high-throughput measurements of regulatory sequences controlling transcription [21]. This method brings together cell-free expression systems with multiplexed RNA-seq to allow for in vitro characterization of regulatory parts in a wide range of different organisms. This approach has been shown to display a good correlation with in vivo part performance and will not only expand the number of parts available to bioengineers, but also provide crucial information about which non-model organisms they can be effectively used in.

Design and optimization of genetic and molecular parts

Biological parts taken directly f...

Effective design and inference for cell sorting and sequencing based massively parallel reporter assays

2022

Preprint

The ability to measure the phenotype of millions of different genetic designs using Massively Parallel Reporter Assays (MPRAs) has revolutionised our understanding of genotype-to-phenotype relationships and opened avenues for data-centric approaches to biological design. However, our knowledge of how best to design these costly experiments and the effect that our choices have on the quality of the data produced is lacking. Here, we tackle this issue by developing FORECAST, a Python package that supports the accurate simulation of cell-sorting and sequencing based MPRAs and robust maximum likelihood based inference of genetic design function from MPRA data. We use FORECAST's capabilities to reveal rules for MPRA experimental design that help ensure accurate genotype-to-phenotype links and show how the simulation of MPRA experiments can help us better understand the limits of prediction accuracy when this data is used for training deep learning based classifiers. As the scale and scope of MPRAs grow, tools like FORECAST will help ensure we make informed decisions during their development and make the most of the data produced.
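The simulate-then-infer loop described in this abstract can be illustrated with a toy example. The sketch below is not FORECAST's API: every function name, parameter, and number is a hypothetical choice made for illustration. It assumes that a design's log-fluorescence is normally distributed with a known spread, that cells are sorted into bins at fixed thresholds, and that each sorted cell yields a sequencing read with a fixed probability. The design's mean is then recovered by maximising a multinomial likelihood over a grid.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bin_probabilities(mu, sigma, edges):
    """Probability that a cell's log-fluorescence lands in each sorting bin."""
    bounds = [-math.inf] + list(edges) + [math.inf]
    return [normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
            for lo, hi in zip(bounds[:-1], bounds[1:])]


def simulate_counts(mu, sigma, edges, n_cells, read_fraction, rng):
    """Sort simulated cells into bins, then keep each one as a read
    with probability read_fraction (a crude stand-in for sequencing)."""
    counts = [0] * (len(edges) + 1)
    for _ in range(n_cells):
        x = rng.gauss(mu, sigma)
        bin_index = sum(x > edge for edge in edges)
        if rng.random() < read_fraction:
            counts[bin_index] += 1
    return counts


def log_likelihood(counts, mu, sigma, edges):
    """Multinomial log-likelihood of the bin counts (constant term dropped)."""
    probs = bin_probabilities(mu, sigma, edges)
    return sum(c * math.log(max(p, 1e-300)) for c, p in zip(counts, probs))


def fit_mu(counts, sigma, edges, grid):
    """Grid-search maximum likelihood estimate of mu with sigma held fixed."""
    return max(grid, key=lambda mu: log_likelihood(counts, mu, sigma, edges))


# Hypothetical experiment: four bins on a log10 fluorescence scale.
rng = random.Random(0)
edges = [1.0, 2.0, 3.0]
counts = simulate_counts(mu=2.3, sigma=0.4, edges=edges,
                         n_cells=5000, read_fraction=0.5, rng=rng)
grid = [i / 100 for i in range(401)]
mu_hat = fit_mu(counts, sigma=0.4, edges=edges, grid=grid)
print("reads per bin:", counts)
print("estimated mean:", mu_hat)
```

Rerunning the simulation with fewer cells, fewer bins, or a lower read fraction shows how such design choices affect the accuracy of the recovered mean, which is the kind of question the abstract raises.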

Transfer learning for cross-context prediction of protein expression from 5’UTR sequence

2023

Preprint

Model-guided DNA sequence design can accelerate the reprogramming of living cells. It allows us to engineer more complex biological systems by removing the need to physically assemble and test each potential design. While mechanistic models of gene expression have seen some success in supporting this goal, data-centric, deep learning-based approaches often provide more accurate predictions. This accuracy, however, comes at a cost: a lack of generalisation across genetic and experimental contexts, which has limited their wider use outside the context in which they were trained. Here, we address this issue by demonstrating how a simple transfer learning procedure can effectively tune a pre-trained deep learning model to predict protein translation rate from 5' untranslated region (5'UTR) sequence for diverse contexts in Escherichia coli using a small number of new measurements. This allows for important model features learnt from expensive massively parallel reporter assays to be easily transferred to new settings. By releasing our trained deep learning model and complementary calibration procedure, this study acts as a starting point for continually refined model-based sequence design that builds on previous knowledge and future experimental efforts.
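One very simple way to adapt a trained predictor to a new context is to keep it frozen and fit an affine recalibration of its output on a handful of new measurements. The sketch below shows that idea only; it is a simplification and not the procedure or model released with the study. `pretrained_score` is a hypothetical toy stand-in (purine fraction) for a real pre-trained network, and the sequences and measurements are synthetic.

```python
def pretrained_score(utr):
    """Hypothetical stand-in for a frozen pre-trained model.

    Returns the fraction of purines (A or G) in the sequence. A real
    model would be a trained network; this toy score is only a placeholder.
    """
    return sum(base in "AG" for base in utr) / len(utr)


def fit_affine(scores, measurements):
    """Least squares fit of measurements ~ slope * score + intercept."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(measurements) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(scores, measurements))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def calibrated_predict(utr, slope, intercept):
    """Prediction for the new context: frozen score, recalibrated output."""
    return slope * pretrained_score(utr) + intercept


# Synthetic calibration set for a hypothetical new context.
new_utrs = ["CCTTCC", "ACGTAC", "AGGAGG"]
new_measurements = [0.5, 1.5, 2.5]  # made-up values, arbitrary units

slope, intercept = fit_affine(
    [pretrained_score(u) for u in new_utrs], new_measurements)
print("slope:", slope, "intercept:", intercept)
print("prediction for AGGACT:", calibrated_predict("AGGACT", slope, intercept))
```

Only two parameters are fitted here, so a few measurements suffice. Fine-tuning some or all of a network's weights follows the same pattern with more parameters and typically needs more data.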

Effective design and inference for cell sorting and sequencing based massively parallel reporter assays

2023

Motivation: The ability to measure the phenotype of millions of different genetic designs using Massively Parallel Reporter Assays (MPRAs) has revolutionised our understanding of genotype-to-phenotype relationships and opened avenues for data-centric approaches to biological design. However, our knowledge of how best to design these costly experiments and the effect that our choices have on the quality of the data produced is lacking.

Results: In this article, we tackle the issues of data quality and experimental design by developing FORECAST, a Python package that supports the accurate simulation of cell-sorting and sequencing based MPRAs and robust maximum likelihood based inference of genetic design function from MPRA data. We use FORECAST's capabilities to reveal rules for MPRA experimental design that help ensure accurate genotype-to-phenotype links and show how the simulation of MPRA experiments can help us better understand the limits of prediction accuracy when this data is used for training deep learning based classifiers. As the scale and scope of MPRAs grow, tools like FORECAST will help ensure we make informed decisions during their development and make the most of the data produced.

Availability and implementation: The FORECAST package is available at: https://gitlab.com/Pierre-Aurelien/forecast. Code for the deep learning analysis performed in this study is available at: https://gitlab.com/Pierre-Aurelien/rebeca.

Supplementary information: Supplementary data are available at Bioinformatics online.