Over the past decades, remarkable progress has been made in large-scale de novo oligonucleotide synthesis based on phosphoramidite chemistry, enabling numerous novel and exciting applications. Among them, de novo genome synthesis and DNA data storage are particularly striking. However, to make these two applications more practical, synthesis length, speed, cost, and throughput require vast improvements, which are challenging for phosphoramidite chemistry to deliver. Harnessing the power of enzymes, the recently emerged enzymatic methods provide a competitive route to overcome this challenge. In this review, we first summarize the status of large-scale oligonucleotide synthesis technologies, including the basic methodology and large-scale synthesis approaches, with a special focus on the emerging enzymatic methods. Afterward, we discuss the opportunities and challenges of large-scale oligonucleotide synthesis for de novo genome synthesis and DNA data storage, respectively.
DNA data storage is a rapidly developing technology with great potential due to its high density, long-term durability, and low maintenance cost. The major technical challenges include various errors, such as strand breaks, rearrangements, and indels, that frequently arise during DNA synthesis, amplification, sequencing, and preservation. In this study, a de novo strand assembly algorithm (DBGPS) is developed using a de Bruijn graph and greedy path search to meet these challenges. DBGPS shows substantial advantages in handling DNA breaks, rearrangements, and indels. The robustness of DBGPS is demonstrated by accelerated aging, multiple independent data retrievals, deep error-prone PCR, and large-scale simulations. Remarkably, 6.8 MB of data is accurately recovered from a severely corrupted sample that has been treated at 70 °C for 70 days. With DBGPS, we are able to achieve a logical density of 1.30 bits/cycle and a physical density of 295 PB/g.
High density and long-term durability make DNA a promising data storage medium. However, the DNA data channel is a unique channel with unavoidable ‘data repetitions’ in the form of multiple error-rich strand copies. This multi-copy feature cannot be well harnessed by available codec systems optimized for single-copy media. Furthermore, lacking an effective mechanism to handle base shift issues, these systems perform poorly with indels. Here, we report the efficient reconstruction of DNA strands directly from multiple error-rich sequences, utilizing a De Bruijn Graph-based Greedy Path Search (DBG-GPS) algorithm. DBG-GPS can take advantage of the multi-copy feature for efficient correction of indels as well as substitutions. Error rates as high as 10% can be accurately corrected with a high coding rate of 96.8%. Accurate data recovery from low-quality, deep error-prone PCR products demonstrated the high robustness of DBG-GPS (314Kb, 12K oligos). Furthermore, DBG-GPS is 50 times faster than the reported clustering and multiple alignment-based methods. The revealed linear decoding complexity makes DBG-GPS a suitable solution for large-scale data storage. DBG-GPS’s capacity with large data was verified by large-scale simulations (300 MB). A Python implementation of DBG-GPS is available at https://switch-codes.coding.net/public/switch-codes/DNA-Fountain-De-Bruijn-Decoding/git/files.
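The general idea of rebuilding a strand from several error-rich copies with a de Bruijn graph and a greedy path search can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `count_kmers` and `greedy_assemble`, the `min_count` threshold, the known 4-nt start anchor, the known strand length, and the toy reads are all assumptions made for illustration. The published method also validates candidate paths, which this sketch omits.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads (edge weights of a de Bruijn graph)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_assemble(counts, start, length, k, min_count=2):
    """Extend `start` one base at a time, following the best-supported k-mer.

    `start` must hold at least k-1 bases. Returns None on a dead end.
    """
    strand = start
    while len(strand) < length:
        suffix = strand[-(k - 1):]
        # Candidate next bases: k-mers extending the current (k-1)-suffix
        # that are seen often enough to be trusted (hypothetical threshold).
        candidates = [
            (counts[suffix + base], base)
            for base in "ACGT"
            if counts[suffix + base] >= min_count
        ]
        if not candidates:
            return None
        strand += max(candidates)[1]  # greedy choice: highest count wins
    return strand


# Toy example: four noisy copies of one 20-nt strand, none of them error-free.
reads = [
    "ACGTTGCATGGCTATCCGAT",   # substitution
    "ACGTTGCAAGGCTTCCGAT",    # deletion
    "ACGTGTGCAAGGCTATCCGAT",  # insertion
    "ACGTTGCAAGGCTATCAGAT",   # substitution
]
counts = count_kmers(reads, k=5)
recovered = greedy_assemble(counts, start="ACGT", length=20, k=5)
print(recovered)
```

In this toy case each error produces k-mers seen only once, so the count threshold prunes them and the greedy walk follows k-mers shared by several copies. Because the walk works on k-mers rather than aligned positions, an insertion or deletion in one copy does not shift the others out of register.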