Background: Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection.
In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries.
Findings: We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole genome assembly on the data.
We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (C_F) or read coverage per fragment (C_R) within broad ranges. The optimal physical coverage was between 332X and 823X, and assembly quality worsened if it increased to greater than 1,000X for a given C. Long DNA fragments could significantly extend phase blocks, but decreased contig contiguity. The optimal length-weighted fragment length (Wμ_FL) was around 50-150 kb. When broadly optimal parameters were used for library preparation and sequencing, ca. 80% of the genome was assembled in a diploid state.
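The relationship between the coverage terms above can be illustrated with a little arithmetic. The sketch below is not taken from LRTK-SIM. It assumes the decomposition C = C_F × C_R and takes the length-weighted fragment length to be sum(L²) / sum(L); neither formula is spelled out in this abstract. The function names and the example total coverage of 56X are hypothetical and for illustration only.

```python
def read_coverage_per_fragment(total_coverage, physical_coverage):
    """Return C_R, assuming total coverage factors as C = C_F * C_R."""
    if physical_coverage <= 0:
        raise ValueError("physical coverage (C_F) must be positive")
    return total_coverage / physical_coverage


def length_weighted_mean(fragment_lengths):
    """Length-weighted mean fragment length, assumed to be sum(L^2) / sum(L).

    Each fragment contributes in proportion to its own length, so long
    fragments dominate the statistic more than in a plain average.
    """
    total = sum(fragment_lengths)
    if total <= 0:
        raise ValueError("need at least one fragment of positive length")
    return sum(length * length for length in fragment_lengths) / total


if __name__ == "__main__":
    C = 56.0  # hypothetical total sequencing coverage, illustration only
    # Physical coverages at the edges of, and beyond, the reported optimal range.
    for c_f in (332.0, 823.0, 1000.0):
        c_r = read_coverage_per_fragment(C, c_f)
        print(f"C_F = {c_f:.0f}X -> C_R = {c_r:.3f}X")

    # Hypothetical fragment lengths in kb.
    print(length_weighted_mean([20, 50, 100, 150]))
```

Under this assumption, fixing C means that raising C_F lowers C_R proportionally, which is one way to read the observation that assembly quality worsens once physical coverage exceeds 1,000X for a given C.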
Conclusion: The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.

Keywords: 10x Linked-Read sequencing, de novo assembly, diploid human genome, library preparation

Data description

Introduction

The human genome holds the key for understanding the genetic basis of human evolution, hereditary illnesses and many phenotypes. Whole-genome reconstruction and variant discovery, accomplished by analysis of data from whole-genome sequencing experiments, are foundational for the study of human genomic variation and analysis of genotype-phenotype relationships. Over the past decades, cost-effective whole-genome sequencing has been revolutionized by short-fragment approaches, the most widespread of which have been the consistently improving generations of the original Solexa technology [1, 2], now referred to as Illumina sequencing. Illumina's strengths and weaknesses are inherent in the sample preparation and sequencing chemistry. Illumina generates short paired reads (2x150 base pairs for the highest-throughput platforms) from short fragments (usually 400-500 base pairs) [3]. Because many clonally amplified molecules generate a robust signal during the sequencing reaction, Illumina's average per-base error rates are very low.
The lack of long-range contiguity between end-sequenced short fragments limits their application for reconstructing personal genomes. Long-rang...