the original research plans; F.M. and P.W. conducted the bulk of the computational analysis; B.M. assisted with the machine learning models; P.W. and S.-H.S. wrote the manuscript with contributions from all the authors; S.-H.S. agrees to serve as the author responsible for contact and ensures communication.
Abstract

Availability of genome sequences has led to significant progress in biological sciences and beyond. With few exceptions, the great majority of existing genome assemblies are derived from short-read sequencing technologies. While these assemblies are extremely useful, short-read coverage across them is highly uneven, indicative of sequencing and assembly issues. To assess the underlying causes of such uneven read coverage, we used the tomato genome as an example and integrated multiple sequence features to establish machine learning models capable of predicting whether a genomic region has significantly high or low read coverage. Importantly, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of the tomato genome assembly had significantly higher and lower coverage compared to background, respectively. By evaluating features important for the prediction, we found that GC content and a high density of transposable elements are the major contributors to breakpoints in an assembly, leading to gaps filled with Ns and the resulting low read coverage. In contrast, simple sequence repeats and tandemly duplicated genes, especially specialized metabolism genes, tend to be misassembled, resulting in high read coverage. We also present evidence of a misassembled region containing tandemly duplicated specialized metabolism genes. The presence of variable coverage regions is expected to significantly impact genome-wide studies, highlighting the need to detect them in short-read-based assemblies.
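To make the notion of regions with significantly high or low coverage relative to background concrete, the following is a minimal sketch in Python. It is not the procedure used in this study: the abstract does not specify the statistical test, so the sketch assumes that background is the median of per-window mean depths and that spread is the median absolute deviation (MAD). The function names, the window size, and the cutoff `k` are all hypothetical.

```python
from statistics import median


def window_means(depth, window):
    """Mean read depth over consecutive, non-overlapping windows."""
    means = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def classify_windows(means, k=3.0):
    """Label each window 'high', 'low', or 'background'.

    Assumption (not from the paper): a window is an outlier when its mean
    depth is more than k MADs away from the median of all window means.
    """
    med = median(means)
    mad = median(abs(m - med) for m in means)
    labels = []
    for m in means:
        if m > med + k * mad:
            labels.append("high")
        elif m < med - k * mad:
            labels.append("low")
        else:
            labels.append("background")
    return labels


# Toy per-base depth: mostly ~30x, one gap-like stretch, one collapsed repeat.
depth = [30] * 40 + [0] * 10 + [31] * 20 + [120] * 10 + [29] * 20
labels = classify_windows(window_means(depth, window=10))
print(labels)
```

In this toy input, the zero-depth window stands in for a gap filled with Ns and the 120x window stands in for a collapsed tandem duplication. On real data a MAD of zero would flag every window that differs from the median, so a floor on the spread, or a different test altogether, would likely be needed.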