Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases

Tørresen, Ole K.; Star, Bastiaan; Mier, Pablo; Andrade‐Navarro, Miguel A.; Bateman, Alex; Jarnot, Patryk; Gruca, Aleksandra; Grynberg, Marcin; Kajava, Andrey V.; Promponas, Vasilis J.; Anisimova, Maria; Jakobsen, Kjetill S.; Linke, Dirk

doi:10.1093/nar/gkz841

Cited by 231 publications

(207 citation statements)

References 138 publications

Citation types: Supporting, Mentioning (204), Contrasting


“…In brief, high quality Illumina reads were prepared using TrimGalore (https://github.com/FelixKrueger/TrimGalore) based on the following criteria: (i) no “N” base, (ii) trimming of adaptor sequences and low quality bases (Q<20), (iii) no trimmed reads < 100 bp. To avoid mis-assembly due to repetitive sequences (Tørresen et al, 2019), PacBio SEQUEL subreads with repetitive sequences comprised over 85% of total sequences were filtered out. The GC content criteria (<25% and >85%) was applied for filtering low complexity DNA sequences before assembly.…”

Section: Methods (mentioning)

confidence: 99%
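The filtering criteria quoted in the statement above can be sketched roughly as follows. This is an illustrative sketch only, not the citing authors' pipeline: the function names are hypothetical, and the repeat fraction is assumed to come from upstream soft-masking (lowercase bases), which the quoted text does not specify.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases among A/C/G/T bases (case-insensitive)."""
    s = seq.upper()
    acgt = sum(s.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (s.count("G") + s.count("C")) / acgt


def masked_fraction(seq: str) -> float:
    """Fraction of lowercase bases.

    Assumption: repeats were soft-masked (lowercased) by an upstream tool.
    """
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


def keep_short_read(seq: str, min_len: int = 100) -> bool:
    """Short-read criteria from the quote: no 'N' base, length >= 100 bp."""
    return len(seq) >= min_len and "N" not in seq.upper()


def keep_long_read(seq: str, max_repeat: float = 0.85,
                   gc_low: float = 0.25, gc_high: float = 0.85) -> bool:
    """Long-read criteria from the quote: drop reads that are more than 85%
    repetitive, or whose GC content is below 25% or above 85%."""
    if not seq:
        return False
    if masked_fraction(seq) > max_repeat:
        return False
    return gc_low <= gc_fraction(seq) <= gc_high
```

Adapter and quality trimming (Q<20) are left to TrimGalore, as in the quoted text; the sketch covers only the pass/fail decisions applied to each read.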

A chromosome-level assembly of the black tiger shrimp (Penaeus monodon) genome facilitates the identification of novel growth-associated genes

Uengwetwanit, Pootakham, Nookaew et al. 2020

Preprint

The black tiger shrimp (Penaeus monodon) is one of the most prominent farmed crustacean species with an average annual global production of 0.5 million tons in the last decade. To ensure sustainable and profitable production through genetic selective breeding programs, several research groups have attempted to generate a reference genome using short-read sequencing technology. However, the currently available assemblies lack the contiguity and completeness required for accurate genome annotation due to the highly repetitive nature of the genome and technical difficulty in extracting high-quality, high-molecular weight DNA in this species. Here, we report the first chromosome-level whole-genome assembly of P. monodon. The combination of long-read Pacific Biosciences (PacBio) and long-range Chicago and Hi-C technologies enabled a successful assembly of this first high-quality genome sequence. The final assembly covered 2.39 Gb (92.3% of the estimated genome size) and contained 44 pseudomolecules, corresponding to the haploid chromosome number. Repetitive elements occupied a substantial portion of the assembly (62.5%), highest of the figures reported among crustacean species. The availability of this high-quality genome assembly enabled the identification of novel genes associated with rapid growth in the black tiger shrimp through the comparison of hepatopancreas transcriptome of slow-growing and fast-growing shrimps. The results highlighted several gene groups involved in nutrient metabolism pathways and revealed 67 newly identified growth-associated genes. Our high-quality genome assembly provides an invaluable resource for accelerating the development of improved shrimp strain in breeding programs and future studies on gene regulations and comparative genomics.


“…Despite much interest [8,14-16], the most recent and commonly cited census of protein TRs summarizing repeats in the curated protein knowledge base UniProtKB/Swiss-Prot [17] dates back two decades [18]. Since then this popular data bank has grown more than seven-fold (Figure S1).…”

Section: Comprehensive Annotation Of Proteomic Tandem Repeats (mentioning)

confidence: 99%

“…This allows our study to provide an unprecedented detail of the universe of protein TRs. We respond to the call [14] and apply the state-of-the-art method for TR detection followed by filtering through a sound statistical framework.…”

Section: Comprehensive Annotation Of Proteomic Tandem Repeats (mentioning)

confidence: 99%

A New Census of Protein Tandem Repeats and Their Relationship with Intrinsic Disorder

Delucchi, Schaper, Sachenkova et al. 2020

Genes

Self Cite

Protein tandem repeats (TRs) are often associated with immunity-related functions and diseases. Since that last census of protein TRs in 1999, the number of curated proteins increased more than seven-fold and new TR prediction methods were published. TRs appear to be enriched with intrinsic disorder and vice versa. The significance and the biological reasons for this association are unknown. Here, we characterize protein TRs across all kingdoms of life and their overlap with intrinsic disorder in unprecedented detail. Using state-of-the-art prediction methods, we estimate that 50.9% of proteins contain at least one TR, often located at the sequence flanks. Positive linear correlation between the proportion of TRs and the protein length was observed universally, with Eukaryotes in general having more TRs, but when the difference in length is taken into account the difference is quite small. TRs were enriched with disorder-promoting amino acids and were inside intrinsically disordered regions. Many such TRs were homorepeats. Our results support that TRs mostly originate by duplication and are involved in essential functions such as transcription processes, structural organization, electron transport and iron-binding. In viruses, TRs are found in proteins essential for virulence.


“…Widely used sequencing technologies, such as Sanger, 454 and Illumina, have played a pivotal part in these advancements. However, the limitations of these technologies, namely their trouble reading through repetitive regions and their short read outputs, have led to assembly artifacts that are currently widely distributed in genome and proteome databases [43]. A number of protozoan parasite genomes have been recently revisited using third generation sequencing technologies.…”

Section: Discussion (mentioning)

confidence: 99%

Reevaluation of the Toxoplasma gondii and Neospora caninum genomes reveals misassembly, karyotype differences and chromosomal rearrangements

Berná, Marquez, Cabrera et al. 2020

Preprint

Neospora caninum primarily infects cattle causing abortions with an estimated impact of a billion dollars on worldwide economy, annually. However, the study of its biology has been unheeded by the established paradigm that it is virtually identical to its close relative, the widely studied human pathogen, Toxoplasma gondii. By revisiting the genome sequence, assembly and annotation using third generation sequencing technologies, here we show that the N. caninum genome was originally incorrectly assembled under the presumption of synteny with T. gondii. We show that major chromosomal rearrangements have occurred between these species. Importantly, we show that chromosomes originally annotated as ChrVIIb and VIII are indeed fused, reducing the karyotype of both N. caninum and T. gondii to 13 chromosomes. We reannotate the N. caninum genome, revealing over 500 new genes. We sequence and annotate the non-photosynthetic plastid and mitochondrial genomes, and show that while apicoplast genomes are virtually identical, high levels of gene fragmentation and reshuffling exists between species and strains. Our results correct assembly artifacts that are currently widely distributed in the genome database of N. caninum and T. gondii, but more importantly, highlight the mitochondria as a previously oversighted source of variability and pave the way for a change in the paradigm of synteny, encouraging rethinking the genome as basis of the comparative unique biology of these pathogens.
