Full-length transcriptome assembly from RNA-Seq data without a reference genome

Grabherr, Manfred; Haas, Brian J.; Yassour, Moran; Levin, Joshua Z.; Thompson, Dawn; Amit, Ido; Adiconis, Xian; Lin, Fan; Raychowdhury, Raktima; Zeng, Qiandong; Chen, Zehua; Mauceli, Evan; Hacohen, Nir; Gnirke, Andreas; Rhind, Nick; Di Palma, Federica; Birren, Bruce W.; Nusbaum, Chad; Lindblad‐Toh, Kerstin; Friedman, Nir; Regev, Aviv

doi:10.1038/nbt.1883

Cited by 16,767 publications

(12,476 citation statements)

References 34 publications

Supporting

Mentioning

12,434

Contrasting

Unclassified


“…Read pairs where both reads were ultimately of at least 36 base pairs in length following this quality control process were retained and used for subsequent analyses. Trinity (v.r20140717) 44 was used to assemble quality filtered data. Assembled transcripts were aligned to our genome sequence using NCBI blastn v.2.2.30+ with an e-value cut-off of 1 × 10⁻⁵.…”

Section: Methods (mentioning)

confidence: 99%

Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum

VanBuren, Bryant, Edger et al. 2015, Nature


Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly 1. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE) 2. Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a 'near-complete' draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. The Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.

The genomes of Arabidopsis 3, rice 4, poplar, grape and Sorghum 5 were first sequenced using high-quality and reiterative Sanger-based approaches producing a series of 'gold standard' reference genomes. The advent of next-generation sequencing (NGS) technologies reduced costs of sequencing substantially, which has enabled sequencing of over 100 plant genomes 1. The quality of plant genome assemblies depends on genome size, ploidy, heterozygosity and sequence coverage, but most NGS-based genomes have on the order of tens of thousands of short contigs distributed in thousands of scaffolds. The short read lengths of NGS, inherent biases and non-random sequencing errors have resulted in highly fragmented draft genome assemblies that are not complete, which means they are missing biologically meaningful sequences including entire genes, regulatory regions, transposable elements, centromeres, telomeres and haplotype-specific structural variations. It is becoming clear from ENCODE projects that complete genomes are needed to better understand the importance of the non-coding regions of genomes 2.

More than 40% of calories consumed by humans are derived from grasses, and the grass family (Poaceae) is arguably the most important plant family with regard to global food security 6. The size and complexity of most grass genomes has challenged progress in gene discovery and comparative genomics, although draft genomes are now available for most agriculturally important grasses 1. The largest genome assemblies, such as maize (2,300 megabases (Mb)) 7, barley (5,100 Mb) 8 and wheat (hexaploid, 1...


“…We confirmed our list had over 90% overlap with the curated ortholog list on Flybase (St Pierre et al., 2014; which covered nine of our species). To improve the quality of ortholog sequences, Trinity (Grabherr et al., 2011) was also used to de novo assemble the transcriptomes from the RNA‐seq data (Table S1). Poorly annotated Species CDS (e.g., those without proper start or stop codons) were replaced by Trinity (Grabherr et al., 2011) transcripts where applicable, and we ensured that the final list of orthologs contained at least 20% conserved blocks in the multiple sequence alignment.…”

Section: Methods (mentioning)

confidence: 99%

“…To improve the quality of ortholog sequences, Trinity (Grabherr et al., 2011) was also used to de novo assemble the transcriptomes from the RNA‐seq data (Table S1). Poorly annotated Species CDS (e.g., those without proper start or stop codons) were replaced by Trinity (Grabherr et al., 2011) transcripts where applicable, and we ensured that the final list of orthologs contained at least 20% conserved blocks in the multiple sequence alignment. The final ortholog sequences were mapped back to their respective genomes with GMAP (Wu & Watanabe, 2005) to generate customized GFF (General Feature Format) files.…”

Section: Methods (mentioning)

confidence: 99%

Comparative transcriptomics across 14 Drosophila species reveals signatures of longevity

Avanesov, Porter et al. 2018, Aging Cell


Summary: Lifespan varies dramatically among species, but the biological basis is not well understood. Previous studies in model organisms revealed the importance of nutrient sensing, mTOR, NAD/sirtuins, and insulin/IGF1 signaling in lifespan control. By studying life‐history traits and transcriptomes of 14 Drosophila species differing more than sixfold in lifespan, we explored expression divergence and identified genes and processes that correlate with longevity. These longevity signatures suggested that longer‐lived flies upregulate fatty acid metabolism, downregulate neuronal system development and activin signaling, and alter dynamics of RNA splicing. Interestingly, these gene expression patterns resembled those of flies under dietary restriction and several other lifespan‐extending interventions, although on the individual gene level, there was no significant overlap with genes previously reported to have lifespan‐extension effects. We experimentally tested the lifespan regulation potential of several candidate genes and found no consistent effects, suggesting that individual genes generally do not explain the observed longevity patterns. Instead, it appears that lifespan regulation across species is modulated by complex relationships at the system level represented by global gene expression.


“…Short fragments (K‐mers) were obtained by applying Trinity software (Grabherr et al. 2011) to cleave clean reads. The K‐mers were configured into long segments (contigs), and then the overlap between these contigs was utilized to obtain fragment collections (components).…”

Section: Methods (mentioning)

confidence: 99%

Transcriptome response to temperature stress in the wolf spider Pardosa pseudoannulata (Araneae: Lycosidae)

Xiao, Wang, Cao et al. 2016, Ecology and Evolution


The wolf spider Pardosa pseudoannulata is a dominant predator in paddy ecosystem and an important biological control agent of rice pests. Temperature represents a primary factor influencing its biology and behavior, although the underlying molecular mechanisms remain unknown. To understand the response of P. pseudoannulata to temperature stress, we performed comparative transcriptome analyses of spider adults exposed to 10°C and 40°C for 12 h. We obtained 67,725 assembled unigenes, 21,765 of which were annotated in P. pseudoannulata transcriptome libraries, and identified 905 and 834 genes significantly up‐ or down‐regulated by temperature stress. Functional categorization revealed the differential regulation of transcription, signal transduction, and metabolism processes. Calcium signaling pathway and metabolic pathway involving respiratory chain components played important roles in adapting to low temperature, whereas at high temperature, oxidative phosphorylation and amino acid metabolism were critical. Differentially expressed ribosomal protein genes contributed to temperature stress adaptation, and heat shock genes were significantly up‐regulated. This study represents the first report of transcriptome identification related to the Araneae species in response to temperature stress. These results will greatly facilitate our understanding of the physiological and biochemical mechanisms of spiders in response to temperature stress.
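The citation statement from Xiao et al. above describes reads being cleaved into short fragments (k-mers) before contigs are built. As a rough illustration of that first step only, the following is a minimal Python sketch, not Trinity's actual implementation. The function name `kmer_counts`, the toy reads, and the choice of k = 4 are all assumptions made for this example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across a list of reads.

    Hypothetical helper for illustration; real assemblers also handle
    reverse complements, sequencing errors, and memory-efficient storage.
    """
    counts = Counter()
    for read in reads:
        # A read of length n yields n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy reads (made up for this sketch); k is kept small so the
# overlaps are easy to see by eye.
reads = ["ACGTAC", "CGTACG"]
counts = kmer_counts(reads, k=4)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

K-mers shared between reads (here `CGTA` and `GTAC`) are the overlaps an assembler can follow to extend a sequence into a longer contig.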
