De novo transcriptome assembly is a powerful tool, widely used over the last decade for making evolutionary inferences. However, it relies on two implicit, untested assumptions: that the assembled transcriptome is an unbiased, if incomplete, representation of the underlying expressed transcriptome, and that expression estimates from the assembly are good, if noisy, approximations of the relative abundance of expressed transcripts. Using publicly available data for model organisms, we demonstrate that, across assembly algorithms, species, and data sets, these assumptions are consistently violated. Using standard filtering approaches, coverage of annotated genes by transcriptome assemblies falls far below complete coverage, even at the

less appropriate for studies that seek to understand patterns of genetic variation or gene expression across populations or closely related species. Therefore, we focus on methodological considerations for this class of investigations.

2. MATERIALS AND METHODS
2.1 Reference genomes
To benchmark de novo transcriptome assemblies and to compare assembly-based and reference-based expression estimates, we downloaded genome and GTF annotation files from ENSEMBL for the following organisms: house mouse, Mus musculus C57BL/6J (GRCm38); clawed frog, Xenopus tropicalis (JGI_4.2); pufferfish, Tetraodon nigroviridis (TETRAODON8); and fly, Drosophila melanogaster (BDGP6).
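
As an illustration of how a GTF annotation can serve as the reference set of annotated genes for benchmarking, the following minimal Python sketch collects gene identifiers from GTF records. It relies only on the standard GTF layout (nine tab-separated columns, feature type in column 3, attributes such as gene_id in column 9). The function name annotated_gene_ids and the example records are hypothetical and are not taken from the pipeline described here.

```python
import re
from typing import Iterable, Set

# Matches the gene_id attribute in column 9 of a GTF record.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def annotated_gene_ids(gtf_lines: Iterable[str], feature: str = "gene") -> Set[str]:
    """Collect gene_id values from GTF records of the given feature type.

    Hypothetical helper for illustration; not part of the original methods.
    """
    gene_ids: Set[str] = set()
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A valid GTF record has nine columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != feature:
            continue
        match = GENE_ID_RE.search(fields[8])
        if match:
            gene_ids.add(match.group(1))
    return gene_ids


# Made-up records, for illustration only (not real annotation data).
example_gtf = [
    "#!genome-build example\n",
    '1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_name "a";\n',
    '1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE0001"; transcript_id "T1";\n',
    '2\tsrc\tgene\t50\t400\t.\t-\t.\tgene_id "GENE0002";\n',
]

print(sorted(annotated_gene_ids(example_gtf)))
```

With a real annotation, the same function could be applied to an open file handle, and the resulting set compared against the genes recovered by an assembly.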

2.2 RNA-seq data
Brain tissue is both transcriptionally complex and expected to express a large proportion of an organism's overall transcriptional profile. For this reason, the bulk of our analyses focus on de novo assemblies of data generated in experiments involving mouse (Mus spp.) brain. These data sets include the Mus musculus (C57BL/6J) dendritic cell data used in the original Trinity paper (MDC), a pool of six whole brains from albino inbred Mus (BALB/c), and 8-sample pools of whole brain samples from wild M. musculus domesticus from Massif Central, France (FRA), Iran (IRN), Kazakhstan (KZK), and Germany (DEU). To assess the generality of particular results with respect to assembly composition, we also generated assemblies for pufferfish (Tetraodon nigroviridis) whole brain, clawed frog (Xenopus tropicalis) kidney, and fly (Drosophila melanogaster) heads. Data set SRA accessions, sequencing strategy, and depth are summarized in Supplementary Table 1. All libraries except for MDC were sequenced on an Illumina HiSeq 2000; MDC was sequenced on an Illumina GAII.

2.3 Short read processing
After an initial assessment of sequence reads with FASTQC