Cookiecutter: a tool for kmer-based read filtering and extraction

Starostina, Ekaterina; Tamazian, Gaik; Dobrynin, Pavel; O’Brien, Stephen J.; Komissarov, Aleksey

doi:10.1101/024679

Cited by 24 publications

(24 citation statements)

References 11 publications


“…We used the Jellyfish software [74] for computing 23-mer frequencies and choosing a subset of 23-mers with coverage greater than 1,000. We used the Cookiecutter package [103] for extraction of raw reads containing subset of 23-mers with coverage greater than 1,000. The selected reads were used to manually assemble tandem repeat monomer consensus sequences with the help of the targeted de novo short-read assembler PRICE [104].…”

Section: Methods (mentioning)

confidence: 99%

Chromosomal-Level Assembly of the Asian Seabass Genome Using Long Sequence Reads and Multi-layered Scaffolding

et al. 2016

Self Cite


We report here the ~670 Mb genome assembly of the Asian seabass (Lates calcarifer), a tropical marine teleost. We used long-read sequencing augmented by transcriptomics, optical and genetic mapping along with shared synteny from closely related fish species to derive a chromosome-level assembly with a contig N50 size over 1 Mb and scaffold N50 size over 25 Mb that span ~90% of the genome. The population structure of L. calcarifer species complex was analyzed by re-sequencing 61 individuals representing various regions across the species’ native range. SNP analyses identified high levels of genetic diversity and confirmed earlier indications of a population stratification comprising three clades with signs of admixture apparent in the South-East Asian population. The quality of the Asian seabass genome assembly far exceeds that of any other fish species, and will serve as a new standard for fish genomics.
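The read-extraction step described in the citation statement above (count k-mers, keep those above a coverage threshold, pull out the reads that contain them) can be illustrated with a minimal Python sketch. This is not the implementation used by Jellyfish or Cookiecutter; the function names, the toy reads, and the small values of `k` and `min_count` are assumptions chosen only so the example runs quickly. The study itself reports 23-mers with coverage greater than 1,000.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, no canonicalization)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(reads, k):
    """Count k-mer occurrences across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def select_frequent(counts, min_count):
    """Keep k-mers whose count is strictly greater than min_count."""
    return {kmer for kmer, n in counts.items() if n > min_count}


def extract_reads(reads, selected, k):
    """Return reads that contain at least one selected k-mer."""
    return [r for r in reads if any(km in selected for km in kmers(r, k))]


# Toy data (hypothetical): two reads share a repeated motif, two do not.
reads = ["ACGTACGTAC", "TTTTGGGGCC", "GTACGTACGT", "CCCCAAAATT"]
k = 4
counts = count_kmers(reads, k)
frequent = select_frequent(counts, min_count=3)
hits = extract_reads(reads, frequent, k)
```

With real data the counting step would be delegated to a dedicated k-mer counter, since a Python `Counter` over every 23-mer of a sequencing run would not fit in memory.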


“…Spa-1, a total of 52,358,830 paired-end reads were generated, equating to approximately 13.09 Gb of sequence data. For five southern solenodons (S. p. woodi), an average of 151,783,327 paired-end reads were generated, equating to an average of 15. […] were also assembled using a Bruijn graph based algorithm that considers coverage, based on the software Cookiecutter (Starostina et al 2015). Both produced identical results outside of the control region and open reading frames were present in all coding regions which might otherwise indicate assembly of numts (Lopez et al 1994).…”

Section: Mitogenome Sequence Generation Assembly and Annotation (mentioning)

confidence: 99%

Mitogenomic sequences support a north–south subspecies subdivision within Solenodon paradoxus

Brandt, Grigorev, Afanador-Hernández et al. 2016

Mitochondrial DNA Part A

Self Cite


Solenodons are insectivores found only in Hispaniola and Cuba, with a Mesozoic divergence date versus extant mainland mammals. Solenodons are the oldest lineage of living eutherian mammal for which a mitogenome sequence has not been reported. We determined complete mitogenome sequences for six Hispaniolan solenodons (Solenodon paradoxus) using next-generation sequencing. The solenodon mitogenomes were 16,454-16,457 bp long and carried the expected repertoire of genes. A mitogenomic phylogeny confirmed the basal position of solenodons relative to shrews and moles, with solenodon mitogenomes estimated to have diverged from those of other mammals ca. 78 Mya. Control region sequences of solenodons from the northern (n = 3) and southern (n = 5) Dominican Republic grouped separately in a network, with F = 0.72 (p = 0.036) between north and south. This regional genetic divergence supports previous morphological and genetic reports recognizing northern (S. p. paradoxus) and southern (S. p. woodi) subspecies in need of separate conservation plans.


“…Machine learning will help to select which measure of sequence complexity is more predictive of read alignment performance. Some read trimming, masking, or filtering software uses sequence complexity Porter and Zhang (2017); Starostina et al (2015). The bisulfite software BatMeth has a low complexity filter using Shannon entropy Lim et al (2012), and BLAST can use a sequence complexity mask with the DUST score Morgulis et al (2006); Altschul et al (1990).…”

Section: Related Work and Motivation (mentioning)

confidence: 99%

Using machine learning to predict DNA read alignment quality

Porter 2018

Preprint


An empirical understanding of how DNA read features affect read mapping and alignment quality could be useful in designing better read mapping and alignment software, read trimmers, and sequence masks. Many programs appear to use arbitrarily chosen features that are putatively relevant to DNA alignment quality. Machine learning gives a ready way to empirically assess a variety of features and rank them according to their importance. Sequence complexity features such as run length distribution, DUST, and entropy and quality measures from the DNA read data were used to predict read mapping quality on Ion Torrent and Illumina data sets using both bisulfite-treated and untreated short DNA reads. Surprisingly, run length distribution mean and variance did as well or better than DUST and entropy even though several programs use DUST and entropy. Predictive accuracy of the models had F1-scores between 0.5-0.95; thus, the feature set is useful for understanding alignment quality.
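As a rough illustration of the entropy-based complexity measure mentioned in the citation statement and abstract above, the following Python sketch computes the Shannon entropy of a read's base composition and flags low-complexity reads. It is not the filter used by BatMeth, Cookiecutter, or any other named tool; the function names and the `min_entropy` threshold of 1.0 bit are assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy, in bits per base, of the base composition of seq."""
    if not seq:
        return 0.0
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def is_low_complexity(seq, min_entropy=1.0):
    """Flag a read whose entropy falls below a (hypothetical) threshold."""
    return shannon_entropy(seq) < min_entropy


# A homopolymer-like read is flagged; a read using all four bases evenly is not.
flagged = is_low_complexity("AAAAAAAT")
kept = not is_low_complexity("ACGTACGT")
```

Entropy over single bases ranges from 0 bits (one base only) to 2 bits (all four bases equally frequent), so any threshold has to sit inside that range.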
