Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments

Darvish, Mitra; Seiler, Enrico; Mehringer, Svenja; Rahn, René; Reinert, Knut

doi:10.1093/bioinformatics/btac492

Cited by 6 publications

(5 citation statements)

References 29 publications

(65 reference statements)


“…Means of k-mer counts best correlated with qPCR abundance and transcript-per-million (TPM) measured from RNA-seq reads by Kallisto [10] (Fig 2A,B, Fig S2), while sums of k-mer counts best correlated with raw RNA-seq counts (Fig 2C, Fig S2). Correlation coefficients (CC) with Kallisto counts were around 0.8, in line with previous reports [6]. We found that quantification accuracy could be substantially improved by masking query k-mers with multiple instances in the human genome (Methods).…”

Section: Results

supporting

confidence: 89%
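The counting scheme described in the statement above (mean versus sum of k-mer counts, with optional masking of ambiguous k-mers) can be sketched as follows. This is a minimal illustration and not the citing authors' code: the function names, the toy reads and the `masked` set are all hypothetical, and a real index would store counts far more compactly than a Python `Counter`.

```python
from collections import Counter
from statistics import mean


def kmers(seq, k):
    """Return the list of overlapping k-mers of seq (hypothetical helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def quantify(query, sample_counts, k, masked=frozenset()):
    """Return (mean, sum) of the sample counts over the query's k-mers.

    k-mers listed in `masked` (for example, k-mers with multiple instances
    in the genome) are skipped, mirroring the masking idea in the quote.
    """
    values = [sample_counts.get(km, 0)
              for km in kmers(query, k) if km not in masked]
    if not values:
        return 0.0, 0
    return mean(values), sum(values)


# Toy data, invented for illustration only.
reads = ["ACGTAC", "CGTACG", "ACGTAC"]
counts = count_kmers(reads, k=4)

mean_count, sum_count = quantify("ACGTACG", counts, k=4)
masked_mean, masked_sum = quantify("ACGTACG", counts, k=4, masked={"CGTA"})
```

In this sketch the mean plays the role of a length-normalised abundance (comparable in spirit to TPM), while the sum grows with query length in the same way raw read counts do.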

“…To determine the optimal counting scheme, we used the SEQC/MAPQC dataset in which the abundance of 1000 transcripts was evaluated in 16 samples both by qPCR and Illumina RNA-seq [9]. Means of k-mer counts best correlated with qPCR abundance and transcript-per-million (TPM) measured from RNA-seq reads by Kallisto [10] Correlation coefficients (CC) with Kallisto counts were around 0.8, in line with previous reports [6]. We found that quantification accuracy could be substantially improved by masking query k-mers with multiple instances in the human genome (Methods).…”

Section: Accuracy of RNA Expression Measurement

mentioning

confidence: 54%

“…Three recent tools enable quantitative queries in large sequence sets. Needle [6] implements multiple interleaved Bloom filters and sketches of minimisers, which enable storing counts in a semi-quantitative way. Metagraph [7] uses an optimized De Bruijn Graph structure, enabling to store either presence-absence or count information.…”

Section: Introduction

mentioning

confidence: 99%
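To make the two ideas named in the statement above more concrete (minimiser sketches and semi-quantitative storage), here is a small sketch. It is an assumption-laden illustration, not Needle's implementation: it orders k-mers lexicographically rather than by hash, the thresholds are invented, and the interleaved Bloom filters that would hold each level are left out entirely.

```python
from bisect import bisect_right


def minimisers(seq, k, w):
    """Return the set of minimisers of seq.

    A minimiser is the smallest k-mer (here: lexicographically, an
    assumption made for readability) in each window of w consecutive k-mers.
    """
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kms:
        return set()
    if len(kms) < w:
        return {min(kms)}
    return {min(kms[i:i + w]) for i in range(len(kms) - w + 1)}


def expression_level(count, thresholds):
    """Map a count to the index of the highest threshold it reaches.

    Returns -1 when the count is below every threshold. Storing only this
    index, instead of the exact count, is what "semi-quantitative" is taken
    to mean in this sketch.
    """
    return bisect_right(thresholds, count) - 1


# Hypothetical thresholds, sorted in ascending order.
thresholds = [1, 4, 16]

sketch = minimisers("ACGTACGT", k=3, w=3)
level = expression_level(7, thresholds)
```

Keeping only one representative k-mer per window shrinks the set of keys to store, and keeping only a level index per key bounds the number of filters needed to one per threshold.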


Exploring a large cancer cell line RNA-sequencing dataset with k-mers

Bessière, Xue, Guibert et al. 2024

Preprint


Analyzing the immense diversity of RNA isoforms in large RNA-seq repositories requires laborious data processing using specialized tools. Indexing techniques based on k-mers have previously been effective at searching for RNA sequences across thousands of RNA-seq libraries but falling short of enabling direct RNA quantification. We show here that RNAs queried in the form of k-mer sets can be quantified in seconds, with a precision akin to that of conventional RNA quantification methods. We showcase several applications by exploring an index of the Cancer Cell Line Encyclopedia (CCLE) collection consisting of 1019 RNA-seq samples. Non-reference RNA sequences such as RNAs harboring driver mutations and fusions, splicing isoforms or RNAs derived from repetitive elements, can be retrieved with high accuracy. Moreover, we show that k-mer indexing offers a powerful means to reveal variant RNAs induced by specific gene alterations, for instance in splicing factors. A web server allows public queries in CCLE and other indexes: https://transipedia.fr. Code is provided to allow users to set up their own server from any RNA-seq dataset.


“…Indexes capable of taking abundance into account are an ongoing major challenge, which has seen significant contributions in recent years, see for example [10, 7, 4]. The method we propose in this article is based solely on the presence/absence of k-mers; an interesting prospect for future work would be to take into account abundances – i.e.…”

Section: Discussion

mentioning

confidence: 99%

Constrained enumeration of k-mers from a collection of references with metadata

Ingels, Martayan, Salson et al. 2024

Preprint


While recent developments in k-mers indexing methods have opened up many new possibilities, they still have limitations in handling certain types of queries, such as identifying k-mers present in one dataset but absent in another. In this article, we present a framework for efficiently enumerating all k-mers within a collection of references that satisfy constraints related to their metadata tags. Our method involves simplifying the query beforehand to reduce computation delays; the construction of the solution itself is carried out using CBL, a recent data structure specifically dedicated to the optimised computation of set operations on k-mer sets. We provide an implementation to our solution and we demonstrate its capabilities using real genomic data (microbial and RNA-seq), and show examples of use cases to identify k-mers of biological interest.

Funding: This work is funded by a grant from the French ANR: Full-RNA ANR-22-CE45-0007. Igor Martayan is supported by a doctoral grant from ENS Rennes.
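The kind of query this abstract mentions, k-mers present in one dataset but absent in another, reduces to a set difference. The sketch below uses plain Python sets as a stand-in; it does not use or imitate the CBL data structure, and the dataset names and sequences are invented for illustration.

```python
def kmer_set(sequences, k):
    """Collect the distinct k-mers of a collection of sequences."""
    return {seq[i:i + k]
            for seq in sequences
            for i in range(len(seq) - k + 1)}


# Hypothetical datasets with hypothetical metadata tags.
tagged = {
    "tumour": kmer_set(["ACGTAC"], k=3),
    "normal": kmer_set(["CGTACC"], k=3),
}

# k-mers carrying the "tumour" tag but not the "normal" tag.
tumour_specific = tagged["tumour"] - tagged["normal"]

# k-mers carrying both tags.
shared = tagged["tumour"] & tagged["normal"]
```

Richer constraints over many tags would combine unions, intersections and differences in the same way, which is why simplifying the query before evaluating it can save work.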


“…Bloom filters. Sketching approaches such as sourmash [20], or Needle [9] typically suffer from high false negative rates when short sequences are queried, and are thus out of the scope of this work. Methods based on exact representations (e.g.…”

mentioning

confidence: 99%

kmindex and ORA: indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets

Lemane, Lezzoche, Lecubin et al. 2023

Preprint


Despite their wealth of biological information, public sequencing databases are largely underutilized. One cannot efficiently search for a sequence of interest in these immense resources. Sophisticated computational methods such as approximate membership query data structures allow searching for fixed-length words (k-mers) in large datasets. Yet they face scalability challenges when applied to thousands of complex sequencing experiments. In this context we propose kmindex, a new approach that uses inverted indexes based on Bloom filters. Thanks to its algorithmic choices and its fine-tuned implementation, kmindex offers the possibility to index thousands of highly complex metagenomes into an index that answers sequences queries in the tenth of a second. Index construction is one order of magnitude faster than previous approaches, and query time is two orders of magnitude faster. Based on Bloom filters, kmindex achieves negligible false positive rates, below 0.01% on average. Its average false positive rate is four orders of magnitude lower than existing approaches, for similar index sizes. It has been successfully used to index 1,393 complex marine seawater metagenome samples of raw sequences from the Tara Oceans project, demonstrating its effectiveness on large and complex datasets. This level of scaling was previously unattainable. Building on the kmindex results, we provide a public web server named "Ocean Read Atlas" (ORA) at https://ocean-read-atlas.mio.osupytheas.fr/ that can answer queries against the entire Tara Oceans dataset in real-time. kmindex is open-source software available at https://github.com/tlemane/kmindex.
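For readers unfamiliar with the Bloom filters this abstract builds on, here is a minimal sketch of one used as an approximate membership structure for k-mers. It is a textbook construction under stated assumptions, not kmindex's implementation: the class, its parameters and the salted BLAKE2b hashing scheme are choices made only for this example.

```python
import hashlib


class BloomFilter:
    """Textbook Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with the hash index.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Index the 4-mers of a toy sequence (invented for illustration).
sequence = "ACGTACGGTCA"
indexed = [sequence[i:i + 4] for i in range(len(sequence) - 3)]

bf = BloomFilter(n_bits=1024, n_hashes=3)
for kmer in indexed:
    bf.add(kmer)

all_found = all(kmer in bf for kmer in indexed)
```

Every inserted k-mer is always reported as present; a k-mer that was never inserted is usually reported as absent, but can occasionally collide with set bits, which is the false positive rate that the number of bits and hash functions controls.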

