2020
DOI: 10.1101/gr.260604.119

Data structures based on k-mers for querying large collections of sequencing data sets

Abstract: High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of th…

Cited by 83 publications (89 citation statements)
References 69 publications

“…The lesser detrimental effects of the maize B suggests that it is generally devoid of dosage-sensitive genes. In vertebrates, three different homologous chromosome pairs have evolved into heteromorphic sex chromosomes with the loss of many genes ( 56 ). However, dosage-sensitive genes are retained between the chromosome pair, illustrating selection against a two-fold change in dosage.…”
Section: Discussion | mentioning
confidence: 99%
“…Moreover, many other requests could be easily considered for annotated gene exploration like gene co-expression, or to compensate the lack of completeness in genomic or transcriptomic references to cover unreferenced RNA diversity and search for new spliced events, intron retention or new transcript categories including circular RNAs. In order to increase the potential of the k -mer approach, access to very large-scale datasets like SRA level (164 000 human samples) could be considered with efficient indexing structure development ( 43 ).…”
Section: Discussion | mentioning
confidence: 99%
“…K-mer-based methods have been applied for sequence comparison for error correction (1), genome assembly (2,3), metagenomic (4) and chromosome (5) sequence classification, sequence clustering (6), database search (7,8), structural variation detection (9,10), transcriptome analysis (11,12), and many other applications. Because of this widespread use, many data structures and techniques for efficiently storing, querying, and counting k-mers have been proposed (see (13) for a review). While k-mers has proven to be practical in several sequence comparison problems, they are sensitive to mutations.…”
Section: Introduction | mentioning
confidence: 99%
“…K-mer-based methods have been applied for sequence comparison for error correction (1), genome assembly (2, 3), metagenomic (4) and chromosome (5) sequence classification, sequence clustering (6), database search (7, 8), structural variation detection (9–11), transcriptome analysis (12, 13), DNA barcoding of species (14), estimation of genome size (15), identification of biomarkers (16), and many other applications. Because of the widespread use of k-mers, many data structures and techniques for efficiently storing and querying k-mers have been proposed (see (17) for a review).…”
Section: Introduction | mentioning
confidence: 99%