2013
DOI: 10.1093/bib/bbt088

Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies

Abstract: High-throughput sequencing technologies produce large collections of data, mainly DNA sequences with additional information, requiring the design of efficient and effective methodologies for both their compression and storage. In this context, we first provide a classification of the main techniques that have been proposed, according to three specific research directions that have emerged from the literature and, for each, we provide an overview of the current techniques. Finally, to make this review useful to…

Cited by 52 publications (34 citation statements)
References: 85 publications
“…The first algorithms from 2009 were soon followed by more mature proposals, which will be presented below, focusing on their indexing capabilities. More information on genome data compressors and indexes can be found in the recent surveys (Vyverman et al, 2012; Deorowicz and Grabowski, 2013; Giancarlo et al, 2013). Mäkinen et al (2010) added index functionalities to compressed DNA sequences: display (which can also be called the random access functionality) returning the substring specified by its start and end position, count telling the number of times the given pattern occurs in the text, and locate listing the positions of the pattern in the text.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
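
The three index operations named in the statement above (display, count, locate) can be illustrated on a plain, uncompressed string. This is only a naive sketch of what each operation returns, assuming 0-based positions and an inclusive end position; it is not the compressed index of Mäkinen et al., and the function names are chosen here for illustration.

```python
def display(text: str, start: int, end: int) -> str:
    """Random access: return the substring from start to end (0-based, inclusive)."""
    return text[start:end + 1]


def locate(text: str, pattern: str) -> list[int]:
    """List every 0-based start position of pattern in text, overlaps included."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        # Resume one character later so overlapping occurrences are found.
        i = text.find(pattern, i + 1)
    return positions


def count(text: str, pattern: str) -> int:
    """Number of (possibly overlapping) occurrences of pattern in text."""
    return len(locate(text, pattern))
```

A compressed index would answer the same queries without scanning the whole text; the linear scans here only pin down the expected results.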
“…The modern high-throughput technologies produce high amounts of sequence collections of data, and several methodologies have been proposed for their efficient storage and analysis [34,35]. Recently, approaches based on MapReduce and big data technologies have been proposed (see, e.g., [36], and [3] for a complete review on this topic).…”
Section: Big Data Based Approaches For the Analysis Of Biological Seq…
Citation type: mentioning
confidence: 99%
“…An important issue in this context is the computation of k-mer statistics, that becomes challenging when sets of sequences at a genomic scale are involved. Due to the importance of this task in several applications (e.g., genome assembly [37] and alignment-free sequence analysis [34,35]) many methods that use shared-memory multi processor architectures or distributed computing have been proposed. The basic pattern followed by most of these methods is to maintain a shared data structure (typically, a hash table) to be updated according to the k-mers extracted from a collection of input files by one or more concurrent tasks.…”
Section: Big Data Based Approaches For the Analysis Of Biological Seq…
Citation type: mentioning
confidence: 99%
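
The basic pattern described in the statement above, a shared hash table updated by concurrent tasks, can be sketched as follows. The sketch makes several assumptions that are not in the source: sequences are already in memory rather than read from input files, threads stand in for the concurrent tasks, and each task counts locally before merging under a lock to limit contention. The name `count_kmers` and its parameters are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def count_kmers(sequences, k, workers=4):
    """Count k-mers across sequences using one shared table and several tasks."""
    if k < 1:
        raise ValueError("k must be at least 1")
    shared = Counter()  # the shared hash table
    lock = Lock()       # guards every update to the shared table

    def task(seq):
        # Count this sequence's k-mers locally (assumption: one task per sequence).
        local = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        # Merge into the shared table while holding the lock.
        with lock:
            shared.update(local)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all tasks and re-raises any worker exception.
        list(pool.map(task, sequences))
    return shared
```

For example, `count_kmers(["ACGTAC", "GTACG"], 2)` tallies the 2-mers of both sequences into a single table. At genomic scale, distributed or shared-memory tools would replace the in-memory list and the single lock used here.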
“…In these file formats, other information, such as identifiers and quality scores, are added to the raw genomic or protein sequences. It is worth noting that there are other formats for storing biological data, such as Illumina Export format, but they were developed for targeting a specific technology [4].…”
Section: File Formats
Citation type: mentioning
confidence: 99%
“…Also, the costs associated with the storage, processing and transmission of HTS data are higher compared to sequence generation, that makes the situation more complicated [1]. Compression is a solution that is able to overcome these challenges, by reducing the storage size and processing costs, such as I/O bandwidth, as well as increasing transmission speed [2][3][4].…”
Section: Introduction
Citation type: mentioning
confidence: 99%