2007
DOI: 10.1021/ci700200n

Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval

Abstract: Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here we combine statistical models of fingerprints with integer entr…
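
The "simple modulo procedure" mentioned in the abstract can be illustrated with a minimal sketch. The function name `fold_fingerprint` and the example indices are hypothetical, chosen only for illustration, and the abstract does not spell out the exact folding rule; the sketch assumes a binary fingerprint whose on-bit indices are wrapped modulo the target length and combined with OR.

```python
def fold_fingerprint(on_bits, folded_length):
    """Fold a sparse binary fingerprint to a fixed length (assumed OR-folding).

    on_bits: iterable of non-negative indices of the set bits in the long vector.
    folded_length: length of the short vector to produce.
    """
    folded = [0] * folded_length
    for index in on_bits:
        # Indices that differ by a multiple of folded_length collide here,
        # which is why this kind of compression is lossy.
        folded[index % folded_length] = 1
    return folded


# Hypothetical example: three features in a large index space.
long_on_bits = [3, 1027, 5000]
short = fold_fingerprint(long_on_bits, 1024)
set_positions = [i for i, bit in enumerate(short) if bit]
```

In this example indices 3 and 1027 fall on the same folded position, so the short vector has fewer set bits than the original and the original indices cannot be recovered from it.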

Cited by 49 publications (83 citation statements)
References 29 publications
“…The following methods support the compression of FASTQ files. The GenCompress method [101], first, maps short sequences to a reference genome; then, it encodes the addresses of the short sequences, their length and their probable substitutions, using entropy coding algorithms, e.g., Golomb [97], Elias Gamma [102], MOV (Monotone Value) [103] or Huffman coding [59]. Similar to GenCompress, the G-SQZ scheme [104] employs Huffman coding; however, it does compression without altering the relative order.…”
Section: Fastq (mentioning)
confidence: 99%
“…The researching methods are to improve storage and retrieval times of fingerprints [27]. Instead of simply storing the fingerprints as a string of binary bits, the researchers also calculate a new fingerprint representation based on Golomb and Golomb-Rice Codes.…”
Section: Fingerprints With Entropy Codes (mentioning)
confidence: 99%
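
The citation statement above refers to a fingerprint representation based on Golomb and Golomb-Rice codes. As a rough illustration of the general idea, the sketch below Rice-codes the gaps between consecutive on-bit indices of a sparse binary fingerprint. It is not the paper's actual scheme: the function names, the gap convention, and the fixed parameter `k` are all assumptions made for this example.

```python
def indices_to_gaps(indices):
    """Turn on-bit indices into the counts of zeros between set bits."""
    gaps, previous = [], -1
    for index in sorted(set(indices)):
        gaps.append(index - previous - 1)
        previous = index
    return gaps


def gaps_to_indices(gaps):
    """Inverse of indices_to_gaps."""
    indices, previous = [], -1
    for gap in gaps:
        previous += gap + 1
        indices.append(previous)
    return indices


def rice_encode(values, k):
    """Rice code (Golomb code with divisor 2**k) as a string of '0'/'1'."""
    pieces = []
    for value in values:
        quotient, remainder = value >> k, value & ((1 << k) - 1)
        pieces.append("1" * quotient + "0")  # unary quotient, 0-terminated
        if k:
            pieces.append(format(remainder, f"0{k}b"))  # k-bit remainder
    return "".join(pieces)


def rice_decode(bitstring, k):
    """Decode a string produced by rice_encode with the same k."""
    values, i = [], 0
    while i < len(bitstring):
        quotient = 0
        while bitstring[i] == "1":
            quotient += 1
            i += 1
        i += 1  # skip the terminating '0' of the unary part
        remainder = int(bitstring[i:i + k], 2) if k else 0
        i += k
        values.append((quotient << k) | remainder)
    return values


# Hypothetical fingerprint with four set bits.
on_bits = [3, 10, 11, 40]
encoded = rice_encode(indices_to_gaps(on_bits), k=2)
decoded = gaps_to_indices(rice_decode(encoded, k=2))
```

Unlike modulo folding, this round trip is lossless: `decoded` equals the original list of on-bit indices. In practice the parameter `k` would be chosen from the statistics of the gaps rather than fixed by hand.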
“…Indeed, recent compression schemes have shown that it is effective to view a genomic string with respect to a compression scheme that represents a string in terms of its differences with a reference string, R (e.g., see [4]). That is, we can start from a reference string, R, which contains the most common components of a typical genomic string.…”
Section: Exploiting Genomic Data Distributions (mentioning)
confidence: 99%
“…All the sequences were aligned to the reference sequence and, for each sequence, the indices of the location of each variation were recorded together with the type (substitution, insertion, deletion) and content of each variation. This step is also essential if one is interested in compressing the data [4], for example. Statistics for the number of substitutions, deletions, and insertions for this data set of 1000 mtDNA sequences is given in Table 1. Of the 1000 sequences, 453 have only substitution events with respect to the reference string, R = rCRS.…”
Section: Experimental Analysis (mentioning)
confidence: 99%