2006 Fortieth Asilomar Conference on Signals, Systems and Computers
DOI: 10.1109/acssc.2006.355081

An Improved Minimum Description Length Learning Algorithm for Nucleotide Sequence Analysis

Abstract: We present an improved Minimum Description Length (MDL) learning algorithm, MDLCompress, for nucleotide sequence analysis. It compresses better than other grammar-based coding methods such as DNA Sequitur while retaining a two-part code that highlights biologically significant phrases. Phrases are recursively added to the MDLCompress model; they are not necessarily the longest matches or the most often repeated phrases of a given length, but are chosen for a combination of length and repetition such that inclusion …

Citations: cited by 5 publications (4 citation statements)
References: 20 publications
“…The algorithm devised by Evans et al (2006) (also Markham et al 2009) searches for the best set of substrings to encode an input string according to the proposed Optimal Symbol Compression Ratio (OSCR) (Evans et al 2003). The algorithm, which has been applied primarily to analyse genetic sequences (Evans et al 2007), is iterative, at each step picking the substring that compresses most and replacing it by a temporary code.…”
Section: Mining Substrings (mentioning)
confidence: 99%
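
The iterative scheme described in this statement (pick the substring that compresses most, replace it with a temporary code, repeat) can be sketched roughly as follows. This is an illustration only, not the published algorithm: the saving formula is a toy stand-in rather than OSCR, and the names `best_phrase`, `greedy_compress`, `expand` and the `max_len` parameter are hypothetical.

```python
def count_nonoverlapping(seq, phrase):
    """Count left-to-right, non-overlapping occurrences of phrase in seq."""
    count, i, n = 0, 0, len(phrase)
    while i <= len(seq) - n:
        if tuple(seq[i:i + n]) == phrase:
            count += 1
            i += n
        else:
            i += 1
    return count


def best_phrase(seq, max_len=8):
    """Return (saving, phrase) for the candidate with the largest toy saving."""
    candidates = {tuple(seq[i:i + n])
                  for n in range(2, max_len + 1)
                  for i in range(len(seq) - n + 1)}
    best = (0, None)
    for phrase in sorted(candidates):
        n = len(phrase)
        r = count_nonoverlapping(seq, phrase)
        # ASSUMPTION: toy cost, not OSCR. Each occurrence shrinks from n
        # symbols to 1; storing the phrase in the codebook costs n + 1.
        saving = r * (n - 1) - (n + 1)
        if saving > best[0]:
            best = (saving, phrase)
    return best


def replace(seq, phrase, code):
    """Replace non-overlapping occurrences of phrase with a single code symbol."""
    out, i, n = [], 0, len(phrase)
    while i < len(seq):
        if tuple(seq[i:i + n]) == phrase:
            out.append(code)
            i += n
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_compress(text, max_len=8):
    """Repeatedly pull out the phrase with the best saving until none helps."""
    seq, codebook = list(text), {}
    while True:
        saving, phrase = best_phrase(seq, max_len)
        if phrase is None:  # no candidate reduces the total description
            break
        code = f"<{len(codebook)}>"  # temporary code; cannot clash with 1-char symbols
        codebook[code] = phrase
        seq = replace(seq, phrase, code)
    return seq, codebook


def expand(seq, codebook):
    """Invert greedy_compress by recursively expanding code symbols."""
    out = []
    for s in seq:
        out.extend(expand(codebook[s], codebook) if s in codebook else [s])
    return out
```

Calling `greedy_compress("ACGTACGTACGTTT")` returns a shortened symbol list plus a codebook, and `"".join(expand(seq, codebook))` recovers the input. Because a phrase is accepted only when its saving is positive, the combined size of sequence and codebook shrinks at every step, so the loop terminates.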
“…(2) MDLcompress was designed with the express intent of estimating the algorithmic minimum sufficient statistic, and thus has more stringent separation of model and data costs and more specific model cost calculations resulting in greater specificity. (3) As described in [21] and will be discussed in later sections, the computational architecture of MDLcompress differs from the suffix tree with counts architecture of GREEDY. Specifically, MDLcompress gathers statistics in a single pass and then updates the data structure and statistics after selecting each phrase, as opposed to GREEDY's practice of reforming the suffix tree with counts data structure at each iteration.…”
Section: Example (mentioning)
confidence: 99%
“…In this paper, we describe initial results of miRNA analysis using OSCR and introduce improvements to OSCR that reduce execution time and enhance its capacity to identify biologically meaningful sequence. These modifications, some of which were first introduced in [21], retain the deep recursion of the original algorithm but exploit novel data structures that make more efficient use of time and memory by gathering phrase statistics in a single pass and subsequently selecting multiple codebook phrases. Our data structure incorporates candidate phrase frequency information and pointers identifying location of candidate phrases in the sequence, enabling efficient computation.…”
Section: Introduction (mentioning)
confidence: 99%
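
A data structure of the kind this statement describes, holding candidate phrase frequencies together with pointers to their locations, might look like the sketch below. It is an assumption-laden illustration, not the authors' structure: the function name `build_phrase_index` and the `min_len`/`max_len` bounds are hypothetical, and overlapping occurrences are counted as-is.

```python
from collections import defaultdict


def build_phrase_index(seq, min_len=2, max_len=6):
    """Map each repeated phrase to the list of its start positions.

    The sequence is scanned once from left to right; a phrase's frequency
    is simply the length of its position list.
    """
    index = defaultdict(list)
    for start in range(len(seq)):
        for n in range(min_len, max_len + 1):
            if start + n > len(seq):
                break
            index[seq[start:start + n]].append(start)
    # Keep only phrases that repeat; singletons cannot be codebook candidates.
    return {phrase: pos for phrase, pos in index.items() if len(pos) > 1}
```

For example, `build_phrase_index("ACGTACGT")["ACGT"]` gives the start positions of both copies of `ACGT`. Storing positions alongside counts is what would let a selected phrase be replaced, and its neighbours' statistics adjusted, without rescanning the whole sequence.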