An Improved Minimum Description Length Learning Algorithm for Nucleotide Sequence Analysis

Evans, S.; Markham, S.; Torres, Andrew S.; Kourtidis, Antonis; Conklin, Douglas S.

doi:10.1109/acssc.2006.355081

“…The algorithm devised by Evans et al (2006) (also Markham et al 2009) searches for the best set of substrings to encode an input string according to the proposed Optimal Symbol Compression Ratio (OSCR) (Evans et al 2003). The algorithm, which has been applied primarily to analyse genetic sequences (Evans et al 2007), is iterative, at each step picking the substring that compresses most and replacing it by a temporary code.…”

Section: Mining Substrings

mentioning

confidence: 99%
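
The iterative procedure described in the statement above (pick the substring that compresses most, replace it with a temporary code, repeat) can be illustrated with a minimal Python sketch. This is not the OSCR criterion or the authors' implementation: the gain formula, the `max_len` limit, the function names `greedy_compress` and `expand`, and the use of private-use Unicode characters as temporary codes are all assumptions made for illustration.

```python
def greedy_compress(seq, max_len=8):
    """Greedy phrase replacement (illustrative only, not the OSCR criterion).

    Assumes seq contains no characters from U+E000 upward, because those
    are used here as hypothetical temporary codes.
    """
    codebook = {}
    next_code = 0xE000
    while True:
        best_phrase, best_gain = None, 0
        seen = set()
        for length in range(2, min(max_len, len(seq)) + 1):
            for start in range(len(seq) - length + 1):
                phrase = seq[start:start + length]
                if phrase in seen:
                    continue
                seen.add(phrase)
                n = seq.count(phrase)  # non-overlapping occurrences
                # Assumed toy cost model: each replacement saves
                # (length - 1) symbols; storing the phrase in the
                # codebook costs (length + 1) symbols.
                gain = n * (length - 1) - (length + 1)
                if gain > best_gain:
                    best_phrase, best_gain = phrase, gain
        if best_phrase is None:
            return seq, codebook
        code = chr(next_code)
        next_code += 1
        codebook[code] = best_phrase
        seq = seq.replace(best_phrase, code)


def expand(seq, codebook):
    """Undo the replacements, newest code first."""
    for code in sorted(codebook, reverse=True):
        seq = seq.replace(code, codebook[code])
    return seq


original = "ACGTACGTACGTTTACGT"
compressed, codebook = greedy_compress(original)
restored = expand(compressed, codebook)
```

Because later phrases may themselves contain earlier codes, `expand` undoes the newest code first so that nested phrases are restored correctly.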

The minimum description length principle for pattern mining: a survey

Galbrun 2022

Data Min Knowl Disc

Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The Minimum Description Length (MDL) principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim to obtain compact high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, we review MDL-based methods for mining different kinds of patterns from various types of data. Finally, we open a discussion on some issues regarding these methods.

“…(2) MDLcompress was designed with the express intent of estimating the algorithmic minimum sufficient statistic, and thus has more stringent separation of model and data costs and more specific model cost calculations resulting in greater specificity. (3) As described in [21] and will be discussed in later sections, the computational architecture of MDLcompress differs from the suffix tree with counts architecture of GREEDY. Specifically, MDLcompress gathers statistics in a single pass and then updates the data structure and statistics after selecting each phrase as opposed to GREEDY's practice of reforming the suffix tree with counts data structure at each iteration.…”

Section: Example

mentioning

confidence: 99%

“…In this paper, we describe initial results of miRNA analysis using OSCR and introduce improvements to OSCR that reduce execution time and enhance its capacity to identify biologically meaningful sequence. These modifications, some of which were first introduced in [21], retain the deep recursion of the original algorithm but exploit novel data structures that make more efficient use of time and memory by gathering phrase statistics in a single pass and subsequently selecting multiple codebook phrases. Our data structure incorporates candidate phrase frequency information and pointers identifying location of candidate phrases in the sequence, enabling efficient computation.…”

Section: Introduction

mentioning

confidence: 99%
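
The statement above mentions a data structure that holds candidate phrase frequencies together with pointers to where each phrase occurs, gathered in a single pass. A minimal Python sketch of that idea follows. It is an assumption-laden illustration, not the MDLcompress data structure: the name `build_phrase_index`, the length bounds, and the dictionary-of-lists layout are all hypothetical.

```python
from collections import defaultdict


def build_phrase_index(seq, min_len=2, max_len=6):
    """Map each candidate phrase to the list of its start positions.

    One left-to-right pass over the start positions; the frequency of a
    phrase is the length of its position list. Recorded positions may
    overlap. Hypothetical layout, not the published data structure.
    """
    index = defaultdict(list)
    for start in range(len(seq)):
        for length in range(min_len, max_len + 1):
            if start + length > len(seq):
                break
            index[seq[start:start + length]].append(start)
    return index


index = build_phrase_index("ACGTACGT", min_len=2, max_len=4)
acgt_positions = index["ACGT"]       # where the phrase occurs
acgt_frequency = len(acgt_positions) # how often it occurs
```

Keeping positions alongside counts means that, once a phrase is selected, only the entries overlapping its occurrences need updating, rather than rebuilding the statistics from scratch.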

MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress

Evans, Kourtidis, Markham et al. 2007

EURASIP Journal on Bioinformatics and Systems Biology

Self Cite

We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify biologically meaningful sequence without needlessly restrictive priors. The ability to quantify cost in bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity. MDLcompress improves on our previous algorithm in execution time through an innovative data structure, and in specificity of motif detection (compression) through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 has identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation.

Cited by 5 publications

References 20 publications
