2012
DOI: 10.1093/nar/gks1285

One size does not fit all: On how Markov model order dictates performance of genomic sequence analyses

Abstract: The structural simplicity and ability to capture serial correlations make Markov models a popular modeling choice in several genomic analyses, such as identification of motifs, genes and regulatory elements. A critical, yet relatively unexplored, issue is the determination of the order of the Markov model. Most biological applications use a predetermined order for all data sets indiscriminately. Here, we show the vast variation in the performance of such applications with the order. To identify the ‘optimal’ o…

citations
Cited by 23 publications
(21 citation statements)
references
References 41 publications
“…The random simulation is reasonable as most of virus-host pairs should not interact. To model the background of the sequences, we used Bayesian information criterion (BIC) to estimate the MC order for the 352 phage sequences as in (35). About 70% of the phage sequences have an estimated MC order of 2.…”
Section: Results (mentioning)
confidence: 99%
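
The statement above refers to choosing a Markov chain (MC) order with the Bayesian information criterion. As a rough illustration only, the sketch below scores each candidate order k by BIC = -2 log L + (number of free parameters) × log(number of observations) and returns the order with the lowest score. The function names, the default `max_order`, and the choice to condition every order on the same first `max_order` symbols are assumptions made for this example; they are not taken from the cited papers.

```python
import math
from collections import Counter


def markov_log_likelihood(seq, order, start):
    """Maximized log-likelihood of seq[start:] under a Markov chain of the given order.

    Transition probabilities are the maximum-likelihood estimates
    (count of context followed by symbol) / (count of context).
    """
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(start, len(seq)):
        context = seq[i - order:i]  # empty string when order == 0
        transition_counts[(context, seq[i])] += 1
        context_counts[context] += 1
    return sum(
        n * math.log(n / context_counts[context])
        for (context, _symbol), n in transition_counts.items()
    )


def bic_order(seq, max_order=5, alphabet_size=4):
    """Return the order in 0..max_order with the smallest BIC (hypothetical helper)."""
    n_obs = len(seq) - max_order  # same observations scored for every order
    if n_obs <= 0:
        raise ValueError("sequence must be longer than max_order")
    best_order, best_bic = 0, math.inf
    for k in range(max_order + 1):
        log_lik = markov_log_likelihood(seq, k, start=max_order)
        n_params = (alphabet_size ** k) * (alphabet_size - 1)
        bic = -2.0 * log_lik + n_params * math.log(n_obs)
        if bic < best_bic:  # strict comparison keeps the lower order on ties
            best_order, best_bic = k, bic
    return best_order
```

For example, `bic_order("ACGT" * 200)` selects order 1 in this sketch, because each base is fully determined by the one before it and higher orders only add parameters.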
“…We briefly describe the different measures here. For measures that do not consider background k-mer frequencies, we used several common methods for computing the distance between two vectors, in this case observed k-mer frequencies of each pair of host and viral sequences: Euclidean distance (Eu), Manhattan distance (Ma), Chebyshev distance (Ch), $d_2$ (34) and Jensen-Shannon divergence (JS) (35). The background normalization methods, including $d_2^*$, $d_2^S$ (23), Hao (36,37), Teeling (38), EuF (20) and Willner (39), incorporate different forms of sequence background models to compute the divergence between the observed and expected k-mer frequencies to eliminate the effect of the background average k-mer counts and enhance the signal of differences between the host and viral sequences.…”
Section: Methods (mentioning)
confidence: 99%
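
Four of the background-free measures named in the statement above (Eu, Ma, Ch and JS) have standard textbook definitions, which the following sketch applies to k-mer frequency vectors. It is an illustration under assumptions, not the cited authors' implementation: the helper names are hypothetical, the alphabet is assumed to be A/C/G/T, and the Jensen-Shannon divergence uses base-2 logarithms. The $d_2$ measure and the background-normalized measures are deliberately left out.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k, alphabet="ACGT"):
    """Relative frequencies of all |alphabet|**k k-mers, in lexicographic order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)  # ignores k-mers with other symbols
    if total == 0:
        raise ValueError("no valid k-mers found in sequence")
    return [counts[m] / total for m in kmers]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits; zero-probability terms contribute 0."""
    def kl(x, m):
        return sum(a * math.log2(a / b) for a, b in zip(x, m) if a > 0)

    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

A typical use would be to compute `kmer_frequencies(host_seq, k)` and `kmer_frequencies(virus_seq, k)` for a host-virus pair (both variable names are placeholders) and pass the two vectors to any of the four distance functions.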
“…To estimate the order of MC based on the NGS sample for each of the 28 species, we apply the order estimator $r_{S_k}$ in (12); there is no sharp ratio transition found over k = 2, …, 14. Given that real genomes consist of multiple types of regions (coding, non-coding and regulatory regions) and each type may fit to different MC models, the result indicates that no suitable MC model can adequately fit all the patterns in the genome.…”
Section: Applications To the Study Of Relationships Among Organisms (mentioning)
confidence: 99%
“…This is the case, for example for the aforementioned HOT regions (12), many of which are known to act as early developmental enhancers (72). Partially, this phenomenon can be explained by the inevitable limitations of models used for predicting the TFs' DNA binding preferences (73, 74). However, it is also possible that TF recruitment to these regions is facilitated or strengthened by protein-protein interactions (69, 75–77), as in the ‘TF collective’ model that we have recently proposed (69, 78).…”
Section: The Output Of TF Binding As a ‘Dose Of Activation’ (mentioning)
confidence: 99%