Statistical Measures of DNA Sequence Dissimilarity under Markov Chain Models of Base Composition

Wu, Ta Jen; Hsieh, Ya-Ching; Li, Lung-An

doi:10.1111/j.0006-341x.2001.00441.x

Cited by 109 publications

(78 citation statements)

References 15 publications


“…Whereas the lack of significant differences among Mahalanobis-distance-based methods can be partly attributed to the modest number of iterations used in the simulations (100) and the conservative Tukey criterion based on 18 means, we believe that only strong differences were of interest for attempting any general conclusions from these simulations. It is particularly interesting that the Euclidean distance-based method d_E(β_i, β_j) did not perform as well as our method, because in many multivariate applications a Euclidean distance is proposed as a computationally less expensive approximation to a Mahalanobis distance (e.g., Wu et al. 2001).…”

Section: Discussion (mentioning)

confidence: 91%

A method for assigning species into groups based on generalized Mahalanobis distance between habitat model coefficients

Williams, Heglund (2008). Environ Ecol Stat


Habitat association models are commonly developed for individual animal species using generalized linear modeling methods such as logistic regression. We considered the issue of grouping species based on their habitat use so that management decisions can be based on sets of species rather than individual species. This research was motivated by a study of western landbirds in northern Idaho forests. The method we examined was to separately fit models to each species and to use a generalized Mahalanobis distance between coefficient vectors to create a distance matrix among species. Clustering methods were used to group species from the distance matrix, and multidimensional scaling methods were used to visualize the relations among species groups. Methods were also discussed for evaluating the sensitivity of the conclusions because of outliers or influential data points. We illustrate these methods with data from the landbird study conducted in northern Idaho. Simulation results are presented to compare the success of this method to alternative methods using Euclidean distance between coefficient vectors and to methods that do not use habitat association models. These simulations demonstrate that our Mahalanobis-distance-based method was nearly always better than Euclidean-distance-based methods or methods not based on habitat association models. The methods used to develop candidate species groups are easily explained to other scientists and resource managers since they mainly rely on classical multivariate statistical methods.
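The contrast between Euclidean and Mahalanobis distances on coefficient vectors, raised in the citation statement for this entry, can be illustrated with a minimal sketch. It assumes the squared distance takes the form (b_i - b_j)' (S_i + S_j)^(-1) (b_i - b_j), where S_i and S_j are the estimated covariance matrices of the two coefficient vectors; this form, the function names, and the numbers are illustrative assumptions and may differ from the generalized distance actually used by Williams and Heglund.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def euclidean_sq(bi, bj):
    """Squared Euclidean distance between two coefficient vectors."""
    return sum((p - q) ** 2 for p, q in zip(bi, bj))


def mahalanobis_sq(bi, bj, cov_i, cov_j):
    """Squared Mahalanobis-type distance using the summed covariance (an assumed form)."""
    diff = [p - q for p, q in zip(bi, bj)]
    pooled = [[x + y for x, y in zip(ri, rj)] for ri, rj in zip(cov_i, cov_j)]
    w = solve(pooled, diff)  # pooled^{-1} diff without forming the inverse
    return sum(d * v for d, v in zip(diff, w))


# Hypothetical coefficient vectors and covariance estimates for two species.
b1, b2 = [1.0, 2.0], [0.0, 0.0]
cov = [[0.5, 0.0], [0.0, 2.0]]
d_euclid = euclidean_sq(b1, b2)
d_mahal = mahalanobis_sq(b1, b2, cov, cov)
```

In this sketch the second coefficient is estimated with more variance, so its difference is down-weighted by the Mahalanobis-type distance, whereas the Euclidean distance treats both coefficients equally.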


“…The most commonly used measures are Euclidean distance, d² distance (a weighted Euclidean distance), Mahalanobis distance and Kullback-Leibler discrepancy (KLD). Since Wu, Hsieh, and Li (2001) find in their experiments that KLD provides good results while it still can be computed as fast as Euclidean distance, it is also used here. Since KLD becomes −∞ for counts of zero, we add one to all counts which conceptually means that we start building the EMM with a prior that all triplets have the equal occurrence probability (see Wu et al. 2001).…”

Section: Genetic Sequence Analysis (mentioning)

confidence: 99%

rEMM: Extensible Markov Model for Data Stream Clustering in R

Hahsler, Dunham (2010). J. Stat. Soft.


Clustering streams of continuously arriving data has become an important application of data mining in recent years and efficient algorithms have been proposed by several researchers. However, clustering alone neglects the fact that data in a data stream is not only characterized by the proximity of data points which is used by clustering, but also by a temporal component. The extensible Markov model (EMM) adds the temporal component to data stream clustering by superimposing a dynamically adapting Markov chain. In this paper we introduce the implementation of the R extension package rEMM which implements EMM and we discuss some examples and applications.
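The add-one adjustment described in the citation statement for this entry can be sketched as follows. This is a minimal illustration, assuming overlapping triplets over the alphabet A, C, G, T and the standard relative-entropy form sum_w p_w log2(p_w / q_w); the function names are hypothetical, and the exact discrepancy defined by Wu, Hsieh, and Li (2001) or implemented in rEMM may differ in detail.

```python
from itertools import product
from math import log2

ALPHABET = "ACGT"


def triplet_probs(seq, k=3):
    """Relative frequencies of overlapping k-letter words, with one added to every count."""
    # Start every one of the 4**k possible words at 1 (the uniform prior).
    counts = {"".join(w): 1 for w in product(ALPHABET, repeat=k)}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing ambiguous symbols such as N
            counts[word] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kld(p, q):
    """Kullback-Leibler discrepancy of p from q, in bits; finite because no entry is zero."""
    return sum(p[w] * log2(p[w] / q[w]) for w in p)


# Hypothetical toy sequences.
p = triplet_probs("ACGTACGTACGTAC")
q = triplet_probs("AAAAAAAAAAAAAA")
divergence = kld(p, q)
```

Because every triplet starts with a count of one, no probability is zero and the logarithm is always defined. Note that this form is not symmetric: kld(p, q) and kld(q, p) generally differ.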


“…Sensitivity and selectivity were computed to evaluate and compare the performance of the proposed models with other distance measures [33]. Sensitivity is expressed by the number of A. testaceum related sequences found among the first closest five library sequences.…”

Section: Similarity Search (mentioning)

confidence: 99%

Use of statistical measures for analyzing RNA secondary structures

Dai, Wang (2008). J Comput Chem


With more and more RNA secondary structures accumulated, the need for comparing different RNA secondary structures often arises in function prediction and evolutionary analysis. Numerous efficient algorithms were developed for comparing different RNA secondary structures, but challenges remain. In this article, a new statistical measure extending the notion of relative entropy based on the proposed stochastic model is evaluated for RNA secondary structures. The results obtained from several experiments on real datasets have shown the effectiveness of the proposed approach. Moreover, the time complexity of our method is favorable by comparing with that of the existing methods which solve the similar problem.
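The sensitivity criterion quoted in the citation statement for this entry (the number of related sequences among the five closest library sequences) can be sketched as below. The function name, sequence labels, and distance values are hypothetical and serve only to illustrate the counting rule; they are not data from the cited study.

```python
def sensitivity_at_k(distances, related, k=5):
    """Count how many of the k closest library sequences belong to the related set."""
    ranked = sorted(distances, key=distances.get)  # smallest distance first
    return sum(1 for name in ranked[:k] if name in related)


# Hypothetical distances from a query to seven library sequences.
distances = {"s1": 0.10, "s2": 0.40, "s3": 0.20, "s4": 0.90,
             "s5": 0.30, "s6": 0.50, "s7": 0.05}
related = {"s1", "s3", "s4"}  # assumed set of truly related sequences
hits = sensitivity_at_k(distances, related)
```

Here the five closest sequences are s7, s1, s3, s5, and s2, of which s1 and s3 are in the related set, so the count is 2.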
