1990
DOI: 10.1002/prot.340070105

An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences

Abstract: Statistical methodology for the identification and characterization of protein binding sites in a set of unaligned DNA fragments is presented. Each sequence must contain at least one common site. No alignment of the sites is required. Instead, the uncertainty in the location of the sites is handled by employing the missing information principle to develop an "expectation maximization" (EM) algorithm. This approach allows for the simultaneous identification of the sites and characterization of the binding motif…
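The idea in the abstract, treating the unknown site locations as missing data, can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes exactly one site per sequence (a simplification of "at least one"), a fixed background estimated from overall letter frequencies, a uniform prior over start positions, and a random starting motif. The function name `em_motif` and all of its parameters are hypothetical.

```python
import random

ALPHABET = "ACGT"

def em_motif(seqs, width, n_iter=50, pseudocount=0.1, seed=0):
    """EM for a one-site-per-sequence motif model (an assumed simplification)."""
    rng = random.Random(seed)
    idx = {c: i for i, c in enumerate(ALPHABET)}

    # Background model: overall letter frequencies, held fixed throughout.
    counts = [pseudocount] * 4
    for s in seqs:
        for c in s:
            counts[idx[c]] += 1
    total = sum(counts)
    background = [c / total for c in counts]

    # Motif model: random position-specific letter probabilities to start.
    motif = []
    for _ in range(width):
        row = [rng.random() + 0.5 for _ in ALPHABET]
        z = sum(row)
        motif.append([v / z for v in row])

    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior probability of each start position in each
        # sequence, proportional to the motif/background likelihood ratio.
        posteriors = []
        for s in seqs:
            weights = []
            for start in range(len(s) - width + 1):
                w = 1.0
                for j in range(width):
                    k = idx[s[start + j]]
                    w *= motif[j][k] / background[k]
                weights.append(w)
            z = sum(weights)
            posteriors.append([w / z for w in weights])

        # M-step: re-estimate the motif from posterior-weighted letter counts.
        new = [[pseudocount] * 4 for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for start, p in enumerate(post):
                for j in range(width):
                    new[j][idx[s[start + j]]] += p
        motif = [[v / sum(row) for v in row] for row in new]

    # The posteriors come from the last E-step, before the final M-step.
    return motif, posteriors
```

Calling `em_motif(seqs, width=6)` on a list of DNA strings returns the estimated position-specific probabilities together with, for each sequence, a distribution over candidate start positions. Because EM only finds a local optimum, a practical run would likely try several seeds.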

Cited by 448 publications (337 citation statements)
References 20 publications
“…The development of these methods has been motivated by the current rapid increase in sequence data because relatively large sets (containing, for example, more than 15 sequences) are needed for weakly conserved patterns to reach statistical significance. Lawrence et al (1993) describe a Gibbs sampling strategy for detecting conserved patterns in multiple sequences that is a stochastic analog of earlier expectation-maximization methods (Lawrence & Reilly, 1990;Cardon & Stormo, 1992) and that is closely related to (EM-based) hidden Markov model multiple sequence alignment methods (Baldi et al, 1994;Krogh et al, 1994), which, unlike the Gibbs sampler, permit gaps anywhere in the sequences. This Gibbs sampler (which is referred to here as the site sampler) addresses the problem of finding motifs when the number of occurrences of each motif in each sequence is assumed.…”
Citation type: mentioning
confidence: 99%
“…Motif discovery methods usually fall in one of two main categories: (i) enumerative, which examine the frequency of all DNA strings and compute overrepresented strings to form a PWM [76][77][78][79] and (ii) probabilistic, which tackle the problem by creating a multiple local alignment of all sequences while simultaneously learning the PWM parameters using methods like expectation-maximization [80][81][82][83], Gibbs sampling [84][85][86][87][88][89] or greedy approaches [90]. Each category has certain advantages over the other.…”
Section: Identification of TF Binding Specificities
Citation type: mentioning
confidence: 99%
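The enumerative category described in the statement above can be sketched in a few lines of Python. This is an illustration only: it ranks k-mers by raw count, whereas the cited methods assess overrepresentation against an expected frequency, and the function name `overrepresented_kmers` is hypothetical.

```python
from collections import Counter

def overrepresented_kmers(seqs, k, top=3):
    """Count every length-k substring across all sequences and
    return the `top` most frequent ones with their counts."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts.most_common(top)
```

The most frequent strings would then be aligned to build a PWM. The probabilistic category instead learns the PWM and the site locations jointly, which is the approach the EM algorithm takes.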
“…The major advantage of EM over k-Means is its ability to model a much richer set of cluster shapes. This generality has made EM (and its many variants and extensions) the clustering algorithm of choice in data mining [7] and bioinformatics [17]. …”
Section: I-EM with Expectation Maximization (EM)
Citation type: mentioning
confidence: 99%
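The contrast drawn in the statement above can be made concrete with a small Python sketch, assuming a one-dimensional mixture of two Gaussians. k-Means assigns a point to the nearest center, while the EM E-step weighs each component's spread and mixing weight. The function names and the numbers in the usage note are illustrative assumptions, not taken from the cited paper.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def hard_assign(x, means):
    """k-Means style assignment: index of the nearest cluster center."""
    return min(range(len(means)), key=lambda k: abs(x - means[k]))

def responsibilities(x, means, sds, weights):
    """EM E-step: posterior probability of each mixture component given x."""
    joint = [w * gaussian_pdf(x, m, s) for m, s, w in zip(means, sds, weights)]
    total = sum(joint)
    return [j / total for j in joint]
```

With a wide component centered at 0 (standard deviation 3) and a narrow one centered at 3 (standard deviation 0.1), the point 2.0 is nearer to the second center, so `hard_assign` picks it. `responsibilities` places almost all of the probability on the first component, because the narrow component assigns negligible density that far from its mean.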