Rahul Siddharthan scite author profile


Rapid evolution of Cse4p-rich centromeric DNA sequences in closely related pathogenic yeasts, Candida albicans and Candida dubliniensis

Padmanabhan

Thakur

Siddharthan

et al. 2008

Proc. Natl. Acad. Sci. U.S.A.

104


The Cse4p-containing centromere regions of Candida albicans have unique and different DNA sequences on each of the eight chromosomes. In a closely related yeast, C. dubliniensis, we have identified the centromeric histone, CdCse4p, and shown that it is localized at the kinetochore. We have identified putative centromeric regions, orthologous to the C. albicans centromeres, in each of the eight C. dubliniensis chromosomes by bioinformatic analysis. Chromatin immunoprecipitation followed by PCR using a specific set of primers confirmed that these regions bind CdCse4p in vivo. As in C. albicans, the CdCse4p-associated core centromeric regions are 3-5 kb in length and show no sequence similarity to one another. Comparative sequence analysis suggests that the Cse4p-rich centromere DNA sequences in these two species have diverged faster than other orthologous intergenic regions and even faster than our best estimated “neutral” mutation rate. However, the location of the centromere and the relative position of Cse4p-rich centromeric chromatin in the orthologous regions with respect to adjacent ORFs are conserved in both species, suggesting that centromere identity is not solely determined by DNA sequence. Unlike known point and regional centromeres of other organisms, centromeres in C. albicans and C. dubliniensis have no common centromere-specific sequence motifs or repeats except some of the chromosome-specific pericentric repeats that are found to be similar in these two species. We propose that centromeres of these two Candida species are of an intermediate type between point and regional centromeres.

Keywords: chromatin, chromosome segregation, kinetochore, nucleosome, pericentric


PhyloGibbs: A Gibbs Sampling Motif Finder that Incorporates Phylogeny

Siddharthan

Siggia

Nimwegen

2005

PLoS Comp Biol

103


A central problem in the bioinformatics of gene regulation is to find the binding sites for regulatory proteins. One of the most promising approaches toward identifying these short and fuzzy sequence patterns is the comparative analysis of orthologous intergenic regions of related species. This analysis is complicated by various factors. First, one needs to take the phylogenetic relationship between the species into account in order to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to evolutionary proximity. Second, one has to deal with the complexities of multiple alignments of orthologous intergenic regions, and one has to consider the possibility that functional sites may occur outside of conserved segments. Here we present a new motif sampling algorithm, PhyloGibbs, that runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors (TFs) can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model that treats aligned sequences by a model for the evolution of binding sites and “background” intergenic DNA. This model takes the phylogenetic relationship between the species in the alignment explicitly into account. The algorithm uses simulated annealing and Monte Carlo Markov-chain sampling to rigorously assign posterior probabilities to all the binding sites that it reports. In tests on synthetic data and real data from five Saccharomyces species our algorithm performs significantly better than four other motif-finding algorithms, including algorithms that also take phylogeny into account. Our results also show that, in contrast to the other algorithms, PhyloGibbs can make realistic estimates of the reliability of its predictions.
Our tests suggest that, running on the five-species multiple alignment of a single gene's upstream region, PhyloGibbs on average recovers over 50% of all binding sites in S. cerevisiae at a specificity of about 50%, and 33% of all binding sites at a specificity of about 85%. We also tested PhyloGibbs on collections of multiple alignments of intergenic regions that were recently annotated, based on ChIP-on-chip data, to contain binding sites for the same TF. We compared PhyloGibbs's results with the previous analysis of these data using six other motif-finding algorithms. For 16 of 21 TFs for which all other motif-finding methods failed to find a significant motif, PhyloGibbs did recover a motif that matches the literature consensus. In 11 cases where there was disagreement in the results we compiled lists of known target genes from the literature, and found that running PhyloGibbs on their regulatory regions yielded a binding motif matching the literature consensus in all but one of the cases. Interestingly, these literature gene lists had little overlap with the targets annotated based o...


Dinucleotide Weight Matrices for Predicting Transcription Factor Binding Sites: Generalizing the Position Weight Matrix

Siddharthan

2010

PLoS ONE


Background: Identifying transcription factor binding sites (TFBS) in silico is key in understanding gene regulation. TFBS are string patterns that exhibit some variability, commonly modelled as “position weight matrices” (PWMs). Though convenient, the PWM has significant limitations, in particular the assumed independence of positions within the binding motif; and predictions based on PWMs are usually not very specific to known functional sites. Analysis here on binding sites in yeast suggests that correlation of dinucleotides is not limited to near-neighbours, but can extend over considerable gaps.

Methodology/Principal Findings: I describe a straightforward generalization of the PWM model that considers frequencies of dinucleotides instead of individual nucleotides. Unlike previous efforts, this method considers all dinucleotides within an extended binding region, and does not make an attempt to determine a priori the significance of particular dinucleotide correlations. I describe how to use a “dinucleotide weight matrix” (DWM) to predict binding sites, dealing in particular with the complication that its entries are not independent probabilities. Benchmarks show, for many factors, a dramatic improvement over PWMs in precision of predicting known targets. In most cases, significant further improvement arises by extending the commonly defined “core motifs” by about 10 bp on either side. Though this flanking sequence shows no strong motif at the nucleotide level, the predictive power of the dinucleotide model suggests that the “signature” in DNA sequence of protein-binding affinity extends beyond the core protein-DNA contact region.

Conclusion/Significance: While computationally more demanding and slower than PWM-based approaches, this dinucleotide method is straightforward, both conceptually and in implementation, and can serve as a basis for future improvements.
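As a rough illustration of the contrast drawn in this abstract, the sketch below builds a conventional PWM, which scores each position independently, and then tabulates dinucleotide counts over every pair of positions, not just adjacent ones. It is a minimal sketch under stated assumptions, not the paper's method: the function names, the pseudocount, the uniform background, and the toy sites are all hypothetical, and the paper's procedure for scoring with non-independent DWM entries is not reproduced here.

```python
import math
from collections import Counter
from itertools import combinations

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Per-position nucleotide probabilities (hypothetical helper).

    Assumes all sites have equal length; the pseudocount is an
    illustrative choice, not a value taken from the paper.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

def pwm_log_odds(pwm, seq, background=0.25):
    """Sum of per-position log-odds against a uniform background.

    Adding the terms is exactly the independence assumption that
    the abstract identifies as a limitation of PWMs.
    """
    return sum(math.log(col[base] / background) for col, base in zip(pwm, seq))

def dinucleotide_counts(sites):
    """Count base pairs at every pair of positions i < j.

    This only gathers the raw statistics a dinucleotide model would
    start from; it does not turn them into a score.
    """
    length = len(sites[0])
    return {
        (i, j): Counter((site[i], site[j]) for site in sites)
        for i, j in combinations(range(length), 2)
    }

# Hypothetical toy binding sites, for illustration only.
sites = ["ACGT", "ACGA", "TCGT", "ACGT"]
pwm = build_pwm(sites)
pair_counts = dinucleotide_counts(sites)
```

For a motif of length n there are n(n-1)/2 position pairs with 16 possible dinucleotides each, which is one way to see why such a model is more demanding than a PWM with its 4n entries.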

