Genome-wide in vivo protein-DNA interactions are routinely mapped using high-throughput chromatin immunoprecipitation (ChIP). ChIP-reported regions are typically investigated for enriched sequence-motifs, which are likely to model the DNA-binding specificity of the profiled protein and/or of co-occurring proteins. However, simple enrichment analyses can miss insights into the binding-activity of the protein. Note that ChIP reports regions making direct contact with the protein as well as those binding through intermediaries. For example, consider a ChIP experiment targeting protein X, which binds DNA at its cognate sites, but simultaneously interacts with four other proteins. Each of these proteins also binds to its own specific cognate sites along distant parts of the genome, a scenario consistent with the current view of transcriptional hubs and chromatin loops. Since ChIP will pull down all X-associated regions, the final reported data will be a union of five distinct sets of regions, each containing binding sites of one of the five proteins, respectively. Characterizing all five different motifs and the corresponding sets is important to interpret the ChIP experiment and ultimately, the role of X in regulation. We present diversity which attempts exactly this: it partitions the data so that each partition can be characterized with its own de novo motif. Diversity uses a Bayesian approach to identify the optimal number of motifs and the associated partitions, which together explain the entire dataset. This is in contrast to standard motif finders, which report motifs individually enriched in the data, but do not necessarily explain all reported regions. We show that the different motifs and associated regions identified by diversity give insights into the various complexes that may be forming along the chromatin, something that has so far not been attempted from ChIP data. Webserver at http://diversity.ncl.res.in/; standalone (Mac OS X/Linux) from https://github.com/NarlikarLab/DIVERSITY/releases/tag/v1.0.0.
High-throughput sequencing-based assays measure different biochemical activities pertaining to gene regulation, genome-wide. These activities include protein-DNA binding, enhancer-activity, open chromatin, and more. A major goal is to understand underlying sequence components, or motifs, that can explain the measured activity. It is usually not one motif, but a combination of motifs bound by cooperatively acting proteins that confers activity to such regions. Furthermore, although having a single type of activity, the regions can still be diverse, governed by different combinations of proteins/motifs. Current approaches do not take into account this issue of combinatorial diversity. We present a new statistical framework cisDiversity, which models regions as diverse modules characterized by combinations of motifs, while simultaneously learning the motifs themselves. We show that ChIP-seq data for the CTCF protein in fly contains diverse sequence structures, with most direct CTCF-binding sites situated far from promoters, giving insights into its co-factors and potential role in looping. Human CTCF-bound regions, on the other hand, have a different architecture. Because cisDiversity does not rely on knowledge of motifs, modules, cell-type, or organism, it is general enough to be applied to regions reported by most high-throughput assays. Indeed, enhancer predictions resulting from different assays---GRO-cap, STARR-seq, and those measuring chromatin structure---show distinct modules and combinations of TF binding sites, some specific to the assay. No module occurs universally in all enhancer-assays. Finally, analysis of accessible chromatin suggests that regions open in one cell-state encode information about future states, with certain modules staying open and others closing down later. The code is freely available at https://github.com/NarlikarLab/cisDIVERSITY.
Motivation High-throughput chromatin immunoprecipitation (ChIP) sequencing-based assays capture genomic regions associated with the profiled transcription factor (TF). ChIP-exo is a modified protocol, which uses lambda exonuclease to digest DNA close to the TF-DNA complex, in order to improve on the positional resolution of the TF-DNA contact. Because the digestion occurs in the 5′–3′ orientation, the protocol produces directional footprints close to the complex, on both sides of the double stranded DNA. Like all ChIP-based methods, ChIP-exo reports a mixture of different regions associated with the TF: those bound directly to the TF as well as via intermediaries. However, the distribution of footprints are likely to be indicative of the complex forming at the DNA. Results We present ExoDiversity, which uses a model-based framework to learn a joint distribution over footprints and motifs, thus resolving the mixture of ChIP-exo footprints into diverse binding modes. It uses no prior motif or TF information and automatically learns the number of different modes from the data. We show its application on a wide range of TFs and organisms/cell-types. Because its goal is to explain the complete set of reported regions, it is able to identify co-factor TF motifs that appear in a small fraction of the dataset. Further, ExoDiversity discovers small nucleotide variations within and outside canonical motifs, which co-occur with variations in footprints, suggesting that the TF-DNA structural configuration at those regions is likely to be different. Finally, we show that detected modes have specific DNA shape features and conservation signals, giving insights into the structure and function of the putative TF-DNA complexes. Availability and implementation The code for ExoDiversity is available on https://github.com/NarlikarLab/exoDIVERSITY. Supplementary information Supplementary data are available at Bioinformatics online.
High-throughput sequencing-based assays measure different biochemical activities pertaining to gene regulation, genome-wide. These activities include transcription factor (TF)–DNA binding, enhancer activity, open chromatin, and more. A major goal is to understand underlying sequence components, or motifs, that can explain the measured activity. It is usually not one motif but a combination of motifs bound by cooperatively acting proteins that confers activity to such regions. Furthermore, regions can be diverse, governed by different combinations of TFs/motifs. Current approaches do not take into account this issue of combinatorial diversity. We present a new statistical framework, cisDIVERSITY, which models regions as diverse modules characterized by combinations of motifs while simultaneously learning the motifs themselves. Because cisDIVERSITY does not rely on knowledge of motifs, modules, cell type, or organism, it is general enough to be applied to regions reported by most high-throughput assays. For example, in enhancer predictions resulting from different assays—GRO-cap, STARR-seq, and those measuring chromatin structure—cisDIVERSITY discovers distinct modules and combinations of TF binding sites, some specific to the assay. From protein–DNA binding data, cisDIVERSITY identifies potential cofactors of the profiled TF, whereas from ATAC-seq data, it identifies tissue-specific regulatory modules. Finally, analysis of single-cell ATAC-seq data suggests that regions open in one cell-state encode information about future states, with certain modules staying open and others closing down in the next time point.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.