Machine learning methods are often used to classify objects described by hundreds of attributes; in many applications of this kind a great fraction of attributes may be totally irrelevant to the classification problem. Even more, usually one cannot decide a priori which attributes are relevant. In this paper we present an improved version of the algorithm for identification of the full set of truly important variables in an information system. It is an extension of the random forest method which utilises the importance measure generated by the original algorithm. It compares, in the iterative fashion, the importances of original attributes with importances of their randomised copies. We analyse performance of the algorithm on several examples of synthetic data, as well as on a biologically important problem, namely on identification of the sequence motifs that are important for aptameric activity of short RNA sequences.
Chromatin topology is intricately linked to gene expression, yet its functional requirement remains unclear. Here, we comprehensively assessed the interplay between genome topology and gene expression using highly rearranged chromosomes (balancers) spanning ~75% of the Drosophila genome. Using transheterozyte (balancer/wild-type) embryos, we measured allele-specific changes in topology and gene expression in cis , whilst minimizing trans effects. Through genome sequencing, we resolved eight large nested inversions, smaller inversions, duplications, and thousands of deletions. These extensive rearrangements caused many changes to chromatin topology, including long-range loops, TADs and promoter interactions, yet these are not predictive of changes in expression. Gene expression is generally not altered around inversion breakpoints, indicating that mis-appropriate enhancer-promoter activation is a rare event. Similarly, shuffling or fusing TADs, changing intra-TAD connections and disrupting long-range inter-TAD loops, does not alter expression for the majority of genes. Our results suggest that properties other than chromatin topology ensure productive enhancer-promoter interactions.
The SOXE transcription factors SOX8, SOX9 and SOX10 are master regulators of mammalian development directing sex determination, gliogenesis, pancreas specification and neural crest development. We identified a set of palindromic SOX binding sites specifically enriched in regulatory regions of melanoma cells. SOXE proteins homodimerize on these sequences with high cooperativity. In contrast to other transcription factor dimers, which are typically rigidly spaced, SOXE group proteins can bind cooperatively at a wide range of dimer spacings. Using truncated forms of SOXE proteins, we show that a single dimerization (DIM) domain, that precedes the DNA binding high mobility group (HMG) domain, is sufficient for dimer formation, suggesting that DIM : HMG rather than DIM:DIM interactions mediate the dimerization. All SOXE members can also heterodimerize in this fashion, whereas SOXE heterodimers with SOX2, SOX4, SOX6 and SOX18 are not supported. We propose a structural model where SOXE-specific intramolecular DIM:HMG interactions are allosterically communicated to the HMG of juxtaposed molecules. Collectively, SOXE factors evolved a unique mode to combinatorially regulate their target genes that relies on a multifaceted interplay between the HMG and DIM domains. This property potentially extends further the diversity of target genes and cell-specific functions that are regulated by SOXE proteins.
The binding of transcription factors (TFs) to their specific motifs in genomic regulatory regions is commonly studied in isolation. However, in order to elucidate the mechanisms of transcriptional regulation, it is essential to determine which TFs bind DNA cooperatively as dimers and to infer the precise nature of these interactions. So far, only a small number of such dimeric complexes are known. Here, we present an algorithm for predicting cell-type-specific TF-TF dimerization on DNA on a large scale, using DNase I hypersensitivity data from 78 human cell lines. We represented the universe of possible TF complexes by their corresponding motif complexes, and analyzed their occurrence at cell-type-specific DNase I hypersensitive sites. Based on~1.4 billion tests for motif complex enrichment, we predicted 603 highly significant celltype-specific TF dimers, the vast majority of which are novel. Our predictions included 76% (19/25) of the known dimeric complexes and showed significant overlap with an experimental database of protein-protein interactions. They were also independently supported by evolutionary conservation, as well as quantitative variation in DNase I digestion patterns. Notably, the known and predicted TF dimers were almost always highly compact and rigidly spaced, suggesting that TFs dimerize in close proximity to their partners, which results in strict constraints on the structure of the DNA-bound complex. Overall, our results indicate that chromatin openness profiles are highly predictive of cell-type-specific TF-TF interactions. Moreover, cooperative TF dimerization seems to be a widespread phenomenon, with multiple TF complexes predicted in most cell types.
BackgroundCooperative binding of transcription factor (TF) dimers to DNA is increasingly recognized as a major contributor to binding specificity. However, it is likely that the set of known TF dimers is highly incomplete, given that they were discovered using ad hoc approaches, or through computational analyses of limited datasets.ResultsHere, we present TACO (Transcription factor Association from Complex Overrepresentation), a general-purpose standalone software tool that takes as input any genome-wide set of regulatory elements and predicts cell-type–specific TF dimers based on enrichment of motif complexes. TACO is the first tool that can accommodate motif complexes composed of overlapping motifs, a characteristic feature of many known TF dimers. Our method comprehensively outperforms existing tools when benchmarked on a reference set of 29 known dimers. We demonstrate the utility and consistency of TACO by applying it to 152 DNase-seq datasets and 94 ChIP-seq datasets.ConclusionsBased on these results, we uncover a general principle governing the structure of TF-TF-DNA ternary complexes, namely that the flexibility of the complex is correlated with, and most likely a consequence of, inter-motif spacing.
FOXA1 is a transcription factor capable to bind silenced chromatin to direct context-dependent cell fate conversion. Here, we demonstrate that a compact palindromic DNA element (termed ‘DIV’ for its diverging half-sites) induces the homodimerization of FOXA1 with strongly positive cooperativity. Alternative structural models are consistent with either an indirect DNA-mediated cooperativity or a direct protein-protein interaction. The cooperative homodimer formation is strictly constrained by precise half-site spacing. Re-analysis of chromatin immunoprecipitation sequencing data indicates that the DIV is effectively targeted by FOXA1 in the context of chromatin. Reporter assays show that FOXA1-dependent transcriptional activity declines when homodimeric binding is disrupted. In response to phosphatidylinositol-3 kinase inhibition DIV sites pre-bound by FOXA1 such as at the PVT1/MYC locus exhibit a strong increase in accessibility suggesting a role of the DIV configuration in the chromatin closed-open dynamics. Moreover, several disease-associated single nucleotide polymorphisms map to DIV elements and show allelic differences in FOXA1 homodimerization, reporter gene expression and are annotated as quantitative trait loci. This includes the rs541455835 variant at the MAPT locus encoding the Tau protein associated with Parkinson's disease. Collectively, the DIV guides chromatin engagement and regulation by FOXA1 and its perturbation could be linked to disease etiologies.
Motivation: Computational prediction of transcription factor (TF) binding sites in the genome remains a challenging task. Here, we present Romulus, a novel computational method for identifying individual TF binding sites from genome sequence information and cell-type–specific experimental data, such as DNase-seq. It combines the strengths of previous approaches, and improves robustness by reducing the number of free parameters in the model by an order of magnitude.Results: We show that Romulus significantly outperforms existing methods across three sources of DNase-seq data, by assessing the performance of these tools against ChIP-seq profiles. The difference was particularly significant when applied to binding site prediction for low-information-content motifs. Our method is capable of inferring multiple binding modes for a single TF, which differ in their DNase I cut profile. Finally, using the model learned by Romulus and ChIP-seq data, we introduce Binding in Closed Chromatin (BCC) as a quantitative measure of TF pioneer factor activity. Uniquely, our measure quantifies a defining feature of pioneer factors, namely their ability to bind closed chromatin.Availability and Implementation: Romulus is freely available as an R package at http://github.com/ajank/Romulus.Contact: ajank@mimuw.edu.plSupplementary information: Supplementary data are available at Bioinformatics online.
Algorithms for estimating similarity between two macromolecular sequences are of profound importance for molecular biology. The standard methods utilize so-called primary structure, that is a string of characters denoting the sequence of monomers in hetero-polymer. These methods find the substrings of maximal similarity, as defined by the so-called similarity matrix, for a pair of two molecules. The problem is solved either by the exact dynamic programming method, or by approximate heuristic methods.The approximate algorithms are almost two orders of magnitude faster in comparison with the standard version of the exact Smith-Waterman algorithm, when executed on the same hardware, hence the exact algorithm is relatively rarely used. Recently a very efficient implementation of Smith-Waterman algorithm utilizing SIMD extensions to the standard instruction set reduced the speed Implementation presented here achieves execution speed of approximately 9 GCUPS. The performance is independent on the scoring system. It is 4 to 10 times faster than best Smith-Waterman implementation running on a PC and 1.5 to 3 times faster than the same implementation running on Sony PlayStation 3. It is also 5 times faster than the recent implementation of the Smith-Waterman utilizing Nvidia GPU.Our implementation running on Sony PlayStation 3 has performance which is directly comparable with that of BLAST running on PC, being up to 4 times faster in the best case and no more than two times slower in the worst case. This performance level opens possibility for using the exact Smith-Waterman algorithm in applications, where currently approximate algorithms are used.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.