Transcriptional enhancers are critical for development and phenotype evolution and are often mutated in disease contexts; however, even in well-studied cell types, the sequence code conferring enhancer activity remains unknown. To examine the enhancer regulatory code for pluripotent stem cells, we identified genomic regions with conserved binding of multiple transcription factors in mouse and human embryonic stem cells (ESCs). Examination of these regions revealed that they contain on average 12.6 conserved transcription factor binding site (TFBS) sequences. Enriched TFBSs are a diverse repertoire of 70 different sequences representing the binding sequences of both known and novel ESC regulators. Using a diverse set of TFBSs from this repertoire was sufficient to construct short synthetic enhancers with activity comparable to native enhancers. Site-directed mutagenesis of conserved TFBSs in endogenous enhancers or TFBS deletion from synthetic sequences revealed a requirement for 10 or more different TFBSs. Furthermore, specific TFBSs, including the POU5F1:SOX2 comotif, are dispensable, despite cobinding the POU5F1 (also known as OCT4), SOX2, and NANOG master regulators of pluripotency. These findings reveal that a TFBS sequence diversity threshold overrides the need for optimized regulatory grammar and individual TFBSs that recruit specific master regulators.
Abstract. Content-based medical image retrieval can support the diagnostic decisions of clinical experts. Examining similar images may give the expert clues that resolve uncertainties in the final diagnosis. Beyond conventional feature descriptors, several kinds of binary features have recently been proposed to encode image content. One recent proposal is "Radon barcodes", which use binarized Radon projections to tag/annotate medical images with content-based binary vectors, called barcodes. In this paper, MinMax Radon barcodes are introduced, which are superior to the "local thresholding" scheme suggested in the literature. Using the IRMA dataset with 14,410 x-ray images from 193 different classes, the advantage of MinMax Radon barcodes over thresholded Radon barcodes is demonstrated. The retrieval error for direct search drops by more than 15%. In addition, SURF, a well-established non-binary approach, and BRISK, a recent binary method, are examined to compare their results with MinMax Radon barcodes when retrieving images from the IRMA dataset. The results demonstrate that MinMax Radon barcodes are faster and more accurate when applied to IRMA images.
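The thresholded baseline mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not specify the MinMax encoding, so the sketch covers only binarizing projections with a per-projection threshold and searching by Hamming distance. Several choices are assumptions made for illustration: only two projection angles (row and column sums) stand in for a full Radon transform, the median is used as the local threshold, and all function names are hypothetical.

```python
from statistics import median


def projections(image):
    """Return the 0-degree and 90-degree projections (row sums, column sums).

    Assumption: two axis-aligned projections stand in for a full Radon transform.
    """
    row_sums = [sum(row) for row in image]
    col_sums = [sum(col) for col in zip(*image)]
    return [row_sums, col_sums]


def threshold_barcode(image):
    """Binarize each projection with its own threshold and concatenate the bits.

    Assumption: the per-projection threshold is the median of that projection.
    """
    bits = []
    for proj in projections(image):
        t = median(proj)
        bits.extend(1 if value > t else 0 for value in proj)
    return bits


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest(query_barcode, database):
    """Direct search: return the key of the barcode closest to the query."""
    return min(database, key=lambda name: hamming(query_barcode, database[name]))


# Toy 3x3 "images": a vertical bar and a horizontal bar.
vertical = [[0, 0, 9], [0, 0, 9], [0, 0, 9]]
horizontal = [[0, 0, 0], [0, 0, 0], [9, 9, 9]]
database = {
    "vertical": threshold_barcode(vertical),
    "horizontal": threshold_barcode(horizontal),
}

# A slightly perturbed vertical bar used as the query image.
query = [[0, 0, 8], [0, 0, 9], [0, 0, 9]]
best_match = nearest(threshold_barcode(query), database)
```

Because the barcodes are short binary vectors, the search reduces to counting differing bits, which is the kind of inexpensive comparison that makes barcode-based retrieval attractive.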
Transcriptional enhancers are critical for development and phenotype evolution and are often mutated in disease contexts; however, even in well-studied cell types, the sequence code conferring enhancer activity remains unknown. We found that genomic regions with conserved binding of multiple transcription factors (TFs) in mouse and human embryonic stem cells (ESCs) are enriched in a diverse repertoire of transcription factor binding sites (TFBS), including those of known and novel ESC regulators. Remarkably, using a diverse set of TFBS from this repertoire was sufficient to construct short synthetic enhancers with activity comparable to native enhancers. Site-directed mutagenesis of conserved TFBS in endogenous enhancers or TFBS deletion from synthetic sequences revealed a requirement for >10 different TFBS for robust activity. Specific TFBS, including the OCT4:SOX2 co-motif, are dispensable, despite co-binding the OCT4, SOX2 and NANOG master regulators of pluripotency. These findings reveal that a TFBS diversity threshold overrides the need for optimized regulatory grammar and individual TFBS that bind specific master regulators.
Highlights: Comparative epigenomics determines the enhancer sequence code and synthetic enhancer design. A diversity of >10 TFBS is required and sufficient for robust enhancer activity. >40 TFBS contribute to enhancer activity, implicating new TFs in pluripotency maintenance. Increased TFBS diversity, above a threshold, overrides the need for regulatory grammar.
Motivation: The 3D genome architecture influences the regulation of genes by facilitating chromatin interactions between distal cis-regulatory elements and gene promoters. We implement Cross Cell-type Correlation based on DNA accessibility (C3D), a highly customizable computational tool that predicts chromatin interactions using an unsupervised algorithm that utilizes correlations in chromatin measurements, such as DNaseI hypersensitivity signals. Results: C3D accurately predicts 32.7%, 18.3% and 24.1% of interactions, validated by ChIA-PET assays, between promoters and distal regions that overlie DNaseI hypersensitive sites in K562, MCF-7 and GM12878 cells, respectively.
Availability: Source code is open-source and freely available on GitHub (https://github.com/Lu-pienLabOrganization/C3D) under the GNU GPLv3 license. C3D is implemented in Bash and R; it runs on any platform with Bash (≥4.0), R (≥3.1.1) and BEDTools (≥2.19.0). It requires the following R packages: GenomicRanges, Sushi, data.table, preprocessCore and dynamicTreeCut.
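The cross cell-type correlation idea behind C3D can be illustrated with a minimal sketch. It is written in Python for illustration only and is not taken from the C3D code base, which is implemented in Bash and R. The function names, the toy signal values and the 0.7 correlation cutoff are all assumptions: a promoter and a distal site are called "interacting" when their accessibility signals rise and fall together across cell types.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length signal vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant signal carries no correlation information
    return sxy / sqrt(sxx * syy)


def predict_interactions(anchor_signal, distal_signals, min_r=0.7):
    """Return distal sites whose signal correlates with the anchor across cell types.

    anchor_signal: accessibility of one promoter, one value per cell type.
    distal_signals: dict mapping a site name to its per-cell-type accessibility.
    min_r: hypothetical correlation cutoff chosen for this example.
    """
    hits = {}
    for name, signal in distal_signals.items():
        r = pearson(anchor_signal, signal)
        if r >= min_r:
            hits[name] = r
    return hits


# Toy accessibility values across four cell types (made up for illustration).
promoter = [1.0, 2.0, 3.0, 4.0]
distal = {
    "site_A": [2.0, 4.0, 6.0, 8.0],  # tracks the promoter
    "site_B": [4.0, 3.0, 2.0, 1.0],  # anti-correlated
    "site_C": [5.0, 5.0, 5.0, 5.0],  # constant across cell types
}
hits = predict_interactions(promoter, distal)
```

No labeled interactions are needed to produce these calls, which is what makes the approach unsupervised; validated contacts such as ChIA-PET data serve only for evaluation.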
Motivation
Mammalian genomes can contain thousands of enhancers but only a subset are actively driving gene expression in a given cellular context. Integrated genomic datasets can be harnessed to predict active enhancers. One challenge in integration of large genomic datasets is the increasing heterogeneity: continuous, binary and discrete features may all be relevant. Because the number of training examples is also typically small, semi-supervised approaches for heterogeneous data are needed; however, current enhancer prediction methods are not designed to handle heterogeneous data in the semi-supervised paradigm.
Results
We implemented a Dirichlet Process Heterogeneous Mixture model that infers Gaussian, Bernoulli and Poisson distributions over features. We derived a novel variational inference algorithm to handle semi-supervised learning tasks where certain observations are forced to cluster together. We applied this model to enhancer candidates in mouse heart tissues based on heterogeneous features. We constrained a small number of known active enhancers to appear in the same cluster, and 47 additional regions clustered with them. Many of these are located near heart-specific genes. The model also predicted 1176 active promoters, suggesting that it can discover new enhancers and promoters.
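To make the notion of a heterogeneous mixture concrete, the following minimal sketch scores one observation with mixed feature types under a single cluster. It is not the dphmix API and covers neither the Dirichlet Process prior nor the variational inference. It assumes features are independent given the cluster, and every name and parameter value is hypothetical.

```python
from math import lgamma, log, pi


def gaussian_logpdf(x, mean, var):
    """Log density of a continuous feature under a Gaussian."""
    return -0.5 * (log(2 * pi * var) + (x - mean) ** 2 / var)


def bernoulli_logpmf(x, p):
    """Log probability of a binary feature (x is 0 or 1)."""
    return log(p) if x == 1 else log(1 - p)


def poisson_logpmf(k, rate):
    """Log probability of a count feature under a Poisson."""
    return k * log(rate) - rate - lgamma(k + 1)


LOG_DENSITIES = {
    "gaussian": gaussian_logpdf,
    "bernoulli": bernoulli_logpmf,
    "poisson": poisson_logpmf,
}


def cluster_loglik(observation, feature_params):
    """Sum per-feature log-likelihoods (assumes independence given the cluster).

    feature_params: one tuple per feature, e.g. ("gaussian", mean, var),
    ("bernoulli", p) or ("poisson", rate).
    """
    return sum(
        LOG_DENSITIES[kind](value, *theta)
        for value, (kind, *theta) in zip(observation, feature_params)
    )


def most_likely_cluster(observation, clusters):
    """Return the name of the cluster with the highest log-likelihood."""
    return max(clusters, key=lambda name: cluster_loglik(observation, clusters[name]))


# Hypothetical clusters over (continuous signal, binary mark, read count).
clusters = {
    "active": [("gaussian", 5.0, 1.0), ("bernoulli", 0.9), ("poisson", 10.0)],
    "inactive": [("gaussian", 0.0, 1.0), ("bernoulli", 0.1), ("poisson", 1.0)],
}
candidate = (4.5, 1, 8)
assigned = most_likely_cluster(candidate, clusters)
```

In a semi-supervised setting such as the one described above, known active enhancers would additionally be constrained to share a cluster; this sketch shows only how continuous, binary and count features can contribute to a single score.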
Availability and implementation
We created the ‘dphmix’ Python package: https://pypi.org/project/dphmix/.
Supplementary information
Supplementary data are available at Bioinformatics online.