Cholangiocarcinoma (CCA) is a hepatobiliary malignancy exhibiting high incidence in countries with endemic liver-fluke infection. We analysed 489 CCAs from 10 countries, combining whole-genome (71 cases), targeted/exome, copy-number, gene expression, and DNA methylation information. Integrative clustering defined four CCA clusters: Fluke-Positive CCAs (Clusters 1/2) are enriched in ERBB2 amplifications and TP53 mutations; conversely, Fluke-Negative CCAs (Clusters 3/4) exhibit high copy-number alterations and PD-1/PD-L2 expression, or epigenetic mutations (IDH1/2, BAP1) and FGFR/PRKA-related gene rearrangements. Whole-genome analysis highlighted FGFR2 3′UTR deletion as a mechanism of FGFR2 upregulation. Integration of non-coding promoter mutations with protein-DNA binding profiles demonstrates pervasive modulation of H3K27me3-associated sites in CCA. Clusters 1 and 4 exhibit distinct DNA hypermethylation patterns targeting either CpG islands or shores; mutation signature and subclonality analysis suggests that these reflect different mutational pathways. Our results exemplify how genetics, epigenetics and environmental carcinogens can interplay across different geographies to generate distinct molecular subtypes of cancer.
Transcription factors (TFs) influence cell fate by interpreting the regulatory DNA within a genome. TFs recognize DNA in a specific manner; the mechanisms underlying this specificity have been identified for many TFs, based on three-dimensional structures of protein-DNA complexes. More recently, structural views have been complemented with data from high-throughput in vitro and in vivo explorations of the DNA binding preferences of many TFs. Together, these approaches have greatly expanded our understanding of TF-DNA interactions. However, the mechanisms by which TFs select in vivo binding sites and alter gene expression remain unclear. Recent work has highlighted the many variables that influence TF-DNA binding, while demonstrating that a biophysical understanding of these many factors will be central to understanding TF function.
DNA sequence is a major determinant of the binding specificity of transcription factors (TFs) for their genomic targets. However, eukaryotic cells often express, at the same time, TFs with highly similar DNA binding motifs but distinct in vivo targets. Currently, it is not well understood how TFs with seemingly identical DNA motifs achieve unique specificities in vivo. Here, we used custom protein binding microarrays to analyze TF specificity for putative binding sites in their genomic sequence context. Using yeast TFs Cbf1 and Tye7 as our case study, we found that binding sites of these bHLH TFs (i.e., E-boxes) are bound differently in vitro and in vivo, depending on their genomic context. Computational analyses suggest that nucleotides outside E-box binding sites contribute to specificity by influencing the 3D structure of DNA binding sites. Thus, local shape of target sites might play a widespread role in achieving regulatory specificity within TF families.
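As a minimal sketch of what examining "putative binding sites in their genomic sequence context" can look like computationally, the snippet below scans a DNA string for the generic E-box pattern CANNTG and returns each site together with its flanking nucleotides. The function name, the flank length, and the toy sequence are assumptions made for illustration; none of them come from the study.

```python
import re

# Generic E-box consensus CANNTG; the lookahead allows overlapping hits.
EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")


def ebox_sites_with_context(seq, flank=5):
    """Return (start, site, left_flank, right_flank) for each E-box in seq.

    Sites too close to either end to have full flanks are skipped.
    """
    seq = seq.upper()
    hits = []
    for match in EBOX.finditer(seq):
        start = match.start()
        end = start + 6  # an E-box is six nucleotides long
        if start < flank or end + flank > len(seq):
            continue
        left = seq[start - flank:start]
        right = seq[end:end + flank]
        hits.append((start, match.group(1), left, right))
    return hits


# Hypothetical toy sequence containing two E-boxes with different flanks.
toy = "ttgacCACGTGaatccgtCAGCTGggatc"
for hit in ebox_sites_with_context(toy, flank=5):
    print(hit)
```

Two sites with the same core hexamer but different flanks would come back as distinct entries, which is the kind of context-dependent comparison the abstract describes.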
The origin recognition complex (ORC) is an essential DNA replication initiation factor conserved in all eukaryotes. In Saccharomyces cerevisiae, ORC binds to specific DNA elements; however, in higher eukaryotes, ORC exhibits little sequence specificity in vitro or in vivo. We investigated the genome-wide distribution of ORC in Drosophila and found that ORC localizes to specific chromosomal locations in the absence of any discernible simple motif. Although no clear sequence motif emerged, we were able to use machine learning approaches to accurately discriminate between ORC-associated sequences and ORC-free sequences based solely on primary sequence. The complex sequence features that define ORC binding sites are highly correlated with nucleosome positioning signals and likely represent a preferred nucleosomal landscape for ORC association. Open chromatin appears to be the underlying feature that is deterministic for ORC binding. ORC-associated sequences are enriched for the histone variant, H3.3, often at transcription start sites, and depleted for bulk nucleosomes. The density of ORC binding along the chromosome is reflected in the time at which a sequence replicates, with early replicating sequences having a high density of ORC binding. Finally, we found a high concordance between sites of ORC binding and cohesin loading, suggesting that, in addition to DNA replication, ORC may be required for the loading of cohesin on DNA in Drosophila.
DNA binding specificities of transcription factors (TFs) are a key component of gene regulatory processes. Underlying mechanisms that explain the highly specific binding of TFs to their genomic target sites are poorly understood. A better understanding of TF−DNA binding requires the ability to quantitatively model TF binding to accessible DNA as its basic step, before additional in vivo components can be considered. Traditionally, these models were built based on nucleotide sequence. Here, we integrated 3D DNA shape information derived with a high-throughput approach into the modeling of TF binding specificities. Using support vector regression, we trained quantitative models of TF binding specificity based on protein binding microarray (PBM) data for 68 mammalian TFs. The evaluation of our models included cross-validation on specific PBM array designs, testing across different PBM array designs, and using PBM-trained models to predict relative binding affinities derived from in vitro selection combined with deep sequencing (SELEX-seq). Our results showed that shape-augmented models compared favorably to sequence-based models. Although both k-mer and DNA shape features can encode interdependencies between nucleotide positions of the binding site, using DNA shape features reduced the dimensionality of the feature space. In addition, analyzing the feature weights of DNA shape-augmented models uncovered TF family-specific structural readout mechanisms that were not revealed by the DNA sequence. As such, this work combines knowledge from structural biology and genomics, and suggests a new path toward understanding TF binding and genome function.
Keywords: protein−DNA recognition | statistical machine learning | support vector regression | protein binding microarray | DNA structure
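To make the dimensionality remark concrete, the sketch below one-hot encodes a binding-site sequence as k-mer features and shows how quickly the feature count grows with k. The encoding scheme and the function names are illustrative assumptions, not the authors' implementation, and DNA shape features are not computed here.

```python
from itertools import product


def kmer_onehot(seq, k):
    """One-hot encode every k-mer position of seq (4**k bits per position)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    features = []
    for pos in range(len(seq) - k + 1):
        block = [0] * len(kmers)
        block[index[seq[pos:pos + k].upper()]] = 1
        features.extend(block)
    return features


def feature_count(length, k):
    """Number of one-hot features for a sequence of the given length."""
    return (4 ** k) * (length - k + 1)


# Hypothetical six-nucleotide site used only to show the growth in size.
site = "CACGTG"
for k in (1, 2, 3):
    vec = kmer_onehot(site, k)
    print(k, len(vec), sum(vec))
```

Vectors of this kind could be passed to a regression model such as support vector regression, the model family named in the abstract. The 4^k growth per position suggests why a compact set of per-position shape descriptors can yield a smaller feature space than higher-order k-mers.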
The human neocortex differs from that of other great apes in several notable regards including altered cell cycle, prolonged corticogenesis, and increased size [1–5]. While these evolutionary changes likely contributed to the origin of distinctively human cognitive faculties, their genetic basis remains almost entirely unknown. Highly conserved non-coding regions showing rapid sequence changes along the human lineage are candidate loci for the development and evolution of uniquely human traits. Several studies have identified human-accelerated enhancers [6–14], but none have linked an expression difference to a specific organismal trait. Here we report the discovery of a human-accelerated regulatory enhancer (HARE5) of FZD8, a receptor of the Wnt pathway implicated in brain development and size [15, 16]. Using transgenic mice, we demonstrate dramatic differences in human and chimpanzee HARE5 activity, with human HARE5 driving early and robust expression at the onset of corticogenesis. Similar to HARE5 activity, FZD8 is expressed in neural progenitors of the developing neocortex [17–19]. Chromosome conformation capture assays reveal HARE5 physically and specifically contacts the core Fzd8 promoter in the mouse embryonic neocortex. To assess the phenotypic consequences of HARE5 activity, we generated transgenic mice in which Fzd8 expression is under control of orthologous enhancers (Pt-HARE5::Fzd8 and Hs-HARE5::Fzd8). In comparison to Pt-HARE5::Fzd8, Hs-HARE5::Fzd8 mice showed marked acceleration of neural progenitor cell cycle and increased brain size. Changes in HARE5 function unique to humans thus alter cell cycle dynamics of a critical population of stem cells during corticogenesis, and may underlie some distinctive anatomical features of the human brain.
Binding of proteins to particular DNA sites across the genome is a primary determinant of specificity in genome maintenance and gene regulation. DNA-binding specificity is encoded at multiple levels, from the detailed biophysical interactions between proteins and DNA, to the assembly of multi-protein complexes. At each level, variation in the mechanisms used to achieve specificity has led to difficulties in constructing and applying simple models of DNA binding. We review the complexities in protein–DNA binding found at multiple levels and discuss how they confound the idea of simple recognition codes. We discuss the impact of new high-throughput technologies for the characterization of protein–DNA binding, and how these technologies are uncovering new complexities in protein–DNA recognition. Finally, we review the concept of multi-protein recognition codes in which new DNA-binding specificities are achieved by the assembly of multi-protein complexes.
Until now, it has been reasonably assumed that specific base-pair recognition is the only mechanism controlling the specificity of transcription factor (TF)−DNA binding. Contrary to this assumption, here we show that nonspecific DNA sequences possessing certain repeat symmetries, when present outside of specific TF binding sites (TFBSs), statistically control TF−DNA binding preferences. We used high-throughput protein−DNA binding assays to measure the binding levels and free energies of binding for several human TFs to tens of thousands of short DNA sequences with varying repeat symmetries. Based on statistical mechanics modeling, we identify a new protein−DNA binding mechanism induced by DNA sequence symmetry in the absence of specific base-pair recognition, and experimentally demonstrate that this mechanism indeed governs protein−DNA binding preferences.
protein−DNA binding is an important biophysical mechanism operating in a living cell (1). This seminal work makes it possible to interpret experiments that measured how transcription factors (TFs) search for their specific target sites flanked by nonconsensus sequence elements (1-10). A specific consensus motif is a short DNA sequence, typically 6-20 base pairs (bp), that possesses an enhanced binding affinity for a particular TF. For example, the sequence CACGTG represents the specific consensus motif for the human protein Max used in this study (Fig. 1). The process of establishing specific, consensus protein−DNA binding requires the formation of precise geometrical fit between the protein and its consensus DNA motif, accompanied by the formation of specific hydrogen and electrostatic contacts at the protein−DNA binding interface (6, 7) (Fig. 1). In addition to binding to their consensus DNA motifs, transcription factors can also bind, albeit with lower affinity, to DNA regions lacking any consensus motifs. The term "nonspecific protein−DNA binding" (6) is typically used to describe these weaker interactions.
Von Hippel and Berg suggested classifying nonspecific protein−DNA binding into two related mechanisms (6). The first mechanism includes protein binding to its mutated specific motifs that retain some residual, reduced specificity. The second mechanism is largely DNA sequence independent, and it involves electrostatic binding modulated by the overall DNA geometry (6). Despite significant experimental progress, molecular mechanisms responsible for these two types of nonspecific binding remain poorly understood, and the free energy of nonspecific protein−DNA binding has not been systematically characterized (11-14). The interplay between consensus and nonconsensus DNA sequence elements emerges as a dominant factor that governs protein−DNA binding preferences. However, this interplay is also poorly understood (15, 16). Until now, it has been reasonably assumed that specific (consensus) base-pair recognition must control the genome-wide specificity of TF−DNA binding. Contrary to this assumption, here we identify a general mechanism for protein−DNA bi...