We develop a method for completing the genetics of natural living systems by which the absence of expected future discoveries can be established. We demonstrate the method using bacteriophage øX174, the first DNA genome to be sequenced. As with many well-studied natural organisms, closely related genome sequences are available: 23 Bullavirinae genomes related to øX174. Using bioinformatic tools, we first identified 315 potential open reading frames (ORFs) within the genome, including the 11 established essential genes and 82 highly conserved ORFs that have no known gene products or assigned functions. Using genome-scale design and synthesis, we made a mutant genome in which all 11 essential genes are simultaneously disrupted, leaving intact only the 82 conserved but cryptic ORFs. The resulting genome is not viable. Cell-free gene expression followed by mass spectrometry revealed only a single peptide expressed from both the cryptic ORF and wild-type genomes, suggesting a potential new gene. A second synthetic genome in which 71 conserved cryptic ORFs were simultaneously disrupted is viable but with ∼50% reduced fitness relative to the wild type. However, rather than finding any new genes, repeated evolutionary adaptation revealed a single point mutation that modulates expression of gene H, a known essential gene, and fully suppresses the fitness defect. Taken together, we conclude that the annotation of currently functional ORFs for the øX174 genome is formally complete. More broadly, we show that sequencing and bioinformatics followed by synthesis-enabled reverse genomics, proteomics, and evolutionary adaptation can definitively establish the sufficiency and completeness of natural genome annotations.
Chromatin architecture, a key regulator of gene expression, can be inferred using chromatin contact data from chromosome conformation capture, or Hi-C. However, classical Hi-C does not preserve multi-way contacts. Here we use long sequencing reads to map genome-wide multi-way contacts and investigate higher order chromatin organization in the human genome. We use hypergraph theory for data representation and analysis, and quantify higher order structures in neonatal fibroblasts, biopsied adult fibroblasts, and B lymphocytes. By integrating multi-way contacts with chromatin accessibility, gene expression, and transcription factor binding, we introduce a data-driven method to identify cell type-specific transcription clusters. We provide transcription factor-mediated functional building blocks for cell identity that serve as a global signature for cell types.
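The hypergraph representation mentioned in this abstract can be illustrated with a minimal sketch. It assumes each long read is reduced to the set of genomic bins it touches (one hyperedge per read); the bin labels, the `reads` data, and the function names are hypothetical and not taken from the paper. The pairwise projection shows what is lost when multi-way contacts are collapsed into Hi-C-style pairs.

```python
from itertools import combinations

# Hypothetical multi-way contacts: each read is the set of genomic bins it touches.
reads = [
    {"chr1:10", "chr1:25", "chr1:40"},
    {"chr1:10", "chr1:25"},
    {"chr1:25", "chr1:40", "chr2:5", "chr2:8"},
]

def incidence_matrix(hyperedges):
    """Rows are bins (nodes), columns are reads (hyperedges); 1 marks membership."""
    nodes = sorted(set().union(*hyperedges))
    matrix = [[1 if n in e else 0 for e in hyperedges] for n in nodes]
    return nodes, matrix

def pairwise_projection(hyperedges):
    """Collapse every hyperedge into all of its pairs, as a pairwise contact map would."""
    counts = {}
    for e in hyperedges:
        for a, b in combinations(sorted(e), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

nodes, H = incidence_matrix(reads)
node_degree = {n: sum(row) for n, row in zip(nodes, H)}  # reads touching each bin
edge_order = [sum(col) for col in zip(*H)]               # bins touched by each read
pairs = pairwise_projection(reads)
```

Here `edge_order` keeps the information that one read joined four bins at once, whereas `pairs` only records that those bins were seen together two at a time.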
Bacteriophage øX174 was the first DNA genome to be sequenced. The genome is well studied by classical methods and is known to encode 11 essential genes. At least 23 closely related Bullavirinae genome sequences are now available. We identified 315 potential open reading frames (ORFs) within the genome via bioinformatic analysis, and a subset of 82 highly conserved ORFs that have no known gene products or functions. Using genome-scale design and synthesis, we made a mutant genome in which all 11 essential genes are simultaneously disrupted, leaving intact only the 82 conserved-but-cryptic ORFs. The resulting genome is not viable, as expected. Cell-free gene expression followed by mass spectrometry revealed only a single peptide expressed from both the cryptic-ORF and wild-type genomes, suggesting a potential new gene. A second synthetic genome in which 71 conserved cryptic ORFs were simultaneously disrupted is viable but with ~50% reduced fitness relative to the wild type. However, rather than finding any new genes, repeated evolutionary adaptation revealed a single point mutation modulating translation of gene H, a known essential gene, that fully suppressed the fitness defect. Taken together, we conclude that the annotation of ORFs for the øX174 genome is formally complete. Sequencing and bioinformatics followed by synthesis-enabled reverse genomics, proteomics, and evolutionary adaptation can definitively establish the sufficiency and completeness of natural genome annotations.
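As a rough illustration of what identifying potential ORFs involves, the sketch below scans both strands and all three reading frames of a sequence for ATG-to-stop stretches. It is not the authors' pipeline: it assumes a linear sequence (øX174 is circular, so a real scan would also need to handle ORFs spanning the origin), considers only ATG start codons, and uses a hypothetical `min_codons` threshold and toy input.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and relative to the scanned strand.
    min_codons counts coding codons, including ATG and excluding the stop.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i  # first in-frame ATG opens a candidate ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # close the ORF and look for the next ATG
    return orfs

toy = "CCATGAAATTTTAGGG"  # hypothetical toy sequence, not phage data
orfs = find_orfs(toy)
```

An ORF that opens but never reaches an in-frame stop is not reported, which is one of the places where a circular-genome scan would behave differently.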
Summary: Every human somatic cell inherits a maternal and a paternal genome, which work together to give rise to cellular phenotypes. However, the allele-specific relationship between gene expression and genome structure through the cell cycle is largely unknown. By integrating haplotype-resolved genome-wide chromosome conformation capture, mature and nascent mRNA, and protein binding data from a B lymphoblastoid cell line, we investigate this relationship both globally and locally. We introduce the maternal and paternal 4D Nucleome, enabling detailed analysis of the mechanisms and dynamics of genome structure and gene function for diploid organisms. Our analyses find significant coordination between allelic expression biases and local genome conformation, and notably absent expression bias in universally essential cell cycle and glycolysis genes. We propose a model in which coordinated biallelic expression reflects prioritized preservation of essential gene sets.
Advancements in DNA sequencing technology have allowed researchers to affordably generate millions of sequence reads from microorganisms in diverse environments. Efficient and robust software tools are needed to assign microbial sequences into taxonomic groups for characterization and comparison of communities.
Highlights
• Structural and functional differences between the maternal and paternal genomes, including similar allelic bias in related subsets of genes.
• Coupling between the dynamics of gene expression and genome architecture for specific alleles illuminates an allele-specific 4D Nucleome.
• A novel allele-specific phasing algorithm for genome architecture and a quantitative framework for integration of gene expression and genome architecture.

Abstract
Millions of genetic variants exist between the paternal and maternal genomes in human cells (1, 2), which result in unequal allelic contributions to gene transcription (3, 4). However, it remains poorly understood how allelic bias affects the interplay between transcription and the 3D organization of the genome. We sought to understand how transcription and genome architecture differ between the maternal and paternal genomes across the cell cycle. We collected and analyzed haplotype-resolved genome-wide data from B lymphocytes (NA12878) in G1, S, and G2/M, using RNA sequencing (RNA-seq), bromouridine sequencing (Bru-seq), and genome-wide chromosome conformation capture (Hi-C). In the past, separation of allele-specific data was done only through heterozygous single nucleotide variations (SNVs), insertions, and deletions (InDels), as these unique variations allowed DNA sequencing reads to be mapped back to their parental origins. In this paper, we introduce a novel method of phasing Hi-C data using reads assigned through SNVs/InDels to predict the parental origin of nearby reads of unknown origin. This method allows for more structural data to be systematically assigned to a parental origin, and therefore reduces the sparsity of the allele-specific Hi-C contact matrices.
By integrating allele-specific RNA-seq, Bru-seq, and Hi-C data through three phases of the cell cycle, along with publicly available protein binding data, we provide a more comprehensive understanding of architectural and transcriptional differences between the two genomes. These analyses reveal specific patterns in allelic bias, including similar bias characteristics in some groups of related genes. The integration of these data enabled construction of an allele-specific 4D Nucleome.
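The phasing idea in this abstract, using reads already assigned through SNVs/InDels to predict the parental origin of nearby unassigned reads, can be sketched as a simple neighborhood vote. This is an illustrative simplification and not the published algorithm: it treats each read as a single genomic position rather than a Hi-C read pair, and the `window` and `min_margin` parameters, the function name, and the example data are all hypothetical.

```python
from collections import Counter

def impute_origin(reads, window=1000, min_margin=2):
    """Assign a parental origin to unphased reads by local majority vote.

    reads: list of (position, origin), origin being "M", "P", or None (unknown).
    An unknown read takes the origin that leads among phased reads within
    `window` bases by at least `min_margin` votes; otherwise it stays None.
    """
    phased = [(p, o) for p, o in reads if o is not None]
    result = []
    for pos, origin in reads:
        if origin is None:
            votes = Counter(o for p, o in phased if abs(p - pos) <= window)
            maternal, paternal = votes["M"], votes["P"]
            if maternal - paternal >= min_margin:
                origin = "M"
            elif paternal - maternal >= min_margin:
                origin = "P"
        result.append((pos, origin))
    return result

# Hypothetical reads: three maternal, two paternal, three of unknown origin.
reads = [
    (100, "M"), (300, "M"), (400, "M"), (500, None),
    (5000, None), (5200, "P"), (5400, "P"),
    (9000, None),
]
imputed = impute_origin(reads)
```

Reads with no phased neighbors, or with evenly split neighbors, are left unassigned, so the gain in contact-matrix density depends on how conservative the margin is.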
Generating needed cell types using cellular reprogramming is a promising strategy for restoring tissue function in injury or disease. A common method for reprogramming is addition of one or more transcription factors that confer a new function or identity. Advancements in transcription factor selection and delivery have culminated in successful grafting of autologous reprogrammed cells, an early demonstration of their clinical utility. Though cellular reprogramming has been successful in a number of settings, identification of appropriate transcription factors for a particular transformation has been challenging. Computational methods enable more sophisticated prediction of relevant transcription factors for reprogramming by leveraging gene expression data of initial and target cell types, and are built on mathematical frameworks ranging from information theory to control theory. This review highlights the utility and impact of these mathematical frameworks in the field of cellular reprogramming. This article is categorized under: Reproductive System Diseases > Genetics/Genomics/Epigenetics Reproductive System Diseases > Stem Cells and Development Reproductive System Diseases > Computational Models