Motivation: The increasing throughput of sequencing technologies offers new applications and challenges for computational biology. In many of those applications, sequencing errors need to be corrected. This is particularly important when sequencing reads from an unknown reference such as random DNA barcodes. In this case, error correction can be done by performing a pairwise comparison of all the barcodes, which is a computationally complex problem. Results: Here, we address this challenge and describe an exact algorithm to determine which pairs of sequences lie within a given Levenshtein distance. For error correction or redundancy reduction purposes, matched pairs are then merged into clusters of similar sequences. The efficiency of starcode is attributable to the poucet search, a novel implementation of the Needleman–Wunsch algorithm performed on the nodes of a trie. On the task of matching random barcodes, starcode outperforms sequence clustering algorithms in both speed and precision. Availability and implementation: The C source code is available at http://github.com/gui11aume/starcode. Contact: guillaume.filion@gmail.com
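As a rough illustration of the task described above, the following Python sketch finds all pairs of sequences within a given Levenshtein distance and merges them into clusters. It is a brute-force all-pairs comparison, not the trie-based poucet search, and the merging rule (connected components of matched pairs) is an assumption made for simplicity rather than starcode's own clustering. The names `within_distance`, `cluster` and `tau` are hypothetical.

```python
from itertools import combinations


def within_distance(a, b, tau):
    """Return True if the Levenshtein distance between a and b is <= tau."""
    # A length difference larger than tau already rules the pair out.
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        # Row minima never decrease, so the final distance must exceed tau.
        if min(curr) > tau:
            return False
        prev = curr
    return prev[-1] <= tau


def cluster(seqs, tau):
    """Group sequences linked by chains of matches (assumed merging rule)."""
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(seqs)), 2):
        if within_distance(seqs[i], seqs[j], tau):
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(seqs):
        groups.setdefault(find(i), []).append(s)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    barcodes = ["ACGT", "ACGA", "TTTT", "TTTA", "GGGG"]
    print(cluster(barcodes, tau=1))
```

The all-pairs loop is quadratic in the number of sequences, which is the cost the abstract refers to as computationally complex.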
To investigate the three-dimensional (3D) genome architecture across normal B cell differentiation and in neoplastic cells from different subtypes of chronic lymphocytic leukemia and mantle cell lymphoma patients, here we integrate in situ Hi-C and nine additional omics layers. Beyond conventional active (A) and inactive (B) compartments, we uncover a highly-dynamic intermediate compartment enriched in poised and polycomb-repressed chromatin. During B cell development, 28% of the compartments change, mostly involving a widespread chromatin activation from naive to germinal center B cells and a reversal to the naive state upon further maturation into memory B cells. B cell neoplasms are characterized by both entity and subtype-specific alterations in 3D genome organization, including large chromatin blocks spanning key disease-specific genes. This study indicates that 3D genome interactions are extensively modulated during normal B cell differentiation and that the genome of B cell neoplasias acquires a tumor-specific 3D genome architecture.
The rapid development of Chromosome Conformation Capture (3C)-based techniques, as well as imaging together with bioinformatics analyses, has been fundamental for unveiling that chromosomes are organized into the so-called topologically associating domains or TADs. While TADs appear as nested patterns in the 3C-based interaction matrices, the vast majority of available TAD callers are based on the hypothesis that TADs are individual and unrelated chromatin structures. Here we introduce TADpole, a computational tool designed to identify and analyze the entire hierarchy of TADs in intra-chromosomal interaction matrices. TADpole combines principal component analysis and constrained hierarchical clustering to provide a set of significant hierarchical chromatin levels in a genomic region of interest. TADpole is robust to data resolution, normalization strategy and sequencing depth. Domain borders defined by TADpole are enriched in main architectural proteins (CTCF and cohesin complex subunits) and in the histone mark H3K4me3, while their domain bodies, depending on their activation-state, are enriched in either H3K36me3 or H3K27me3, highlighting that TADpole is able to distinguish functional TAD units. Additionally, we demonstrate that TADpole's hierarchical annotation, together with the new DiffT score, allows for detecting significant topological differences on Capture Hi-C maps between wild-type and genetically engineered mice.
The prediction of protein folding rates is a necessary step towards understanding the principles of protein folding. Due to the increasing amount of experimental data, numerous protein folding models and predictors of protein folding rates have been developed in the last decade. The problem has also attracted the attention of scientists from computational fields, which led to the publication of several machine learning-based models to predict the rate of protein folding. Some of them claim to predict the logarithm of protein folding rate with an accuracy greater than 90%. However, there are reasons to believe that such claims are exaggerated due to large fluctuations and overfitting of the estimates. When we confronted three selected published models with new data, we found a much lower predictive power than reported in the original publications. Overly optimistic predictive powers appear from violations of the basic principles of machine-learning. We highlight common misconceptions in the studies claiming excessive predictive power and propose to use learning curves as a safeguard against those mistakes. As an example, we show that the current amount of experimental data is insufficient to build a linear predictor of logarithms of folding rates based on protein amino acid composition.
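To make the idea of a learning curve concrete, here is a minimal Python sketch that fits a one-feature linear predictor on training sets of increasing size and reports training and test error for each size. The data are synthetic and generated inside the example; they are not folding-rate measurements, and the function names (`fit_line`, `mse`, `learning_curve`, `synthetic_data`) are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx


def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(train, test, sizes):
    """Return (n, train_mse, test_mse) for each training-set size n."""
    (train_x, train_y), (test_x, test_y) = train, test
    curve = []
    for n in sizes:
        model = fit_line(train_x[:n], train_y[:n])
        curve.append((n,
                      mse(model, train_x[:n], train_y[:n]),
                      mse(model, test_x, test_y)))
    return curve


def synthetic_data(rng, n):
    """Synthetic points y = 2x + Gaussian noise (an assumption, for illustration)."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)
    train, test = synthetic_data(rng, 200), synthetic_data(rng, 200)
    for n, train_err, test_err in learning_curve(train, test, [5, 20, 80, 200]):
        print(f"n={n:4d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

A training error that stays well below the test error across the available sizes suggests that the predictor is fitting noise and that more data are needed before its reported accuracy can be trusted.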
Background: One of the most unusual sources of phylogenetically restricted genes is the molecular domestication of transposable elements into a host genome as functional genes. Although these kinds of events are sometimes at the core of key macroevolutionary changes, their origin and organismal function are generally poorly understood. Results: Here, we identify several previously unreported transposable element domestication events in the human and mouse genomes. Among them, we find a remarkable molecular domestication that gave rise to a multigenic family in placental mammals, the Bex/Tceal gene cluster. These genes, which act as hub proteins within diverse signaling pathways, have been associated with neurological features of human patients carrying genomic microdeletions in chromosome X. The Bex/Tceal genes display neural-enriched patterns and are differentially expressed in human neurological disorders, such as autism and schizophrenia. Two different murine alleles of the cluster member Bex3 display morphological and physiopathological brain modifications, such as reduced interneuron number and hippocampal electrophysiological imbalance, alterations that translate into distinct behavioral phenotypes. Conclusions: We provide an in-depth understanding of the emergence of a gene cluster that originated by transposon domestication and gene duplication at the origin of placental mammals, an evolutionary process that transformed a non-functional transposon sequence into novel components of the eutherian genome. These genes were integrated into existing signaling pathways involved in the development, maintenance, and function of the CNS in eutherians. At least one of its members, Bex3, is relevant for higher brain functions in placental mammals and may be involved in human neurological disorders.
Motivation: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is the standard method to investigate chromatin protein composition. As the number of community-available ChIP-seq profiles increases, it becomes more common to use data from different sources, which makes joint analysis challenging. Issues such as lack of reproducibility, heterogeneous quality and conflicts between replicates become evident when comparing datasets, especially when they are produced by different laboratories. Results: Here, we present Zerone, a ChIP-seq discretizer with built-in quality control. Zerone is powered by a Hidden Markov Model with zero-inflated negative multinomial emissions, which allows it to merge several replicates into a single discretized profile. To identify low quality or irreproducible data, we trained a Support Vector Machine and integrated it as part of the discretization process. The result is a classifier reaching 95% accuracy in detecting low quality profiles. We also introduce a graphical representation to compare discretization quality and we show that Zerone achieves outstanding accuracy. Finally, on current hardware, Zerone discretizes a ChIP-seq experiment on mammalian genomes in about 5 min using less than 700 MB of memory. Availability and Implementation: Zerone is available as a command line tool and as an R package. The C source code and R scripts can be downloaded from https://github.com/nanakiksc/zerone. The information to reproduce the benchmark and the figures is stored in a public Docker image that can be downloaded from https://hub.docker.com/r/nanakiksc/zerone/. Contact: guillaume.filion@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.
Despite recent advances, the dynamics of genome architecture and chromatin function during human cell differentiation and its potential reorganization upon neoplastic transformation remain poorly characterized. Here, we integrate in situ Hi-C and nine additional omic layers to define and biologically characterize the dynamic changes in three-dimensional (3D) genome architecture across normal B cell differentiation and in neoplastic cells from different subtypes of chronic lymphocytic leukemia (CLL) and mantle cell lymphoma (MCL) patients. Beyond conventional active (A) and inactive (B) compartments, an integrative analysis of Hi-C data reveals the presence of a highly-dynamic intermediate compartment enriched in poised and polycomb-repressed chromatin. During B cell development, we detect that 28% of the compartments change at defined maturation stages and mostly involve the intermediate compartment. The transition from naive to germinal center B cells is associated with widespread chromatin activation, which mostly reverts into the naive state upon further maturation of germinal center cells into memory B cells. The analysis of CLL and MCL neoplastic cells points both to entity and subtype-specific alterations in chromosome organization. Remarkably, we observe that large chromatin blocks containing key disease-specific genes alter their 3D genome organization. These include the inactivation of a 2 Mb region containing the EBF1 gene in CLL and the activation of a 6.1 Mb region containing the SOX11 gene in clinically aggressive MCL. This study indicates that 3D genome interactions are extensively modulated during normal B cell differentiation and that the genome of B cell neoplasias acquires a tumor-specific 3D genome architecture.