Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
Microbial community profiling using 16S rRNA gene sequences requires accurate taxonomy assignments. 'Universal' primers target conserved sequences and amplify sequences from many taxa, but they provide variable coverage of different environments, and regions of the rRNA gene differ in taxonomic informativeness, especially when high-throughput short-read sequencing technologies (for example, 454 and Illumina) are used. We introduce a new evaluation procedure that provides an improved measure of expected taxonomic precision when classifying environmental sequence reads from a given primer. Applying this measure to thousands of combinations of primers and read lengths, simulating single-ended and paired-end sequencing, reveals that these choices greatly affect taxonomic informativeness. The most informative sequence region may differ by environment, partly due to variable coverage of different environments in reference databases. Using our Rtax method of classifying paired-end reads, we found that paired-end sequencing provides substantial benefit in some environments including human gut, but not in others. Optimal primer choice for short reads totaling 96 nt provides 82–100% of the confident genus classifications available from longer reads.
Most human genes exhibit alternative splicing, but not all alternatively spliced transcripts produce functional proteins. Computational and experimental results indicate that a substantial fraction of alternative splicing events in humans result in mRNA isoforms that harbor a premature termination codon (PTC). These transcripts are predicted to be degraded by the nonsense-mediated mRNA decay (NMD) pathway. One explanation for the abundance of PTC-containing isoforms is that they represent splicing errors that are identified and degraded by the NMD pathway. Another potential explanation for this startling observation is that cells may link alternative splicing and NMD to regulate the abundance of mRNA transcripts. This mechanism, which we call "Regulated Unproductive Splicing and Translation" (RUST), has been experimentally shown to regulate expression of a wide variety of genes in many organisms from yeast to human. It is frequently employed for autoregulation of proteins that affect the splicing process itself. Thus, alternative splicing and NMD act together to play an important role in regulating gene expression.
Background: The gut microbiome is altered in Crohn’s disease. Although individual taxa have been correlated with post-operative clinical course, global trends in microbial diversity have not been described in this context. Methods: We collected mucosal biopsies from the terminal ileum and ascending colon during surgery and post-operative colonoscopy in 6 Crohn’s patients undergoing ileocolic resection (and 40 additional Crohn’s and healthy control patients undergoing either surgery or colonoscopy). Using next-generation sequencing technology, we profiled the gut microbiota in order to identify changes associated with remission or recurrence of inflammation. Results: We performed 16S ribosomal profiling using 101 base-pair single-end sequencing on the Illumina GAIIx platform with deep coverage, at an average depth of 1.3 million high-quality reads per sample. At the time of surgery, Crohn’s patients who would remain in remission were more similar to controls and more species-rich than Crohn’s patients with subsequent recurrence. Patients remaining in remission also exhibited greater stability of the microbiota through time. Conclusions: These observations permitted an association of gut microbial profiles with probability of recurrence in this limited single-center study. These results suggest that profiling the gut microbiota may be useful in guiding treatment of Crohn’s patients undergoing surgery.
Errors in scientific results due to software bugs are not limited to a few high-profile cases that lead to retractions and are widely reported. Here I estimate that in fact most scientific results are probably wrong if data have passed through a computer, and that these errors may remain largely undetected. The opportunities for both subtle and profound errors in software and data management are boundless, yet they remain surprisingly underappreciated.
We present a framework for specifying, training, evaluating, and deploying machine learning models. Our focus is on simplifying cutting edge machine learning for practitioners in order to bring such technologies into production. Recognizing the fast evolution of the field of deep learning, we make no attempt to capture the design space of all possible model architectures in a domain-specific language (DSL) or similar configuration language. We allow users to write code to define their models, but provide abstractions that guide developers to write models in ways conducive to productionization. We also provide a unifying Estimator interface, making it possible to write downstream infrastructure (e.g. distributed training, hyperparameter tuning) independent of the model implementation. We balance the competing demands for flexibility and simplicity by offering APIs at different levels of abstraction, making common model architectures available out of the box, while providing a library of utilities designed to speed up experimentation with model architectures. To make out of the box models flexible and usable across a wide range of problems, these canned Estimators are parameterized not only over traditional hyperparameters, but also using feature columns, a declarative specification describing how to interpret input data. We discuss our experience in using this framework in research and production environments, and show the impact on code health, maintainability, and development speed.
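The idea of a unifying interface with declarative feature columns can be illustrated with a small, self-contained sketch. This is not the framework's actual API: the names `NumericColumn`, `Estimator`, `LinearEstimator`, and `pick_best` are hypothetical, and the tiny gradient-descent model merely stands in for a "canned" model. The point it tries to show is that a downstream utility (here, a naive hyperparameter search) can be written against the interface alone.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Sequence

Example = Dict[str, float]  # one input record, keyed by feature name


@dataclass(frozen=True)
class NumericColumn:
    """Hypothetical declarative description of one numeric input feature."""
    key: str
    default: float = 0.0

    def transform(self, example: Example) -> float:
        # Missing keys fall back to the declared default.
        return float(example.get(self.key, self.default))


class Estimator(Protocol):
    """Hypothetical minimal interface that downstream code relies on."""
    def train(self, examples: Sequence[Example], labels: Sequence[float]) -> None: ...
    def evaluate(self, examples: Sequence[Example], labels: Sequence[float]) -> Dict[str, float]: ...
    def predict(self, examples: Sequence[Example]) -> List[float]: ...


class LinearEstimator:
    """Toy 'canned' model: linear regression fit by batch gradient descent."""

    def __init__(self, feature_columns, learning_rate=0.1, steps=1000):
        self.feature_columns = list(feature_columns)
        self.learning_rate = learning_rate
        self.steps = steps
        self.weights = [0.0] * len(self.feature_columns)
        self.bias = 0.0

    def _featurize(self, example):
        return [col.transform(example) for col in self.feature_columns]

    def _score(self, row):
        return sum(w * x for w, x in zip(self.weights, row)) + self.bias

    def train(self, examples, labels):
        rows = [self._featurize(e) for e in examples]
        n = len(rows)
        for _ in range(self.steps):
            # Errors are computed once per step, then all parameters are updated.
            errors = [self._score(row) - y for row, y in zip(rows, labels)]
            for j in range(len(self.weights)):
                grad = sum(err * row[j] for err, row in zip(errors, rows)) / n
                self.weights[j] -= self.learning_rate * grad
            self.bias -= self.learning_rate * sum(errors) / n

    def predict(self, examples):
        return [self._score(self._featurize(e)) for e in examples]

    def evaluate(self, examples, labels):
        preds = self.predict(examples)
        mse = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)
        return {"mse": mse}


def pick_best(candidates: Sequence[Estimator], train_x, train_y, eval_x, eval_y):
    """Downstream utility that uses only the interface, not the model internals."""
    for est in candidates:
        est.train(train_x, train_y)
    return min(candidates, key=lambda est: est.evaluate(eval_x, eval_y)["mse"])


# Usage: the same columns parameterize every candidate; only hyperparameters vary.
columns = [NumericColumn("x")]
xs = [{"x": v} for v in (0.0, 0.25, 0.5, 0.75, 1.0)]
ys = [2.0 * e["x"] + 1.0 for e in xs]
best = pick_best(
    [LinearEstimator(columns, learning_rate=lr, steps=2000) for lr in (0.001, 0.5)],
    xs, ys, xs, ys,
)
```

Because `pick_best` touches only `train` and `evaluate`, any other class that satisfies the same three methods could be swapped in without changing it, which is the kind of decoupling the abstract describes.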