The ability to predict local structural features of a protein from the primary sequence is of paramount importance for unraveling its function in absence of experimental structural information. Two main factors affect the utility of potential prediction tools: their accuracy must enable extraction of reliable structural information on the proteins of interest, and their runtime must be low to keep pace with sequencing data being generated at a constantly increasing speed. Here, we present NetSurfP‐2.0, a novel tool that can predict the most important local structural features with unprecedented accuracy and runtime. NetSurfP‐2.0 is sequence‐based and uses an architecture composed of convolutional and long short‐term memory neural networks trained on solved protein structures. Using a single integrated model, NetSurfP‐2.0 predicts solvent accessibility, secondary structure, structural disorder, and backbone dihedral angles for each residue of the input sequences. We assessed the accuracy of NetSurfP‐2.0 on several independent test datasets and found it to consistently produce state‐of‐the‐art predictions for each of its output features. We observe a correlation of 80% between predictions and experimental data for solvent accessibility, and a precision of 85% on secondary structure 3‐class predictions. In addition to improved accuracy, the processing time has been optimized to allow predicting more than 1000 proteins in less than 2 hours, and complete proteomes in less than 1 day.
We comprehensively assessed the contribution of the Shine-Dalgarno sequence to protein expression and used the data to develop EMOPEC (Empirical Model and Oligos for Protein Expression Changes; http://emopec.biosustain.dtu.dk). EMOPEC is a free tool that makes it possible to modulate the expression level of any Escherichia coli gene by changing only a few bases. Measured protein levels for 91% of our designed sequences were within twofold of the desired target level.
The accurate structural modeling of B- and T-cell receptors is fundamental to gain a detailed insight in the mechanisms underlying immunity and in developing new drugs and therapies. The LYRA (LYmphocyte Receptor Automated modeling) web server (http://www.cbs.dtu.dk/services/LYRA/) implements a complete and automated method for building of B- and T-cell receptor structural models starting from their amino acid sequence alone. The webserver is freely available and easy to use for non-specialists. Upon submission, LYRA automatically generates alignments using ad hoc profiles, predicts the structural class of each hypervariable loop, selects the best templates in an automatic fashion, and provides within minutes a complete 3D model that can be downloaded or inspected online. Experienced users can manually select or exclude template structures according to case specific information. LYRA is based on the canonical structure method, that in the last 30 years has been successfully used to generate antibody models of high accuracy, and in our benchmarks this approach proves to achieve similarly good results on TCR modeling, with a benchmarked average RMSD accuracy of 1.29 and 1.48 Å for B- and T-cell receptors, respectively. To the best of our knowledge, LYRA is the first automated server for the prediction of TCR structure.
Recombineering and multiplex automated genome engineering (MAGE) offer the possibility to rapidly modify multiple genomic or plasmid sites at high efficiencies. This enables efficient creation of genetic variants including both single mutants with specifically targeted modifications as well as combinatorial cell libraries. Manual design of oligonucleotides for these approaches can be tedious, time-consuming, and may not be practical for larger projects targeting many genomic sites. At present, the change from a desired phenotype (e.g. altered expression of a specific protein) to a designed MAGE oligo, which confers the corresponding genetic change, is performed manually. To address these challenges, we have developed the MAGE Oligo Design Tool (MODEST). This web-based tool allows designing of MAGE oligos for (i) tuning translation rates by modifying the ribosomal binding site, (ii) generating translational gene knockouts and (iii) introducing other coding or non-coding mutations, including amino acid substitutions, insertions, deletions and point mutations. The tool automatically designs oligos based on desired genotypic or phenotypic changes defined by the user, which can be used for high efficiency recombineering and MAGE. MODEST is available for free and is open to all users at http://modest.biosustain.dtu.dk.
The ability to predict a protein's local structural features from the primary sequence is of paramount importance for unravelling its function if no solved structures of the protein or its homologs are available. Here we present NetSurfP-2.0 (http://services.bioinformatics.dtu.dk/service.php?NetSurfP-2.0), an updated and extended version of the tool that can predict the most important local structural features with unprecedented accuracy and run-time. NetSurfP-2.0 is sequencebased and uses an architecture composed of convolutional and long short-term memory neural networks trained on solved protein structures. Using a single integrated model, NetSurfP-2.0 predicts solvent accessibility, secondary structure, structural disorder, interface residues and backbone dihedral angles for each residue of the input sequences. We assessed the accuracy of NetSurfP-2.0 on several independent validation datasets and found it to consistently produce state-of-the-art predictions for each of its output features. In addition to improved prediction accuracy the processing time has been optimized to allow predicting more than 1,000 proteins in less than 2 hours, and complete proteomes in less than 1 day. INTRODUCTIONThe Anfinsen experiment, showing that the structural characteristics of a protein are encoded in its primary sequence alone, is more than 50 years old (1). As a practical application of it, several methods have been developed over the last decades to predict from sequence only several protein structural features, including solvent accessibility, secondary structure, backbone geometry, disorder, and interface residues (2-7). These tools have tremendously impacted biology and chemistry, and some are among the most cited works in the field. NetSurfP-1.0 (8) is a tool published in 2009 for prediction of solvent accessibility and secondary structure using a feed-forward neural network architecture. Since then, deep learning techniques have affected the application of machine learning in biology expanding the ability of prediction tools to produce more accurate results on complex datasets (9-16).
Identifying the specific human leukocyte antigen (HLA) allele combination of an individual is crucial in organ donation, risk assessment of autoimmune and infectious diseases and cancer immunotherapy. However, due to the high genetic polymorphism in this region, HLA typing requires specialized methods. We investigated the performance of five next-generation sequencing (NGS) based HLA typing tools with a non-restricted license namely HLA*LA, Optitype, HISAT-genotype, Kourami and STC-Seq. This evaluation was done for the five HLA loci, HLA-A, -B, -C, -DRB1 and -DQB1 using whole-exome sequencing (WES) samples from 829 individuals. The robustness of the tools to lower depth of coverage (DOC) was evaluated by subsampling and HLA typing 230 WES samples at DOC ranging from 1X to 100X. The HLA typing accuracy was measured across four typing resolutions. Among these, we present two clinically-relevant typing resolutions (P group and pseudo-sequence), which specifically focus on the peptide binding region. On average, across the five HLA loci examined, HLA*LA was found to have the highest typing accuracy. For the individual loci, HLA-A, -B and -C, Optitype’s typing accuracy was the highest and HLA*LA had the highest typing accuracy for HLA-DRB1 and -DQB1. The tools’ robustness to lower DOC data varied widely and further depended on the specific HLA locus. For all Class I loci, Optitype had a typing accuracy above 95% (according to the modification of the amino acids in the functionally relevant portion of the HLA molecule) at 50X, but increasing the DOC beyond even 100X could still improve the typing accuracy of HISAT-genotype, Kourami, and STC-seq across all five HLA loci as well as HLA*LA’s typing accuracy for HLA-DQB1. HLA typing is also used in studies of ancient DNA (aDNA), which is often based on sequencing data with lower quality and DOC. Interestingly, we found that Optitype’s typing accuracy is not notably impaired by short read length or by DNA damage, which is typical of aDNA, as long as the DOC is sufficiently high.
The human leukocyte antigen (HLA) system is a group of genes coding for proteins that are central to the adaptive immune system and identifying the specific HLA allele combination of a patient is relevant in organ donation, risk assessment of autoimmune and infectious diseases and cancer immunotherapy. However, due to the high genetic polymorphism in this region, HLA typing requires specialized methods. We investigated the performance of five next-generation-sequencing (NGS) based HLA typing tools with a non-restricted license namely HLA*LA, Optitype, HISAT-genotype, Kourami and STC-Seq. This evaluation was done for the five HLA loci, HLA-A, -B, -C, -DRB1 and -DQB1 using whole-exome sequencing (WES) samples from 829 individuals. The robustness of the tools to lower coverage was evaluated by subsampling and HLA typing 230 WES samples at coverages ranging from 1X to 100X. The typing accuracy was measured across four typing resolutions. Among these, we present two clinically-relevant typing resolutions, which specifically focus on the peptide binding region. On average, across the five HLA genes, HLA*LA was found to have the highest typing accuracy. For the individual genes, HLA-A, -B and -C, Optitype's typing accuracy was highest and HLA*LA had the highest typing accuracy for HLA-DRB1 and -DQB1. The tools' robustness to lower coverage data varied widely and further depended on the specific HLA locus. For all class I loci, Optitype had a typing accuracy above 95% (according to the modification of the amino acids in the functionally relevant portion of the protein) at 50X, but increasing the depth of coverage beyond even 100X could still improve the typing accuracy of HISAT-genotype, Kourami, and STC-seq across all five HLA genes as well as HLA*LA's typing accuracy for HLA-DQB1. HLA typing is also used in studies of ancient DNA (aDNA), which often is based on lower quality sequencing data. Interestingly, we found that Optitype's typing accuracy is not notably impaired by short read length or by DNA damage, which is typical of aDNA, as long as the depth of coverage is sufficiently high.
Flow-seq combines flexible genome engineering methods with flow cytometry-based cell sorting and deep DNA sequencing to enable comprehensive interrogation of genotype to phenotype relationships. One application is to study the effect of specific regulatory elements on protein expression. Constructing targeted genomic variation around genomically integrated fluorescent marker genes enables rapid elucidation of the contribution of specific sequence variants to protein expression. Such an approach can be used to characterize the impact of modifications to the Shine-Dalgarno sequence in Escherichia coli.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.