The database of known protein three-dimensional structures can be significantly increased by the use of sequence homology, based on the following observations. (1) The database of known sequences, currently at more than 12,000 proteins, is two orders of magnitude larger than the database of known structures. (2) The currently most powerful method of predicting protein structures is model building by homology. (3) Structural homology can be inferred from the level of sequence similarity. (4) The threshold of sequence similarity sufficient for structural homology depends strongly on the length of the alignment. Here, we first quantify the relation between sequence similarity, structure similarity, and alignment length by an exhaustive survey of alignments between proteins of known structure and report a homology threshold curve as a function of alignment length. We then produce a database of homology-derived secondary structure of proteins (HSSP) by aligning to each protein of known structure all sequences deemed homologous on the basis of the threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability, and sequence profile. Tertiary structures of the aligned sequences are implied, but not modeled explicitly. The database effectively increases the number of known protein structures by a factor of five to more than 1800. The results may be useful in assessing the structural significance of matches in sequence database searches, in deriving preferences and patterns for structure prediction, in elucidating the structural role of conserved residues, and in modeling three-dimensional detail by homology.
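To make the per-position quantities mentioned above more concrete, the following is a minimal sketch of how a sequence profile and a variability score could be derived from an aligned family. It is an illustration under stated assumptions, not the HSSP procedure: the abstract does not define its variability measure, so Shannon entropy is used here as a stand-in, and the function names (`column_profile`, `column_entropy`, `summarize_alignment`) and the gap symbol `-` are hypothetical.

```python
from collections import Counter
from math import log2


def column_profile(column, gap="-"):
    """Relative frequency of each residue in one alignment column, ignoring gaps."""
    residues = [r for r in column if r != gap]
    if not residues:
        return {}
    counts = Counter(residues)
    return {res: n / len(residues) for res, n in counts.items()}


def column_entropy(profile):
    """Shannon entropy in bits (assumed stand-in for variability); 0.0 means fully conserved."""
    return sum(p * log2(1 / p) for p in profile.values())


def summarize_alignment(alignment):
    """Return a (profile, entropy) pair for every column of equal-length aligned sequences."""
    summary = []
    for column in zip(*alignment):
        profile = column_profile(column)
        summary.append((profile, column_entropy(profile)))
    return summary


if __name__ == "__main__":
    # Toy alignment, invented purely for illustration.
    toy = ["ACD", "ACE", "A-D", "GCE"]
    for position, (profile, entropy) in enumerate(summarize_alignment(toy)):
        print(position, profile, round(entropy, 3))
```

A fully conserved column scores 0.0, and a column split evenly between two residues scores 1.0 bit, so higher values flag more variable positions.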
Despite our rapidly growing knowledge about the human genome, we do not know all of the genes required for some of the most basic functions of life. To start to fill this gap we developed a high-throughput phenotypic screening platform combining potent gene silencing by RNA interference, time-lapse microscopy and computational image processing. We carried out a genome-wide phenotypic profiling of each of the ~21,000 human protein-coding genes by two-day live imaging of fluorescently labelled chromosomes. Phenotypes were scored quantitatively by computational image processing, which allowed us to identify hundreds of human genes involved in diverse biological functions including cell division, migration and survival. As part of the Mitocheck consortium, this study provides an in-depth analysis of cell division phenotypes and makes the entire high-content data set available as a resource to the community.
To target the ~21,000 protein-coding genes in the human genome, we used a chemically synthesized short interfering RNA (siRNA) library designed to uniquely target each gene with 2-3 independent sequences (Supplementary Methods). The siRNAs in this library were tested individually and reduced the messenger RNAs of targeted genes to below 30% of original levels (to an average of 13%) for 97% of more than 1,000 genes tested (Supplementary Table 1). To allow high-throughput phenotyping of each individual siRNA in triplicates by live-cell imaging, we used a previously established workflow for solid-phase transfection using siRNA microarrays coupled to automatic time-lapse microscopy [1]. As a high-content phenotypic assay we chose to monitor fluorescent chromosomes in a human cell line stably expressing core histone 2B tagged with green fluorescent protein (GFP) [1].
After seeding on the siRNA microarrays, on average 67 (±30) cells for each siRNA of the library were imaged in triplicates for 2 days, thus documenting many of their basic functions such as cell division, proliferation, survival and migration.
Image processing reveals mitotic hits
This resulted in a large data set of ~190,000 time-lapse movies providing time-resolved records of over 19 million cell divisions. To automatically score and annotate phenotypes in this large data set, we developed a computational pipeline [2] (Fig. 1) extending previously established methods of morphology recognition by supervised machine learning [3][4][5][6]. In brief, after segmentation, about 200 quantitative features were extracted from each nucleus and used for classification into one of 16 morphological classes (Fig. 1 and Supplementary Movies 1-30) by a support vector machine classifier previously trained on a set of ~3,000 manually annotated nuclei (Supplementary Methods). This classifier automatically recognizes changes in nuclear morphology due to the cell cycle, cell death or other phenotypic changes with an overall accuracy of 87% (Supplementary Fig. 1) and allows us to convert each time-lapse movie into a phenotypic profile that quantifies the response to each siRNA ...
The maintenance of protein function and structure constrains the evolution of amino acid sequences. This fact can be exploited to interpret correlated mutations observed in a sequence family as an indication of probable physical contact in three dimensions. Here we present a simple and general method to analyze correlations in mutational behavior between different positions in a multiple sequence alignment. We then use these correlations to predict contact maps for each of 11 protein families and compare the result with the contacts determined by crystallography. For the most strongly correlated residue pairs predicted to be in contact, the prediction accuracy ranges from 37 to 68% and the improvement ratio relative to a random prediction from 1.4 to 5.1. Predicted contact maps can be used as input for the calculation of protein tertiary structure, either from sequence information alone or in combination with experimental information.
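The idea of scoring correlated mutational behavior between alignment positions can be sketched as follows. This is a simplified illustration, not the published method: it assumes that residue similarity is plain identity (1.0 or 0.0) rather than a substitution matrix, uses an unweighted Pearson correlation, and all names (`pair_similarities`, `pearson`, `correlated_positions`, `min_separation`) are hypothetical. For each column, every pair of sequences gets a similarity value; two columns are considered correlated when those values rise and fall together across sequence pairs.

```python
from itertools import combinations
from math import sqrt


def pair_similarities(column):
    """Identity-based similarity (assumption) for every pair of sequences in one column."""
    return [1.0 if a == b else 0.0 for a, b in combinations(column, 2)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a fully conserved column carries no covariation signal
    return cov / sqrt(var_x * var_y)


def correlated_positions(alignment, min_separation=1):
    """Rank column pairs (i, j) by correlation of their pairwise similarities, highest first."""
    columns = list(zip(*alignment))
    sims = [pair_similarities(col) for col in columns]
    scores = []
    for i, j in combinations(range(len(columns)), 2):
        if j - i >= min_separation:
            scores.append((pearson(sims[i], sims[j]), i, j))
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    # Toy alignment, invented for illustration: columns 0 and 1 mutate together.
    toy = ["AKLE", "AKLE", "GRLD", "GRLE"]
    for score, i, j in correlated_positions(toy):
        print(f"positions {i},{j}: r = {score:.2f}")
```

The highest-ranked pairs would then be taken as candidate contacts and compared against a contact map from a known structure, as the abstract describes.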
The Protein Data Bank (PDB) is the world-wide repository of macromolecular structure information. We present a series of databases that run parallel to the PDB. Each database holds one entry, if possible, for each PDB entry. DSSP holds the secondary structure of the proteins. PDBREPORT holds reports on the structure quality and lists errors. HSSP holds a multiple sequence alignment for all proteins. The PDBFINDER holds easy-to-parse summaries of the PDB file content, augmented with essentials from the other systems. PDB_REDO holds re-refined, and often improved, copies of all structures solved by X-ray. WHY_NOT summarizes why certain files could not be produced. All these systems are updated weekly. The data sets can be used for the analysis of properties of protein structures in areas ranging from structural genomics to cancer biology and protein design.