Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
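The abstract names the Markov cluster (MCL) algorithm but gives no implementation detail. The following is a minimal pure-Python sketch of the generic MCL loop (expansion by matrix squaring, then inflation by entry-wise powering and column renormalisation), not the TRIBE-MCL code itself. The function names, the self-loop weight of 1.0, the inflation value of 2.0, the convergence tolerance and the toy similarity scores are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale each column so it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [row[:] for row in m]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out


def expand(m):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise each entry to a power, then renormalise columns."""
    return normalize_columns([[x ** power for x in row] for row in m])


def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster nodes of a symmetric similarity matrix (list of lists)."""
    n = len(similarity)
    # Assumption: every node gets a self-loop of weight 1.0.
    m = [[1.0 if i == j else similarity[i][j] for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Rows that still carry mass are attractors; the columns they
    # reach form one cluster. Identical clusters are merged by the set.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy input (hypothetical scores): two tight groups of three proteins
# joined by one weak similarity between protein 2 and protein 3.
toy = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
families = mcl(toy)
print(families)
```

In this sketch a higher inflation value would be expected to give finer-grained clusters; a real pipeline would derive the similarity matrix from precomputed sequence comparison scores and use sparse matrices rather than nested lists.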
A large-scale effort to measure, detect and analyse protein-protein interactions using experimental methods is under way. These include biochemistry such as co-immunoprecipitation or crosslinking, molecular biology such as the two-hybrid system or phage display, and genetics such as unlinked noncomplementing mutant detection. Using the two-hybrid system, an international effort to analyse the complete yeast genome is in progress. Evidently, all these approaches are tedious, labour intensive and inaccurate. From a computational perspective, the question is how can we predict that two proteins interact from structure or sequence alone. Here we present a method that identifies gene-fusion events in complete genomes, solely based on sequence comparison. Because there must be selective pressure for certain genes to be fused over the course of evolution, we are able to predict functional associations of proteins. We show that 215 genes or proteins in the complete genomes of Escherichia coli, Haemophilus influenzae and Methanococcus jannaschii are involved in 64 unique fusion events. The approach is general, and can be applied even to genes of unknown function.
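The abstract states only that fusion events are identified from sequence comparison. One way such a screen might be sketched is shown below: two proteins from a query genome are flagged when both match the same longer "composite" protein in a reference genome, in regions that do not overlap, while not being similar to each other. The hit format, the function name `find_fusion_candidates`, the overlap threshold and the example identifiers are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def find_fusion_candidates(hits, similar_pairs=frozenset(), max_overlap=0):
    """Return (composite, protein_a, protein_b) candidate fusion triples.

    hits: iterable of (query_protein, composite_protein, start, end),
          where start/end are 1-based match coordinates on the composite.
    similar_pairs: set of frozenset({a, b}) for query proteins that are
          similar to each other (assumption: such pairs are excluded).
    max_overlap: largest tolerated overlap, in residues, between the two
          matched regions on the composite protein.
    """
    by_composite = {}
    for query, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((query, start, end))

    candidates = set()
    for composite, regions in by_composite.items():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(regions, 2):
            if a == b or frozenset((a, b)) in similar_pairs:
                continue
            # Number of shared positions; zero or negative means disjoint.
            overlap = min(a_end, b_end) - max(a_start, b_start) + 1
            if overlap <= max_overlap:
                candidates.add((composite,) + tuple(sorted((a, b))))
    return sorted(candidates)


# Hypothetical hits: geneA and geneB match disjoint halves of fusedC,
# while geneD overlaps both and is therefore not paired with either.
example_hits = [
    ("geneA", "fusedC", 1, 200),
    ("geneB", "fusedC", 210, 450),
    ("geneD", "fusedC", 150, 300),
    ("geneE", "otherF", 1, 100),
]
fusions = find_fusion_candidates(example_hits)
print(fusions)
```

The pairwise loop is quadratic in the number of hits per composite protein, which is acceptable for a sketch; a genome-scale screen would likely need filtering of promiscuous domains and significance thresholds that are not modelled here.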
Somatic hypermutation (SHM) features in a series of 1967 immunoglobulin heavy chain gene (IGH) rearrangements obtained from patients with chronic lymphocytic leukemia (CLL) were examined and compared with IGH sequences from non-CLL B cells available in public databases. SHM analysis was performed for all 1290 CLL sequences in this cohort with less than 100% identity to germ line. At the cohort level, SHM patterns were typical of a canonical SHM process. However, important differences emerged from the analysis of certain subgroups of CLL sequences defined by: (1) IGHV gene usage, (2) presence of stereotyped heavy chain complementarity-determining region 3 (HCDR3) sequences, and (3) mutational load. Recurrent, "stereotyped" amino acid changes occurred across the entire IGHV region in CLL subsets carrying stereotyped HCDR3 sequences, especially those expressing the IGHV3-21 and IGHV4-34 genes. These mutations are underrepresented among non-CLL sequences and thus can be considered as CLL-biased. Furthermore, it was shown that even a low level of mutations may be functionally relevant, given that stereotyped amino acid changes can be found in subsets of minimally mutated cases.
Introduction
Developing B cells generate a vast repertoire of antibody specificities through somatic recombination of distinct variable (V), diversity (D) (heavy chain only), and joining (J) genes to form the variable domain exons of immunoglobulins (IG) [1]. Unlike heavy chain complementarity-determining regions (HCDR) 1 and 2, which are entirely encoded by the IGHV gene, HCDR3 is created de novo by the VDJ recombination process [1]. The skewing of diversity to the HCDR3 implies that HCDR3 sequences are the principal determinants of specificity, at least in the primary repertoire [2,3]. However, HCDR3 diversity is not enough to realize the full potential of antibody diversity [4]. Furthermore, unconventional antigens, such as B-cell superantigens, may be recognized not via the CDRs but rather via the framework regions (FRs) [5]. Somatic hypermutation (SHM) of IG variable genes forms a second round of diversification after somatic recombination, which increases antibody diversity [6]. SHM has long been thought to occur mainly in the germinal centers (GCs) after antigen stimulation and in a manner dependent on T-cell help [7]. Recent reports, however, suggest that SHM can be T-cell independent and may also occur outside classic GCs [8][9][10][11][12][13].
In recent years, the mutational status of IGHV genes has been established as one of the most important molecular genetic markers in defining prognostic subgroups of chronic lymphocytic leukemia (CLL). CLL patients who carry IGHV genes with 98% identity or more to the closest germ line gene ("unmutated") follow a more aggressive clinical course and have strikingly shorter survival than patients carrying IGHV genes with less than 98% identity to germ line ("mutated") [14,15]. The 98% cutoff was chosen as a shortcut to exclude potential polymorphic variants [16][17][18][19] and has been used by the majority of st...
In the context of future scenarios of progressive accumulation of anthropogenic CO2 in marine surface waters, the present study addresses the effects of long-term hypercapnia on a Mediterranean bivalve, Mytilus galloprovincialis. Sea-water pH was lowered to a value of 7.3 by equilibration with elevated CO2 levels. This is close to the maximum pH drop expected in marine surface waters during atmospheric CO2 accumulation. Intra- and extracellular acid-base parameters as well as changes in metabolic rate and growth were studied under both normocapnia and hypercapnia. Long-term hypercapnia caused a permanent reduction in haemolymph pH. To limit the degree of acidosis, mussels increased haemolymph bicarbonate levels, which are derived mainly from the dissolution of shell CaCO3. Intracellular pH in various tissues was at least partly compensated; no deviation from control values occurred during long-term measurements in whole soft-body tissues. The rate of oxygen consumption fell significantly, indicating a lower metabolic rate. In line with previous reports, a close correlation became evident between the reduction in extracellular pH and the reduction in metabolic rate of mussels during hypercapnia. Analysis of frequency histograms of growth rate revealed that hypercapnia caused a slowing of growth, possibly related to the reduction in metabolic rate and the dissolution of shell CaCO3 as a result of extracellular acidosis. In addition, increased nitrogen excretion by hypercapnic mussels indicates the net degradation of protein, thereby contributing to growth reduction. The results obtained in the present study strongly indicate that a reduction in sea-water pH to 7.3 may be fatal for the mussels. They also confirm previous observations that a reduction in sea-water pH below 7.5 is harmful for shelled molluscs.
The BioCyc database collection is a set of 160 pathway/genome databases (PGDBs) for most eukaryotic and prokaryotic species whose genomes have been completely sequenced to date. Each PGDB in the BioCyc collection describes the genome and predicted metabolic network of a single organism, inferred from the MetaCyc database, which is a reference source on metabolic pathways from multiple organisms. In addition, each bacterial PGDB includes predicted operons for the corresponding species. The BioCyc collection provides a unique resource for computational systems biology, namely global and comparative analyses of genomes and metabolic networks, and a supplement to the BioCyc resource of curated PGDBs. The Omics viewer available through the BioCyc website allows scientists to visualize combinations of gene expression, proteomics and metabolomics data on the metabolic maps of these organisms. This paper discusses the computational methodology by which the BioCyc collection has been expanded, and presents an aggregate analysis of the collection that includes the range of number of pathways present in these organisms, and the most frequently observed pathways. We seek scientists to adopt and curate individual PGDBs within the BioCyc collection. Only by harnessing the expertise of many scientists can we hope to produce biological databases that accurately reflect the depth and breadth of knowledge that the biomedical research community is producing.
Complex cellular processes are modular and are accomplished by the concerted action of functional modules (Ravasz et al., Science 2002;297:1551-1555; Hartwell et al., Nature 1999;402:C47-52). These modules encompass groups of genes or proteins involved in common elementary biological functions. One important and largely unsolved goal of functional genomics is the identification of functional modules from genomewide information, such as transcription profiles or protein interactions. To cope with the ever-increasing volume and complexity of protein interaction data (Bader et al., Nucleic Acids Res 2001;29:242-245; Xenarios et al., Nucleic Acids Res 2002;30:303-305), new automated approaches for pattern discovery in these densely connected interaction networks are required (Ravasz et al., Science 2002;297:1551-1555; Bader and Hogue, Nat Biotechnol 2002;20:991-997; Snel et al., Proc Natl Acad Sci USA 2002;99:5890-5895). In this study, we successfully isolate 1046 functional modules from the known protein interaction network of Saccharomyces cerevisiae involving 8046 individual pair-wise interactions by using an entirely automated and unsupervised graph clustering algorithm. This systems biology approach is able to detect many well-known protein complexes or biological processes, without reference to any additional information. We use an extensive statistical validation procedure to establish the biological significance of the detected modules and explore this complex, hierarchical network of modular interactions from which pathways can be inferred.
Disease-causing point mutations are assumed to act predominantly through subsequent individual changes in the amino acid sequence that impair the normal function of proteins. However, point mutations can have a more dramatic effect by altering the splicing pattern of the gene. Here, we describe an approach to estimate the overall importance of splicing mutations. This approach takes into account the complete set of genes known to be involved in disease and suggests that, contrary to current assumptions, many mutations causing disease may actually be affecting the splicing pattern of the genes.
The life cycle of the parasite Plasmodium falciparum, responsible for the most deadly form of human malaria, requires specialized protein expression for survival in the mammalian host and insect vector. To identify components of processes controlling gene expression during its life cycle, the malarial genome, along with seven crown eukaryote group genomes, was queried with a reference set of transcription-associated proteins (TAPs). Following clustering on the basis of sequence similarity of the TAPs with their homologs, and together with hidden Markov model profile searches, 156 P. falciparum TAPs were identified. This represents about a third of the number of TAPs usually found in the genome of a free-living eukaryote. Furthermore, the P. falciparum genome appears to contain a low number of sequences, which are highly conserved and abundant within the kingdoms of free-living eukaryotes, that contribute to gene-specific transcriptional regulation. However, in comparison with these other eukaryotic genomes, the CCCH-type zinc finger (common in proteins modulating mRNA decay and translation rates) was found to be the most abundant in the P. falciparum genome. This observation, together with the paucity of malarial transcriptional regulators identified, suggests Plasmodium protein levels are primarily determined by posttranscriptional mechanisms.