EVEREST: automatic identification and classification of protein domains in all protein sequences

Portugaly, Elon; Harel, Amir; Linial, Nathan; Linial, Michal

doi:10.1186/1471-2105-7-277

Cited by 32 publications

(28 citation statements)

References 30 publications

“…InterProScan). Domain and family-based resources provide an excellent coverage of the 'known space' using HMMs (12,000 in Pfam [24], 37,000 in EVEREST [25]). Iterative search using PSSM and HMM Profiles are often used for a comprehensive functional inference.…”

Section: Discussion (mentioning)

confidence: 99%

Functional inference by ProtoNet family tree: the uncharacterized proteome of Daphnia pulex

Rappoport

Linial

2013

BMC Bioinformatics

Self Cite

Background: Daphnia pulex (Water flea) is the first fully sequenced crustacean genome. The crustaceans and insects have diverged from a common ancestor. It is a model organism for studying the molecular makeup for coping with the environmental challenges. In the complete proteome, there are 30,550 putative proteins. However, about 10,000 of them have no known homologues. Currently, the UniProtKB reports on 95% of the Daphnia's proteins as putative and uncharacterized proteins.

Results: We have applied ProtoNet, an unsupervised hierarchical protein clustering method that covers about 10 million sequences, for automatic annotation of the Daphnia's proteome. 98.7% (26,625) of the Daphnia full-length proteins were successfully mapped to 13,880 ProtoNet stable clusters, and only 1.3% remained unmapped. We compared the properties of the Daphnia's protein families with those of the mouse and the fruitfly proteomes. Functional annotations were successfully assigned for 86% of the proteins. Most proteins (61%) were mapped to only 2953 clusters that contain Daphnia's duplicated genes. We focused on the functionality of maximally amplified paralogs. Cuticle structure components and a variety of ion channels protein families were associated with a maximal level of gene amplification. We focused on gene amplification as a leading strategy of the Daphnia in coping with environmental toxicity.

Conclusions: Automatic inference is achieved through mapping of sequences to the protein family tree of ProtoNet 6.0. Applying a careful inference protocol resulted in functional assignments for over 86% of the complete proteome. We conclude that the scaffold of ProtoNet can be used as an alignment-free protocol for large-scale annotation task of uncharacterized proteomes.

“…The correct way to solve this problem is to parse all of the sequences and split sequence A into two chains A1 and A2. Such parsing is not trivial (20)(21)(22), and here we deal with the problem in a different way.…”

Section: Discussion (mentioning)

confidence: 99%

“…Preliminary tests of automatic splitting with a 40% sequence identity threshold give a total of 51,765 chains versus the original number of 44,220 chains and has a reassuringly small effect on the results presented here. Correct parsing of chains into domains is difficult (20)(21)(22).…”

Section: Discussion (mentioning)

confidence: 99%

Growth of novel protein structural data

Levitt

2007

Proc. Natl. Acad. Sci. U.S.A.

Contrary to popular assumption, the rate of growth of structural data has slowed, and the Protein Data Bank (PDB) has not been growing exponentially since 1995. Reaching such a dramatic conclusion requires careful measurement of growth of novel structures, which can be achieved by clustering entry sequences, or by using a novel index to down-weight entries with a higher number of sequence neighbors. These measures agree, and growth rates are very similar for entire PDB files, clusters, and weighted chains.

“…To this end we have developed a scoring scheme that enables scoring an evaluated domain family with respect to a reference domain family in the context of a reference system of domain families. A detailed description of the scoring scheme and the results of applying it to EVEREST is given in (10). Briefly, for an evaluated family e, let π(e) be a collection of reference domains given by allowing each domain in the evaluated family to collect those reference domains that significantly intersect with it.…”

Section: Technical Details (mentioning)

confidence: 99%

EVEREST: a collection of evolutionary conserved protein domains

Portugaly

Linial

Linial

2007

Nucleic Acids Research

Protein domains are subunits of proteins that recur throughout the protein world. There are many definitions attempting to capture the essence of a protein domain, and several systems that identify protein domains and classify them into families. EVEREST, recently described in Portugaly et al. (2006) BMC Bioinformatics, 7, 277, is one such system that performs the task automatically, using protein sequence alone. Herein we describe EVEREST release 2.0, consisting of 20 029 families, each defined by one or more HMMs. The current EVEREST database was constructed by scanning UniProt 8.1 and all PDB sequences (total over 3 000 000 sequences) with each of the EVEREST families. EVEREST annotates 64% of all sequences, and covers 59% of all residues. EVEREST is available at . The website provides annotations given by SCOP, CATH, Pfam A and EVEREST. It allows for browsing through the families of each of those sources, graphically visualizing the domain organization of the proteins in the family. The website also provides access to analyses of relationships between domain families, within and across domain definition systems. Users can upload sequences for analysis by the set of EVEREST families. Finally, an advanced search form allows querying for families matching criteria regarding novelty, phylogenetic composition and more.
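
The citation statement quoted above for this paper defines π(e) as the reference domains collected by the domains of an evaluated family e. A minimal Python sketch of that collection step follows. The excerpt does not say what counts as a significant intersection, so the overlap measure (shared residues divided by the length of the shorter domain) and the 0.8 cutoff are assumptions made for illustration only, as are the names `Domain`, `overlap_fraction` and `collect_reference_domains`. The full scoring scheme is described in the reference the excerpt cites as (10) and is not reproduced here.

```python
from typing import Iterable, NamedTuple, Set


class Domain(NamedTuple):
    """A domain as a residue range on one protein (hypothetical representation)."""
    protein: str  # protein identifier
    start: int    # first residue, 1-based, inclusive
    end: int      # last residue, inclusive


def overlap_fraction(a: Domain, b: Domain) -> float:
    """Shared residues divided by the length of the shorter domain.

    This measure is an assumption; the quoted excerpt does not define it.
    """
    if a.protein != b.protein:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a.end - a.start + 1, b.end - b.start + 1)
    return shared / shorter


def collect_reference_domains(
    evaluated_family: Iterable[Domain],
    reference_domains: Iterable[Domain],
    min_fraction: float = 0.8,  # assumed cutoff for "significantly intersect"
) -> Set[Domain]:
    """Return pi(e): every reference domain collected by some domain of e."""
    references = list(reference_domains)
    return {
        ref
        for dom in evaluated_family
        for ref in references
        if overlap_fraction(dom, ref) >= min_fraction
    }


# Toy example: only the reference domain nested inside P1:10-100 is collected.
evaluated = [Domain("P1", 10, 100), Domain("P2", 5, 60)]
reference = [
    Domain("P1", 15, 95),   # lies entirely within the evaluated P1 domain
    Domain("P1", 90, 200),  # overlaps by only 11 residues
    Domain("P2", 50, 150),  # overlaps by only 11 residues
    Domain("P3", 1, 50),    # different protein
]
pi_e = collect_reference_domains(evaluated, reference)
```

Using a set means a reference domain collected by several evaluated domains is counted once, which matches the excerpt's description of π(e) as a collection of reference domains rather than of pairs.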
