2022
DOI: 10.1101/2022.10.11.511548
Preprint

What is hidden in the darkness? Characterization of AlphaFold structural space

Abstract: The recent public release of the latest version of the AlphaFold database has given us access to over 200 million predicted protein structures. We use a "shape-mer" approach, a structural fragmentation method analogous to sequence k-mers, to describe these structures and look for novelties - both in terms of proteins with rare or novel structural composition and possible functional annotation of under-studied proteins. Data and code will be made available at https://github.com/TurtleTools/afdb-shapemer-darkness
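
The abstract describes shape-mers by analogy with sequence k-mers. As a rough illustration of that analogy only, the sketch below counts overlapping k-mers in an amino-acid string. It is not the authors' shape-mer method, which fragments 3D structures rather than sequences; the function name `kmer_counts` and the example sequence are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping window of length k in a sequence.

    Illustrative only: shape-mers apply a comparable fragment-and-count
    idea to structural fragments, which this sketch does not attempt.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n has n - k + 1 overlapping windows of length k;
    # if the sequence is shorter than k, the range is empty.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy sequence, not taken from the paper.
counts = kmer_counts("MKVLMKVA", 3)
```

Describing each protein as a bag of fragment counts like this is what makes compositions comparable across a large collection, which is the sense in which the abstract speaks of "rare or novel structural composition".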

Cited by 3 publications (3 citation statements)
References 24 publications

“…For example, going from 1 to 100 query structures increases run time from 1.49 s to 3.13 s. When searching with multiple structures, most run time is in generating the query structure embeddings. Consequently, the speed benefits of the method arise when searching a structure or structures against the pre-computed embeddings of a huge database such as the AlphaFold database [7, 23]. By artificially copying the SCOPe search set to 10 million or 100 million structures we find that our method takes 2.7 s or 15.7 s to search this set with a single query on CPU.…”
Section: Results (mentioning)
confidence: 99%
“…Our method is similar to another approach that embeds protein structure [21], though our embedding is based on coordinates rather than hydrogen bonds. Embedding protein folds has also been done using residue-level features [22, 23], and GNNs acting on protein structure have been used for function prediction [24]. Other studies have used unsupervised contrastive learning on protein structures and show that the representations are useful for downstream prediction tasks including protein structural similarity [25, 26].…”
Section: Introduction (mentioning)
confidence: 99%
“…Because the models are intentionally topology-agnostic, we were also able to show that LRP can find important atoms from structures that exhibit the Urfold principle of “architectural similarity despite topological variability”: specifically, in this work, the phosphate binding loops in Rossmann-fold proteins and P-Loop NTPases. In the future, we plan to identify and verify more common fragments and ancestral peptides by aligning and clustering ‘important’ regions from the cross-model fragments identified by LRP, while comparing them to known databases of potentially discontinuous fragment libraries (e.g., from shapemers [27], Fuzzle2.0 [28], ancestral peptides [14], themes [15], or TERMs [29]).…”
Section: Discussion (mentioning)
confidence: 99%