Latent generative landscapes as maps of functional diversity in protein sequence space

Ziegler, Cheyenne; Martin, Jonathan; Sinner, Claude; Morcos, Faruck

doi:10.1038/s41467-023-37958-z

Cited by 14 publications

(16 citation statements)

References 95 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In total, 21 interpretable features were used in the models, including properties derived from protein sequences, structures, networks, and gene mutational constraints (Badonyi and Marsh 2023). Additionally, 20 language model-based embeddings were also included, which are thought to represent protein function in their latent space [41]. As a measure of feature importance, we calculated the loss in AUROC relative to the full model ( Figure 3 ).…”

Section: Resultsmentioning

confidence: 99%

See 1 more Smart Citation

Proteome-scale prediction of molecular mechanisms underlying dominant genetic diseases

Badonyi,

Marsh

2023

Preprint

View full text Add to dashboard Cite

Many dominant genetic disorders result from protein-altering mutations, acting primarily through dominant-negative (DN), gain-of-function (GOF), and loss-of-function (LOF) mechanisms. Deciphering the mechanisms by which dominant diseases exert their effects is often experimentally challenging and resource intensive, but is essential for developing appropriate therapeutic approaches. Diseases that arise via a LOF mechanism are more amenable to be treated by conventional gene therapy, whereas DN and GOF mechanisms may require gene editing or targeting by small molecules. Moreover, pathogenic missense mutations that act via DN and GOF mechanisms are more difficult to identify than those that act via LOF using nearly all currently available variant effect predictors. Here, we introduce a tripartite statistical model made up of support vector machine binary classifiers trained to predict whether human protein-coding genes are likely to be associated with DN, GOF, or LOF molecular disease mechanisms. We test the utility of the predictions by examining biologically and clinically meaningful properties known to be associated with the mechanisms. Our results strongly support that the models are able to generalise on unseen data and offer insight into the functional attributes of proteins associated with different mechanisms. We hope that our predictions will serve as a springboard for researchers studying novel variants and those of uncertain clinical significance, guiding variant interpretation strategies and experimental characterisation. Predictions for the 2023_02 UniProt reference proteome are available athttps://osf.io/z4dcp/.

show abstract

Section: Resultsmentioning

confidence: 99%

“…Additionally, 20 language model-based embeddings were also included, which are thought to represent protein function in their latent space [41]. As a measure of feature importance, we calculated the loss in AUROC relative to the full model (Figure 3).…”

Section: Global and Local Feature Importance Evaluationmentioning

confidence: 99%

Proteome-scale prediction of molecular mechanisms underlying dominant genetic diseases

Badonyi,

Marsh

2023

Preprint

View full text Add to dashboard Cite

show abstract

“…Formally, H can be used to compute the probability of finding a particular sequence in the input protein family. H serves as a proxy for protein fitness − and has been described as a measure of the “typicality” of a protein sequence within its family, with more negative values signifying more family like or typical sequences . Coevolutionary information-based scoring has also been shown to predict specificity between histidine kinases and response regulators, compatibility between DNA recognition and allosteric response modules in LacI-type transcription inhibitors, folding kinetics, and mutational phenotypes in protein–RNA complexes …”

Section: Resultsmentioning

confidence: 99%

Coevolutionary Information Captures Catalytic Functions and Reveals Divergent Roles of Terpene Synthase Interdomain Connections

Nartey,

Koo,

Laurendon

et al. 2024

Biochemistry

Self Cite

View full text Add to dashboard Cite

Inferring the historical and biophysical causes of diversity within protein families is a complex puzzle. A key to unraveling this problem is characterizing the rugged topography of sequence−function adaptive landscapes. Using biochemical data from a 2 9 = 512 combinatorial library of tobacco 5-epiaristolochene synthase (TEAS) mutants engineered to make the native major product of Egyptian henbane premnaspirodiene synthase (HPS) and a complementary 512 mutant HPS library, we address the question of how product specificity is controlled. These data sets reveal that HPS is far more robust and resistant to mutations than TEAS, where most mutants are promiscuous. We also combine experimental data with a sequence Potts Hamiltonian model and direct coupling analysis to quantify mutant fitness. Our results demonstrate that the Hamiltonian captures variation in product outputs across both libraries, clusters native family members based on their substrate specificities, and exposes the divergent catalytic roles of couplings between the catalytic and noncatalytic domains of TEAS versus HPS. Specifically, we found that the role of the interdomain connectivities in specifying product output is more important in TEAS than connectivities within the catalytic domain. Despite being 75% identical, this property is not shared by HPS, where connectivities within the catalytic domain are more important for specificity. By solving the X-ray crystal structure of HPS, we assessed structural bases for their interdomain network differences. Last, we calculate the product profile Shannon entropies of the two libraries, which showcases that site−site connectivities also play divergent roles in catalytic accuracy.

show abstract

“…219 In addition, by exploring the latent manifold underlying the sequence information, we can uncover dependencies that may not be readily apparent in the raw latent space embeddings. 220 Despite its advantages, MSAs also have some drawbacks. First, it can be difficult to create an MSA that contains enough evolutionarily relevant sequences to establish strong patterns at key amino acid positions.…”

Section: Supervised Learning To Predict the Effects Of Mutationsmentioning

confidence: 99%

“…Building on these findings, the geometric structure of a latent space was recently used to guide the design of a haloalkane dehalogenase . In addition, by exploring the latent manifold underlying the sequence information, we can uncover dependencies that may not be readily apparent in the raw latent space embeddings …”

Section: Protein Engineering Tasks Solved By Machine Learningmentioning

confidence: 99%

Machine Learning-Guided Protein Engineering

Kouba,

Kohout,

Haddadi

et al. 2023

ACS Catal.

View full text Add to dashboard Cite

Recent progress in engineering highly promising biocatalysts has increasingly involved machine learning methods. These methods leverage existing experimental and simulation data to aid in the discovery and annotation of promising enzymes, as well as in suggesting beneficial mutations for improving known targets. The field of machine learning for protein engineering is gathering steam, driven by recent success stories and notable progress in other areas. It already encompasses ambitious tasks such as understanding and predicting protein structure and function, catalytic efficiency, enantioselectivity, protein dynamics, stability, solubility, aggregation, and more. Nonetheless, the field is still evolving, with many challenges to overcome and questions to address. In this Perspective, we provide an overview of ongoing trends in this domain, highlight recent case studies, and examine the current limitations of machine learning-based methods. We emphasize the crucial importance of thorough experimental validation of emerging models before their use for rational protein design. We present our opinions on the fundamental problems and outline the potential directions for future research.

show abstract

Latent generative landscapes as maps of functional diversity in protein sequence space

Cited by 14 publications

References 95 publications

Proteome-scale prediction of molecular mechanisms underlying dominant genetic diseases

Proteome-scale prediction of molecular mechanisms underlying dominant genetic diseases

Coevolutionary Information Captures Catalytic Functions and Reveals Divergent Roles of Terpene Synthase Interdomain Connections

Machine Learning-Guided Protein Engineering

Contact Info

Product

Resources

About