2019
DOI: 10.1101/674119
Preprint

Critiquing Protein Family Classification Models Using Sufficient Input Subsets

Abstract: In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset introduced. In response, we propose a set of methods for critiquing deep learning models and…


Cited by 8 publications (8 citation statements)
References 44 publications
“…This differs substantially from approaches such as BLASTp, phmmer and HMMER that perform annotation using explicit alignment. We note that simpler models provide useful attribution of model decision making, and we anticipate that similar insights will emerge from work that improves the interpretation and understanding of deep learning models [41][42][43].…”
Section: Discussion (mentioning)
confidence: 88%
“…Neural models are fast to evaluate with a single forward pass. However, they can exhibit pathological behavior when used as optimization objectives, giving high scores to unrealistic sequences [49, 50] or giving outsize influence to irrelevant parts of the sequence [51]. While trained neural models can exhibit high levels of ruggedness [52], it is not straightforward to tune the optimization difficulty of a neural landscape.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…We applied Sufficient Input Subset (SIS) analysis (Carter et al., 2018) to interpret the sequence features the Embedding-Only model has learned to identify MHC ligands. On 10 000 random samples of all MHC ligands of 9 amino acids in our dataset, we performed SIS to locate the minimal subset of residues for the Embedding-Only model to predict a peptide as an MHC ligand with a probability >95% (Section 2).…”
Section: Results (mentioning)
confidence: 99%
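The last statement above describes the SIS procedure (Carter et al., 2018) only in prose. The sketch below illustrates one way such a backward-selection search could recover a minimal sufficient subset of residues for a single 9-mer peptide; the wrapper predict_prob, the mask token "X", the helper find_sis, and the 0.95 threshold are assumptions introduced here for illustration, not the cited authors' implementation.

```python
# Minimal sketch of a Sufficient Input Subset (SIS) search for one peptide,
# following the description in Carter et al. (2018). All names used here
# (predict_prob, MASK, find_sis) are hypothetical placeholders.

MASK = "X"  # assumed placeholder residue used to mask out positions


def predict_prob(sequence: str) -> float:
    """Stand-in for the trained classifier: P(peptide is an MHC ligand)."""
    raise NotImplementedError  # plug in the real model here


def find_sis(sequence: str, threshold: float = 0.95):
    """Return sorted indices of one sufficient input subset, or None if
    even the full sequence does not reach the decision threshold."""
    seq = list(sequence)
    if predict_prob(sequence) < threshold:
        return None

    # Backward selection: repeatedly mask the position whose removal
    # reduces the predicted probability the least.
    work = seq[:]
    remaining = list(range(len(seq)))
    removal_order = []
    while remaining:
        trials = []
        for i in remaining:
            masked = work[:]
            masked[i] = MASK
            trials.append((predict_prob("".join(masked)), i))
        _, cheapest = max(trials)   # position that is cheapest to drop
        work[cheapest] = MASK
        remaining.remove(cheapest)
        removal_order.append(cheapest)

    # Forward pass: restore positions in reverse removal order until the
    # prediction clears the threshold again; those positions form the SIS.
    masked = [MASK] * len(seq)
    sis = []
    for i in reversed(removal_order):
        masked[i] = seq[i]
        sis.append(i)
        if predict_prob("".join(masked)) >= threshold:
            break
    return sorted(sis)
```

In the full procedure of Carter et al. (2018), this search is repeated after masking out each subset already found, so a sequence can yield several disjoint sufficient subsets; the sketch above stops after the first one.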