Clara Fannjiang scite author profile

Motivation: Inferring properties of biological sequences--such as determining the species-of-origin of a DNA sequence or the function of an amino-acid sequence--is a core task in many bioinformatics applications. These tasks are often solved using string-matching to map query sequences to labeled database sequences or via Hidden Markov Model-like pattern matching. In the current work we describe and assess a deep learning approach that trains a deep neural network (DNN) to predict database-derived labels directly from query sequences. Results: We demonstrate that this DNN performs at or above state-of-the-art levels on a difficult, practically important problem: predicting species-of-origin from short reads of 16S ribosomal DNA. When trained on 16S sequences of over 13,000 distinct species, our DNN achieves read-level species classification accuracy within 2.0% of perfect memorization of training data, and produces more accurate genus-level assignments for reads from held-out species than k-mer, alignment, and taxonomic binning baselines. Moreover, our models exhibit greater robustness than these existing approaches to increasing noise in the query sequences. Finally, we show that these DNNs perform well on an experimental 16S mock community dataset. Overall, our results constitute a first step towards our long-term goal of developing a general-purpose deep learning approach to predicting meaningful labels from short biological sequences. Availability: TensorFlow training code is available through GitHub (https://github.com/tensorflow/models/tree/master/research). Data in TensorFlow TFRecord format is available on Google Cloud Storage (gs://brain-genomics-public/research/seq2species/).

Optimal trade-off control in machine learning-based library design, with application to adeno-associated virus (AAV) for gene therapy

Zhu, Brookes, Busia, et al. 2021

Preprint

AAVs hold tremendous promise as delivery vectors for clinical gene therapy. Yet the ability to design libraries comprising novel and diverse AAV capsids, while retaining the ability of the library to package DNA payloads, has remained challenging. Deep sequencing technologies allow millions of sequences to be assayed in parallel, enabling large-scale probing of fitness landscapes. Such data can be used to train supervised machine learning (ML) models that predict viral properties from sequence, without mechanistic knowledge. Herein, we leverage such models to rationally trade off library diversity against packaging capability. In particular, we show a proof-of-principle application of a general approach for ML-guided library design that allows the experimenter to rationally navigate the trade-off between sequence diversity and fitness of the library. Consequently, this approach, instantiated with an AAV capsid library designed for packaging, enables the selection of starting libraries that are more likely to yield success in downstream selections for therapeutics and beyond. We demonstrated this increased success by showing that the designed libraries are able to more easily infect primary human brain tissue. We expect that such ML-guided design of AAV libraries will have broad utility for the development of novel variants for therapeutic applications in the near future.

One Sentence Summary: Computational, data-driven re-design of a state-of-the-art therapeutically relevant AAV initial library improves downstream selection for therapeutic uses.

Combining evolutionary and assay-labelled data for protein fitness prediction

Hsu, Nisonoff, Fannjiang 2021

Preprint

Predictive modelling of protein properties has become increasingly important to the field of machine-learning guided protein engineering. In one of the two existing approaches, evolutionarily-related sequences to a query protein drive the modelling process, without any property measurements from the laboratory. In the other, a set of protein variants of interest are assayed, and then a supervised regression model is estimated with the assay-labelled data. Although a handful of recent methods have shown promise in combining the evolutionary and supervised approaches, this hybrid problem has not been examined in depth, leaving it unclear how practitioners should proceed, and how method developers should build on existing work. Herein, we present a systematic assessment of methods for protein fitness prediction when evolutionary and assay-labelled data are available. We find that a simple baseline approach we introduce is competitive with and often outperforms more sophisticated methods. Moreover, our simple baseline is plug-and-play with a wide variety of established methods, and does not add any substantial computational burden. Our analysis highlights the importance of systematic evaluations and sufficient baselines.

Mobile robotic platforms for the acoustic tracking of deep-sea demersal fishery resources

et al. 2020

Knowing the displacement capacity and mobility patterns of industrially exploited (i.e., fished) marine resources is key to establishing effective conservation management strategies in human-impacted marine ecosystems. Acquiring accurate behavioral information on deep-sea fished ecosystems is necessary to establish the sizes of marine protected areas within the framework of large international societal programs (e.g., European Community H2020, as part of the Blue Growth economic strategy). However, such information is currently scarce, and high-frequency and prolonged data collection is rarely available. Here, we report the implementation of autonomous underwater vehicles and remotely operated vehicles as an aid for acoustic long-baseline localization systems for autonomous tracking of Norway lobster (Nephrops norvegicus), one of the key living resources exploited in European waters. In combination with seafloor moored acoustic receivers, we detected and tracked the movements of 33 tagged lobsters at 400-m depth for more than 3 months. We also identified the best procedures to localize both the acoustic receivers and the tagged lobsters, based on algorithms designed for identifying off-the-shelf acoustic tags. Autonomous mobile platforms that deliver data on animal behavior beyond traditional fixed platform capabilities represent an advance for prolonged, in situ monitoring of deep-sea benthic animal behavior at meter spatial scales.

Clara Fannjiang

Learning protein fitness models from evolutionary and assay-labeled data

A deep learning approach to pattern recognition for short DNA sequences
