“…Additionally, BERT, which essentially consists of stacked Transformer encoder layers, shows enhanced performance on downstream task-specific predictions after pre-training on a massive dataset (Devlin et al., 2019). In the field of bioinformatics, several BERT architectures pre-trained on massive corpora of protein sequences have recently been proposed, demonstrating their capability to decode the context of biological sequences (Rao et al., 2019; Rives et al., 2021; Elnaggar et al., 2021; Iuchi et al., 2021). Beyond these protein language models, Ji et al. (2021) pre-trained a BERT model, named DNABERT, on the whole human reference genome and demonstrated its broad applicability for predicting promoter regions, splicing sites, and transcription factor binding sites upon fine-tuning.…”
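As a concrete illustration of the pre-train/fine-tune paradigm described above, the following minimal sketch fine-tunes a BERT-style encoder for binary sequence classification using the Hugging Face `transformers` API. The checkpoint identifier and the toy k-mer inputs are placeholders for illustration only, not the exact setup of Devlin et al. (2019) or Ji et al. (2021); in practice one would substitute a genome-pre-trained checkpoint such as DNABERT and a labeled dataset (e.g., promoter vs. non-promoter sequences).

```python
# Sketch of task-specific fine-tuning of a pre-trained BERT-style encoder.
# Assumptions: the checkpoint name is a placeholder, and the "sequences"
# are toy space-separated k-mer tokens, not real training data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"  # placeholder; swap in a DNA-pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# A classification head is attached on top of the pre-trained encoder;
# its weights are randomly initialized and learned during fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy batch: two k-merized DNA sequences with binary labels (1 = promoter).
sequences = ["ATGCGT TGCGTA GCGTAC", "TTTAAA TTAAAC TAAACG"]
labels = torch.tensor([1, 0])

batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # one gradient step of task-specific fine-tuning
```

Because the encoder weights already encode sequence context from pre-training, fine-tuning typically requires far less labeled data than training from scratch; only the small classification head starts from random initialization.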