2020
DOI: 10.1016/j.csbj.2019.11.004

Learning distributed representations of RNA and protein sequences and its application for predicting lncRNA-protein interactions

Abstract: [Graphical abstract]

Cited by 40 publications (16 citation statements)
References 35 publications (41 reference statements)
“…Deep learning shows excellent performance in many fields when supported by large-scale data; however, ncRPI data sets are generally not large, so deep learning methods are neither especially suitable nor urgently needed here. Previous research confirmed that for the ncRPI prediction task, tree-based and SVM models work well and that sequences contain enough information to predict ncRPIs [25,26]. Traditional machine learning techniques merit further exploration for accuracy and interpretability in small-sample learning tasks, especially ncRNA-protein interaction prediction.…”
Section: Introduction (mentioning)
confidence: 87%
“…This kind of model is a shallow two-layer neural network. In recent bioinformatics studies, such methods [22,23] have been used to train word embedding models for DNA, proteins, and lncRNAs, and this approach has been shown to outperform traditional sequence embedding methods such as one-hot encoding and k-mer counts.…”
Section: Distributed Representation of miRNA and mRNA Sequences (mentioning)
confidence: 99%
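To illustrate the approach the statement above describes, here is a minimal sketch (not the cited papers' exact pipelines) of training a shallow skip-gram word2vec model on sequences split into overlapping k-mer "words", using gensim. The toy RNA sequences and all hyperparameters are assumptions chosen for demonstration only.

```python
from gensim.models import Word2Vec

def to_kmer_sentence(seq, k=3):
    """Split a sequence into overlapping k-mers, treated as the 'words' of a sentence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy RNA sequences; a real corpus would contain thousands of sequences.
sequences = ["AUGGCUAGCUAGGAU", "GCUAGGAUUCGAUGC", "AUGCGGAUACGUUAG"]
corpus = [to_kmer_sentence(s) for s in sequences]

# Shallow two-layer skip-gram model (sg=1), as described in the excerpt above.
model = Word2Vec(corpus, vector_size=64, window=5, min_count=1, sg=1, epochs=50)

vec = model.wv["AUG"]  # distributed representation of the 3-mer 'AUG'
print(vec.shape)       # (64,)
```

Each k-mer thus receives a dense vector shaped by the contexts it appears in, unlike a one-hot code, which treats all k-mers as equally dissimilar.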
“…One of the most successful word embedding-based models is the word2vec model (Mikolov, Chen, Corrado, & Dean, 2013) for generating distributed representations of words and phrases. Considerable advances have been made with its standard application (Asgari & Mofrad, 2015a), and its use has been extended to modelling DNA (Ng, 2017), RNA (Yi et al., 2020) and protein (Asgari & Mofrad, 2015b) sequences. To briefly summarize those studies, projecting sequence data onto embedded spaces is likely to reduce the complexity of the algorithms needed to solve certain tasks (e.g.…”
Section: Continuous Distributed Representations for Protein Sequences (mentioning)
confidence: 99%
“…
• each protein sequence is treated as a sentence, made of overlapping words (k-mers) to incorporate some context-order information in the resulting distributed representation;
• the word size is 3, which appears to work well for embedding amino acid sequences in biological tasks (S. Cheng et al., 2019; Yi et al., 2020);
• the sequence vector is defined as the arithmetic mean of all its word vectors (see the sketch after this list).…”
Section: Continuous Distributed Representations for Protein Sequences (mentioning)
confidence: 99%
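The three steps in the excerpt above translate directly into a short mean-pooling routine. The sketch below assumes a word-vector lookup (e.g. gensim's `model.wv`) already trained on overlapping 3-mers, as in the earlier example; the function name and the usage line are hypothetical illustrations, not the cited papers' code.

```python
import numpy as np

def sequence_vector(seq, wv, k=3):
    """Embed a sequence as the arithmetic mean of its overlapping k-mer vectors.

    `wv` is a trained word-vector lookup (e.g. gensim's model.wv);
    k-mers absent from the vocabulary are simply skipped.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    vecs = [wv[km] for km in kmers if km in wv]
    if not vecs:
        raise ValueError("no k-mer of the sequence is in the vocabulary")
    return np.mean(vecs, axis=0)

# Hypothetical usage, assuming `model` was trained on protein 3-mer sentences:
# emb = sequence_vector("MKTAYIAKQR", model.wv)  # fixed-length vector, e.g. (64,)
```

Mean pooling yields a fixed-length vector regardless of sequence length, which is what lets downstream classifiers such as the tree-based and SVM models mentioned earlier consume variable-length sequences.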