Recently, language representation models have drawn considerable attention in natural language processing due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieved new state-of-the-art performance. BERT adopts contextualized word embeddings to capture the semantics of words in the contexts in which they appear. In this study, we present a novel technique that incorporates a BERT-based multilingual model into bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, a well-known and challenging problem in this field. We observed that our BERT-based features improved sensitivity, specificity, accuracy, and Matthews correlation coefficient by more than 5–10% compared with the current state-of-the-art features in bioinformatics. Moreover, further experiments show that deep learning (represented here by 2D convolutional neural networks, CNNs) holds potential for learning BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
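The core idea of treating DNA as natural sentences can be illustrated with a minimal sketch: the sequence is split into overlapping k-mers ("words"), which are then mapped to a fixed-length numerical matrix. The abstract does not specify k-mer size, matrix dimensions, or the exact BERT variant, so those are assumptions here; a hashed random lookup table stands in for the BERT encoder, which in the actual method produces contextual embeddings.

```python
import numpy as np

def kmerize(seq, k=3):
    """Split a DNA sequence into overlapping k-mers, i.e. its 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def embed_sequence(seq, k=3, dim=8, max_len=16, seed=0):
    """Map a DNA sequence to a fixed-length (max_len x dim) matrix.
    NOTE: the random per-k-mer vectors below are a stand-in for the
    contextual embeddings a trained BERT model would produce; k, dim,
    and max_len are illustrative choices, not values from the paper."""
    rng = np.random.default_rng(seed)
    table = {}                      # toy k-mer -> vector lookup
    mat = np.zeros((max_len, dim))  # zero-padded to a fixed length
    for i, kmer in enumerate(kmerize(seq, k)[:max_len]):
        if kmer not in table:
            table[kmer] = rng.normal(size=dim)
        mat[i] = table[kmer]
    return mat

m = embed_sequence("ACGTACGTACGT")
print(m.shape)  # (16, 8)
```

The fixed-length matrix output is what makes the representation directly consumable by a 2D CNN, as the abstract's downstream classifier requires.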
In cellular transport mechanisms, the movement of ions across the cell membrane and its proper control are essential to cells and to life processes in general. Ion transporters/pumps and ion channel proteins work as border guards, controlling the incessant traffic of ions across cell membranes. We revisited the classification of transporters and ion channels among membrane proteins with a more efficient deep learning approach. Specifically, we applied multi-window scanning filters of convolutional neural networks to almost full-length position-specific scoring matrices to extract useful information. In this way, we were able to retain important evolutionary information about the proteins. Our experimental results show that a convolutional neural network with a minimal number of convolutional layers can be enough to extract the conserved information of proteins, leading to higher performance. Our best prediction models were obtained after examining different data-imbalance handling techniques and different protein encoding methods. We also show that our models are superior to traditional deep learning approaches on the same datasets, as well as to other machine learning classification algorithms.
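The multi-window scanning idea can be sketched as 1-D convolutions of several window sizes slid along the sequence axis of a position-specific scoring matrix (an L x 20 array), each followed by global max pooling. The window sizes and filter counts below are illustrative assumptions, and random filter weights stand in for trained ones; only the data flow matches the description in the abstract.

```python
import numpy as np

def multi_window_features(pssm, windows=(3, 5, 7), n_filters=4, seed=0):
    """Slide filters of several window sizes along the sequence axis of a
    PSSM (L x 20), then global-max-pool each filter's response.
    NOTE: filter weights are random here; in the actual model they are
    learned by the CNN during training."""
    rng = np.random.default_rng(seed)
    L, A = pssm.shape
    feats = []
    for w in windows:
        filt = rng.normal(size=(n_filters, w, A))  # assumed, untrained filters
        conv = np.array([
            [np.sum(pssm[i:i + w] * filt[f]) for i in range(L - w + 1)]
            for f in range(n_filters)
        ])
        feats.append(conv.max(axis=1))  # global max pooling per filter
    return np.concatenate(feats)        # n_filters * len(windows) features

pssm = np.random.default_rng(1).normal(size=(50, 20))
print(multi_window_features(pssm).shape)  # (12,)
```

Pooling over almost the full-length PSSM, rather than a fixed crop, is what lets the model retain conserved evolutionary signal regardless of where it occurs in the protein.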
Since 2015, a fast-growing number of deep learning–based methods have been proposed for protein–ligand binding site prediction, and many have achieved promising performance. These methods, however, neglect the imbalanced nature of binding site prediction problems. Traditional data-based approaches to handling data imbalance rely on linear interpolation of minority-class samples; such samples may not be fully exploited by deep neural networks on downstream tasks. We present a novel technique for balancing input classes: a deep neural network–based variational autoencoder (VAE) that learns important attributes of the minority classes through nonlinear combinations of features. After training, the VAE was used to generate new minority-class samples, which were then added to the original data to create a balanced dataset. Finally, a convolutional neural network was used for classification, under the assumption that it could fully exploit this nonlinearity. As a case study, we applied our method to the identification of FAD- and FMN-binding sites of electron transport proteins. Compared with the best classifiers that use traditional machine learning algorithms, our models achieved a substantial improvement in sensitivity while maintaining similar or higher accuracy and specificity. We also demonstrate that our method outperforms other data-imbalance handling techniques, such as SMOTE, ADASYN, and class-weight adjustment, and that our models outperform existing predictors on the same binding types. Our method is general and can be applied to other data types for prediction problems with moderate-to-heavy data imbalance.
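The rebalancing step described above can be sketched as a small pipeline: sample latent vectors z ~ N(0, I), decode them into synthetic minority-class samples, and append those to the original data until the classes are even. The `decode` callable below is a hypothetical stand-in for the trained VAE decoder (training the VAE itself is omitted), and the toy shapes are assumptions for illustration.

```python
import numpy as np

def balance_with_generator(X, y, decode, latent_dim=2, minority=1, seed=0):
    """Oversample the minority class using a generative model: draw
    z ~ N(0, I) and pass it through `decode`, which stands in for the
    trained VAE decoder described in the abstract."""
    rng = np.random.default_rng(seed)
    n_needed = np.sum(y != minority) - np.sum(y == minority)
    z = rng.normal(size=(n_needed, latent_dim))
    X_new = decode(z)  # synthetic minority-class samples
    X_bal = np.vstack([X, X_new])
    y_bal = np.concatenate([y, np.full(n_needed, minority)])
    return X_bal, y_bal

# toy data: 8 majority and 3 minority samples, 4 features each
X = np.zeros((11, 4))
y = np.array([0] * 8 + [1] * 3)
decode = lambda z: z @ np.ones((2, 4))  # hypothetical stand-in decoder
Xb, yb = balance_with_generator(X, y, decode)
print(Xb.shape, np.bincount(yb))  # (16, 4) [8 8]
```

Because the decoder is a nonlinear neural network in the actual method, the generated samples are nonlinear combinations of minority-class attributes, in contrast to the linear interpolation performed by SMOTE-style techniques.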