The human cytochrome P450 (CYP) superfamily is responsible
for the metabolism of both endogenous and exogenous compounds such
as drugs, cellular metabolites, and toxins. Inhibition of CYP enzymes
is closely associated with adverse drug reactions,
including metabolic failures and induced side effects. In modern
drug discovery, identifying potential CYP inhibitors is therefore
essential. Alongside experimental approaches, numerous computational
models have been proposed to address this biochemical issue. In this
study, we introduce iCYP-MFE, a computational framework for virtual
screening of inhibitors of the CYP 1A2, 2C9, 2C19, 2D6, and 3A4 isoforms.
iCYP-MFE contains a set of five robust, stable, and effective prediction
models developed using multitask learning incorporated with molecular
fingerprint-embedded features. The results show that multitask learning
can leverage useful information from related tasks to improve
overall performance. Comparative analysis indicates that, relative to
state-of-the-art methods, iCYP-MFE performs better on three tasks,
equivalently on one task, and worse on one task. The area under the receiver
operating characteristic curve (AUC-ROC) and the area under the precision-recall
curve (AUC-PR) were two decisive metrics used for model evaluation.
The prediction task for CYP2D6-inhibition achieves the highest AUC-ROC
value of 0.93 while the prediction task for CYP1A2-inhibition obtains
the highest AUC-PR value of 0.92. The substructural analysis preliminarily
explains the nature of the CYP-inhibitory activity of compounds. An
online web server for iCYP-MFE with a user-friendly interface was
also deployed to support scientific communities in identifying CYP
inhibitors.
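
To illustrate the two evaluation metrics named above, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names `auc_roc` and `average_precision` and the toy labels and scores are assumptions made here, and average precision is only one common estimator of AUC-PR (the study may have used a different one).

```python
def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair (Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Average precision: mean of precision values at each positive hit.

    Assumption: ties in `scores` are broken by input order (no tie handling).
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy data (hypothetical): 1 = inhibitor, 0 = non-inhibitor.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

print(round(auc_roc(y_true, scores), 3))            # 0.778
print(round(average_precision(y_true, scores), 3))  # 0.806
```

In a multitask setting such as the one described, these functions would be applied once per isoform, using that task's labels and predicted probabilities.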
Cancer is one of the deadliest
diseases, killing millions
of people worldwide each year. Research on anticancer medicines
continues to seek better and more adaptive agents with fewer side
effects. Besides chemically synthesized anticancer compounds, natural
products have been shown to be a promising alternative
source for anticancer drug discovery. Along with experimental approaches
being used to find anticancer drug candidates, computational approaches
have been developed to virtually screen for potential anticancer compounds.
In this study, we construct an ensemble computational framework, called
iANP-EC, using machine learning approaches incorporated with evolutionary
computation. Four learning algorithms (k-NN, SVM,
RF, and XGB) and four molecular representation schemes are used to
build a set of classifiers, among which the four best-performing
classifiers are selected to form an ensemble classifier. Particle
swarm optimization (PSO) is used to optimize the weights used to combine
the four top classifiers. The models are developed using a curated set of
997 compounds collected from the NPACT and CancerHSP databases.
The results show that iANP-EC is a stable, robust, and effective framework
that achieves an AUC-ROC value of 0.9193 and an AUC-PR value of 0.8366.
The comparative analysis of molecular substructures between natural
anticarcinogens and nonanticarcinogens partially unveils several key
substructures that drive anticancer activity. We also deploy
the proposed ensemble model as an online web server with a user-friendly
interface to support the research community in identifying natural
products with anticancer activities.
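
The weighting step described above can be sketched as follows. This is a minimal, illustrative particle swarm optimizer in plain Python, not the iANP-EC implementation: the fitness function (validation AUC-ROC of the weighted average of class probabilities), the hyperparameters (`n_particles`, `n_iters`, `inertia`, `c1`, `c2`), the weight normalization, and the toy data are all assumptions made here.

```python
import random


def auc_roc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def normalize(weights):
    """Clip negatives to zero and rescale so the weights sum to 1."""
    clipped = [max(w, 0.0) for w in weights]
    total = sum(clipped)
    if total == 0.0:
        return [1.0 / len(clipped)] * len(clipped)
    return [w / total for w in clipped]


def ensemble_scores(weights, prob_matrix):
    """prob_matrix[i][j] is classifier j's predicted probability for sample i."""
    return [sum(w * p for w, p in zip(weights, row)) for row in prob_matrix]


def pso_weights(y_true, prob_matrix, n_particles=20, n_iters=50,
                inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for ensemble weights that maximize validation AUC-ROC."""
    rng = random.Random(seed)
    dim = len(prob_matrix[0])

    def fitness(position):
        return auc_roc(y_true, ensemble_scores(normalize(position), prob_matrix))

    # Particle 0 starts at equal weights, so the result is never worse
    # than simple averaging on this validation data.
    positions = [[1.0 / dim] * dim]
    positions += [[rng.random() for _ in range(dim)] for _ in range(n_particles - 1)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in positions]
    best_fit = [fitness(p) for p in positions]
    g = max(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                velocities[i][d] = (inertia * velocities[i][d]
                                    + c1 * r1 * (best_pos[i][d] - positions[i][d])
                                    + c2 * r2 * (g_pos[d] - positions[i][d]))
                # Keep each coordinate inside [0, 1].
                positions[i][d] = min(1.0, max(0.0, positions[i][d] + velocities[i][d]))
            f = fitness(positions[i])
            if f > best_fit[i]:
                best_fit[i], best_pos[i] = f, positions[i][:]
                if f > g_fit:
                    g_fit, g_pos = f, positions[i][:]
    return normalize(g_pos), g_fit


# Toy validation data (hypothetical): 8 compounds, 4 base classifiers.
y_val = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [
    [0.9, 0.6, 0.8, 0.4],
    [0.7, 0.4, 0.6, 0.7],
    [0.4, 0.8, 0.7, 0.6],
    [0.3, 0.5, 0.2, 0.5],
    [0.6, 0.3, 0.4, 0.3],
    [0.2, 0.6, 0.3, 0.8],
    [0.8, 0.7, 0.5, 0.9],
    [0.5, 0.2, 0.6, 0.2],
]
weights, score = pso_weights(y_val, probs)
print(weights, score)
```

Optimizing weights on the same data used for the final evaluation would overstate performance, so in practice the weights would be fitted on a held-out validation split or within cross-validation.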
Background
Promoters, non-coding DNA sequences located upstream of the transcription start sites of genes or gene clusters, are essential regulatory elements for the initiation and regulation of transcription. Furthermore, identifying promoters in DNA sequences and genomes contributes significantly to discovering the entire structures of genes of interest. Therefore, the exploration of promoter regions is one of the most important topics in molecular genetics and biology. Besides experimental techniques, computational methods have been developed to predict promoters. In this study, we propose iPromoter-Seqvec, an efficient computational model to predict TATA and non-TATA promoters in human and mouse genomes using bidirectional long short-term memory neural networks in combination with sequence-embedded features extracted from input sequences. The promoter and non-promoter sequences were retrieved from the Eukaryotic Promoter Database and then refined to create four benchmark datasets.
Results
The area under the receiver operating characteristic curve (AUCROC) and the area under the precision-recall curve (AUCPR) were used as two key metrics to evaluate model performance. Results on independent test sets showed that iPromoter-Seqvec outperformed other state-of-the-art methods with AUCROC values ranging from 0.85 to 0.99 and AUCPR values ranging from 0.86 to 0.99. Models predicting TATA promoters in both species had slightly higher predictive power compared to those predicting non-TATA promoters. With a novel idea of constructing artificial non-promoter sequences based on promoter sequences, our models were able to learn highly specific characteristics discriminating promoters from non-promoters to improve predictive efficiency.
Conclusions
iPromoter-Seqvec is a stable and robust model for predicting both TATA and non-TATA promoters in human and mouse genomes. Our proposed method was also deployed as an online web server with a user-friendly interface to support research communities. Links to our source code and web server are available at https://github.com/mldlproject/2022-iPromoter-Seqvec.
Malaria
is a life-threatening disease that has claimed many lives and
has a high annual prevalence. Over the past decade,
many studies have sought effective antimalarial compounds
to combat this disease. Alongside chemically synthesized compounds,
a number of natural compounds have also proven similarly effective
as antimalarials. Besides experimental approaches
to investigate antimalarial activities in natural products, computational
methods have been developed with satisfactory outcomes. In
this study, we propose a novel molecular encoding scheme based on
Bidirectional Encoder Representations from Transformers and use our
pretrained encoding model, NPBERT, with four machine learning
algorithms, including k-Nearest Neighbors (k-NN), Support Vector Machines (SVM), eXtreme Gradient Boosting
(XGB), and Random Forest (RF), to develop various prediction models
to identify antimalarial natural products. The results show that SVM
models are the best-performing classifiers, followed by the XGB, k-NN, and RF models. Additionally, comparative analysis
between our proposed molecular encoding scheme and existing state-of-the-art
methods indicates that NPBERT is more effective than the others.
Moreover, the use of transformers in constructing molecular
encoders is not limited to this study and can be extended to other
biomedical applications.
Nonclassical secreted proteins (NSPs) are proteins released into the extracellular environment through biological transport pathways other than the Sec/Tat system. As experimental determination of NSPs is often costly and requires skilled handling, computational approaches are necessary. In this study, we introduce iNSP‐GCAAP, a computational prediction framework for identifying NSPs. We propose encoding sequence data with the global composition of a customized set of amino acid properties and using the random forest (RF) algorithm for classification. We developed our model on the training dataset introduced by Zhang et al. (Bioinformatics, 36(3), 704–712, 2020) and tested it on the independent test set from the same study. The area under the receiver operating characteristic curve on that test set was 0.9256, higher than that of other state‐of‐the‐art methods using the same datasets. Our framework is also deployed as a user‐friendly web‐based application to support the research community in predicting NSPs.
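
The idea of a global composition encoding can be sketched as below. This is only an illustration under stated assumptions: the abstract does not list the customized property set, so the five groups in `PROPERTY_GROUPS` are common textbook groupings chosen here for demonstration, and the function name `global_composition` is hypothetical.

```python
# Illustrative property groups (assumption: NOT the customized set from the study).
PROPERTY_GROUPS = {
    "positive": set("KRH"),
    "negative": set("DE"),
    "aromatic": set("FWY"),
    "aliphatic": set("AVLI"),
    "polar_uncharged": set("STNQ"),
}


def global_composition(sequence, groups=PROPERTY_GROUPS):
    """Encode a protein sequence as the fraction of residues in each property group.

    The output length equals the number of groups, regardless of sequence length,
    which is what makes the encoding usable as a fixed-size feature vector.
    """
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        raise ValueError("sequence contains no residues")
    n = len(residues)
    return [sum(c in members for c in residues) / n for members in groups.values()]


# Toy sequence (hypothetical): 2 positive, 2 negative, 1 aromatic, 1 aliphatic residue.
features = global_composition("KRDEFA")
print(dict(zip(PROPERTY_GROUPS, features)))
```

Feature vectors of this kind could then be passed to any random forest implementation for classification; the choice of properties and any further normalization would follow the original study rather than this sketch.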