Many microbes in the human body play a vital role in cell physiology. In recent years, accumulating evidence has indicated that microbes are closely related to many complex human diseases. In-depth investigation of disease-associated microbes can contribute to understanding the pathogenesis of diseases and thus provide novel strategies for their treatment, diagnosis, and prevention. To date, many computational models have been proposed for predicting microbe-disease associations using available similarity networks. However, these similarity networks are not effectively fused. In this study, we proposed a novel computational model based on multi-data integration and network consistency projection for Human Microbe-Disease Associations Prediction (HMDA-Pred), which fuses multiple similarity networks by a linear network fusion method. HMDA-Pred yielded AUC values of 0.9589 and 0.9361 ± 0.0037 in leave-one-out cross validation (LOOCV) and 5-fold cross validation (5-fold CV), respectively. Furthermore, in case studies, 10, 8, and 10 of the top 10 predicted microbes for asthma, colon cancer, and inflammatory bowel disease, respectively, were confirmed by the literature.
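As a rough illustration of the two ideas named in this abstract, the sketch below fuses similarity networks by a weighted average and then scores microbe-disease pairs with one common formulation of network consistency projection. The abstract does not give the formulas, so the function names, the equal weights, and the exact normalization are all assumptions rather than the HMDA-Pred implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def fuse_linear(networks, weights):
    """Weighted element-wise average of equally sized similarity matrices.
    (Assumed form of 'linear network fusion'; weights are hypothetical.)"""
    total = sum(weights)
    n_rows, n_cols = len(networks[0]), len(networks[0][0])
    return [[sum(w * net[i][j] for net, w in zip(networks, weights)) / total
             for j in range(n_cols)] for i in range(n_rows)]

def ncp_scores(assoc, sim_microbe, sim_disease):
    """assoc[i][j] is 1 if microbe i is known to be linked to disease j, else 0.
    Returns a score matrix of the same shape."""
    n_m, n_d = len(assoc), len(assoc[0])
    assoc_cols = [[assoc[i][j] for i in range(n_m)] for j in range(n_d)]
    sd_cols = [[sim_disease[k][j] for k in range(n_d)] for j in range(n_d)]
    scores = [[0.0] * n_d for _ in range(n_m)]
    for i in range(n_m):
        for j in range(n_d):
            col_norm = norm(assoc_cols[j])
            row_norm = norm(assoc[i])
            # Microbe-space projection: similarity row of microbe i onto
            # the known-association column of disease j.
            proj_m = dot(sim_microbe[i], assoc_cols[j]) / col_norm if col_norm else 0.0
            # Disease-space projection: association row of microbe i onto
            # the similarity column of disease j.
            proj_d = dot(assoc[i], sd_cols[j]) / row_norm if row_norm else 0.0
            denom = norm(sim_microbe[i]) + norm(sd_cols[j])
            scores[i][j] = (proj_m + proj_d) / denom if denom else 0.0
    return scores

# Toy data (hypothetical): 2 microbes, 2 diseases.
fused = fuse_linear([[[1, 0], [0, 1]], [[1, 1], [1, 1]]], [1, 1])
scores = ncp_scores([[1, 0], [0, 1]], fused, [[1, 0.2], [0.2, 1]])
```

In this toy case the known pair (microbe 0, disease 0) scores higher than the unknown pair (microbe 0, disease 1), which is the behavior one would expect from a projection-based ranking.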
A terminator is a DNA sequence that gives RNA polymerase the transcriptional termination signal. Correctly identifying terminators can improve genome annotation and, more importantly, has considerable application value in disease diagnosis and therapy. However, accurate prediction methods are lacking and urgently needed. Therefore, we proposed a terminator prediction method, "iterb-PPse", which incorporates 47 nucleotide properties into PseKNC-I and PseKNC-II and uses Extreme Gradient Boosting to predict terminators in Escherichia coli and Bacillus subtilis. In combination with preceding methods, we employed three new feature extraction methods, K-pwm, Base-content, and Nucleotidepro, to formulate raw samples. A two-step method was applied to select features. When identifying terminators based on the optimized features, we compared five single models as well as 16 ensemble models. As a result, the accuracy of our method on the benchmark dataset reached 99.88%, higher than that of the existing state-of-the-art predictor iTerm-PseKNC, in a fivefold cross-validation test repeated 100 times. Its prediction accuracy on two independent datasets reached 94.24% and 99.45%, respectively. For the convenience of users, we developed a software tool with the same name on the basis of "iterb-PPse". The open software and source code of "iterb-PPse" are available at https://github.com/Sarahyouzi/iterb-PPse.
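The evaluation protocol mentioned above, fivefold cross-validation repeated 100 times, can be sketched as follows. This is a minimal illustration in plain Python that assumes the repeats differ only in how the data are shuffled and that the reported figure is the mean accuracy. The threshold classifier is a toy stand-in, not Extreme Gradient Boosting, and every name here is hypothetical.

```python
import random

def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::k] for f in range(k)]

def repeated_kfold_accuracy(X, y, fit, predict, k=5, repeats=100, seed=0):
    """Mean accuracy over `repeats` independent k-fold cross-validations."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        folds = kfold_indices(len(X), k, rng)
        correct = 0
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            model = fit([X[i] for i in train], [y[i] for i in train])
            correct += sum(predict(model, X[i]) == y[i] for i in folds[f])
        accuracies.append(correct / len(X))
    return sum(accuracies) / len(accuracies)

# Toy stand-in classifier: threshold halfway between the two class means.
def fit_threshold(xs, ys):
    neg = [x for x, label in zip(xs, ys) if label == 0]
    pos = [x for x, label in zip(xs, ys) if label == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def predict_threshold(threshold, x):
    return int(x > threshold)

# Hypothetical one-feature dataset with two well-separated classes.
X = [0.10, 0.20, 0.15, 0.05, 0.12, 0.90, 0.80, 0.85, 0.95, 0.88]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
mean_acc = repeated_kfold_accuracy(X, y, fit_threshold, predict_threshold)
```

Repeating the split with different shuffles reduces the dependence of the reported accuracy on any single partition of the benchmark set.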
Background The origin is the starting site of DNA replication, an extremely vital part of the inheritance of genetic information between parents and offspring. More importantly, accurately identifying the origin of replication has great application value in the diagnosis and treatment of diseases related to genetic information errors, while traditional biological experimental methods are time-consuming and laborious. Results We carried out research on the origin of replication in a variety of eukaryotes and proposed a unique prediction method for each species. Throughout the experiment, we collected data from 7 species: Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, Kluyveromyces lactis, Pichia pastoris and Schizosaccharomyces pombe. In addition to the commonly used sequence feature extraction methods PseKNC-II and Base-content, we designed a feature extraction method based on TF-IDF. The two-step method was then used for feature selection. After comparing a variety of traditional machine learning classification models, we employed the multi-layer perceptron as the classification algorithm. The data and code involved in the experiment are available at https://github.com/Sarahyouzi/EukOriginPredict. Conclusions The prediction accuracy on the training sets of the above-mentioned seven species after fivefold cross validation repeated 100 times reached 92.60%, 90.80%, 91.22%, 96.15%, 96.72%, 99.86%, and 96.72%, respectively. This indicates that, compared with other methods, the methods we designed achieve superior performance. In addition, our experiments reveal that the models of multiple species can predict each other with high accuracy, and the results of STREME show that they share a common motif.
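To make the TF-IDF idea concrete, here is a minimal sketch that treats each DNA sequence as a document and its overlapping k-mers as terms. The abstract does not specify the exact variant, so the choice of k = 3, the unsmoothed log(N / df) weighting, and the function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def tfidf_features(seqs, k=3):
    """Return (vocab, rows): one TF-IDF vector per sequence over sorted k-mers."""
    docs = [Counter(kmers(s, k)) for s in seqs]
    vocab = sorted(set().union(*docs))
    n_docs = len(docs)
    # Document frequency: number of sequences containing each k-mer.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    rows = []
    for d in docs:
        total = sum(d.values())
        rows.append([d[t] / total * idf[t] if total else 0.0 for t in vocab])
    return vocab, rows

# Hypothetical toy sequences.
vocab, rows = tfidf_features(["ATGATG", "ATGCCC"], k=3)
```

With this weighting, a k-mer that occurs in every sequence (here "ATG") gets a weight of zero, while k-mers specific to a few sequences are emphasized, which is the usual motivation for applying TF-IDF to sequence data.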
A terminator is a DNA sequence that gives RNA polymerase the transcriptional termination signal. Correctly identifying terminators can improve genome annotation and, more importantly, has considerable application value in disease diagnosis and therapy. However, accurate prediction methods are lacking and urgently needed. Therefore, we proposed a terminator prediction method, "iterb-PPse", which incorporates 47 nucleotide properties into PseKNC-Ⅰ and PseKNC-Ⅱ and uses Extreme Gradient Boosting to predict terminators in Escherichia coli and Bacillus subtilis. In combination with preceding methods, we employed three new feature extraction methods, K-pwm, Base-content, and Nucleotidepro, to formulate raw samples. A two-step method was applied to select features. When identifying terminators based on the optimized features, we compared five single models as well as 16 ensemble models. As a result, the accuracy of our method on the benchmark dataset reached 99.88%, higher than that of the existing state-of-the-art predictor iTerm-PseKNC, in a five-fold cross-validation test repeated 100 times. Its prediction accuracy on two independent datasets reached 94.24% and 99.45%, respectively. For the convenience of users, a software tool with the same name was developed on the basis of "iterb-PPse". The open software and source code of "iterb-PPse" are available at https://github.com/Sarahyouzi/iterb-PPse.
1 Introduction
DNA transcription is an important step in the inheritance of genetic information, and terminators, which lie within the transcribed sequence, control the termination of transcription. During transcription, the terminator gives RNA polymerase the transcriptional termination signal. Accurately identifying terminators can improve genome annotation and, more importantly, has great application value in disease diagnosis and therapy, so identifying terminators is crucial. However, identifying terminators with traditional biological experiments is extremely time-consuming and labor-intensive. Therefore, a more effective and convenient approach has come into use in research: adopting machine learning to identify gene sequences.
Previous research found that there are two types of terminators in prokaryotes, namely Rho-dependent and Rho-independent [1], as shown in Fig 1. Although there have been many studies on the prediction of terminators, most of them focused on only one type. In 2004, Wan XF, Xu D et al. proposed a prediction method for Rho-independent terminators with an accuracy of 92.25%. In 2005, Michiel J. L. de Hoon et al. studied the sequences of Rho-independent terminators in B. subtilis [2], and the final prediction accuracy was 94%. In 2011, Magali Naville et al. conducted research on Rho-dependent transcriptional terminators [3]. They used two published algorithms, Erpin and RNA motif, to predict terminators. The specificity and sensitivity of the final results were 95.3% and 87.8...