Using random forests for assistance in the curation of G-protein coupled receptor databases

Shkurin, Aleksei; Vellido, Alfredo

doi:10.1186/s12938-017-0357-4

Cited by 8 publications

(8 citation statements)

References 22 publications


“…Multivariate models were developed with candidate variables that were significant in the univariate analysis using logistic regression and random forest analysis, and their predictive capabilities for PTS were compared. The logistic regression model is binary whereas the random forest creates multiple training sets for decision trees, wherein each tree is built based on a bootstrap sample drawn randomly from the original dataset using the CART method and the Decrease Gini Impurity as the splitting criterion (12). Furthermore, at each branching, only a given number of randomly selected features were considered as candidates.…”

Section: Model Development (mentioning)

confidence: 99%

Development and validation of predictive models for Crohn’s disease patients with prothrombotic state: a 6-year clinical analysis

Pan, Lu, Li et al. 2021

Ann Palliat Med


Background: Crohn's disease (CD) is a chronic idiopathic inflammatory disease. Studies show that multiple risk factors during disease progression can lead to a prothrombotic state (PTS), which predisposes the patient to thrombosis. Therefore, predicting PTS can help identify patients at risk of thrombosis. The aim of our study was to classify CD patients through D-dimer levels and to construct a prediction model for PTS. Methods: The clinical and laboratory parameters were extracted from a retrospective observational cohort. The factors significantly associated with PTS were determined by univariate analysis, and the importance rankings were calculated. Two multivariate models were then constructed using these factors to predict PTS in CD using logistic regression and random forest analysis. Results: A total of 744 CD patients were included in the study, of which 116 were in PTS. The significant PTS-related factors were older age, isolated colonic involvement, penetrating behavior, fever symptom, disease activity, abdominal surgery, lymphocyte counts, hematocrit levels, erythrocyte sedimentation rate, C-reactive protein, hematocrit, mean corpuscular volume levels and albumin. Multivariate logistic regression and random forest models predicted PTS with accuracies of 89.73% and 90.63%, respectively, and the corresponding AUCs were 0.76 and 0.84. Conclusions: Two predictive models based on clinical and laboratory variables identified CD patients with PTS with high accuracy.
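The random forest procedure described in the citation statement above (bootstrap samples, Gini-based splits, and a random subset of candidate features at each branching) can be sketched in plain Python. This is a minimal illustration on synthetic two-class data, not the authors' implementation; the function names, the depth limit, the number of trees and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, n_candidates, rng):
    """Search a random subset of features for the split with the largest Gini decrease."""
    features = rng.sample(range(len(X[0])), n_candidates)
    parent = gini(y)
    best = None  # (impurity decrease, feature index, threshold)
    for f in features:
        for t in sorted(set(row[f] for row in X)):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or parent - child > best[0]:
                best = (parent - child, f, t)
    return best

def build_tree(X, y, n_candidates, rng, depth=0, max_depth=5):
    """Grow a tree recursively; leaves hold the majority class label."""
    split = None
    if depth < max_depth and gini(y) > 0:
        split = best_split(X, y, n_candidates, rng)
    if split is None:
        return Counter(y).most_common(1)[0][0]
    _, f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (
        f, t,
        build_tree([X[i] for i in left], [y[i] for i in left],
                   n_candidates, rng, depth + 1, max_depth),
        build_tree([X[i] for i in right], [y[i] for i in right],
                   n_candidates, rng, depth + 1, max_depth),
    )

def predict_tree(node, row):
    """Follow splits until a leaf (a plain label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(X, y, n_trees=25, n_candidates=1, seed=0):
    """Train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 n_candidates, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]

# Synthetic toy data (an assumption for illustration, not the clinical cohort):
# class 0 has features in [0, 1], class 1 has features in [2, 3].
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 1), data_rng.uniform(0, 1)] for _ in range(20)]
X += [[data_rng.uniform(2, 3), data_rng.uniform(2, 3)] for _ in range(20)]
y = [0] * 20 + [1] * 20
forest = fit_forest(X, y)
print(predict_forest(forest, [0.1, 0.1]), predict_forest(forest, [2.5, 2.5]))
```

In practice a library implementation would normally be used; the sketch only makes the three ingredients named in the statement visible in code.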


“…Work on the 2011 version of the database provided evidence of clearly defined limits to the separability of the different class C subtypes. This evidence was produced using both supervised [25, 26] and semi-supervised [22] machine learning approaches and from different data transformation strategies. Interestingly, the subtypes shown to be most responsible for such lack of complete subtype separability were precisely those which were removed in the 2016 versions of the databases (namely vomeronasal, odorant and pheromone receptors).…”

Section: Data (mentioning)

confidence: 99%

“…Subsequent work reported in [26], which again employed alignment-free data transformations, used a Random Forest (RF) classifier [36] to further investigate the consistency of misclassification in this problem. Note that RF is an ensemble learning technique [37] with an internal classification voting system that is naturally suited to classification consistency analyses.…”

Section: Data (mentioning)

confidence: 99%


Using machine learning tools for protein database biocuration assistance

König, Shaim, Vellido et al. 2018

Sci Rep

Self Cite


Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise, as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.
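One way to picture how a forest's internal voting could support this kind of consistency analysis is to compare each sequence's curated label with the share of trees voting for each class. The sketch below assumes the per-tree votes are already available; the function names, the 0.6 threshold, and the placeholder subtype names "A" and "B" are hypothetical and are not taken from the cited studies.

```python
from collections import Counter

def vote_shares(tree_votes):
    """Fraction of trees voting for each class, for a single sample."""
    total = len(tree_votes)
    return {label: n / total for label, n in Counter(tree_votes).items()}

def flag_suspect_labels(votes_per_sample, curated_labels, min_share=0.6):
    """List samples whose curated label disagrees with a clear forest majority.

    Returns tuples of (sample index, curated label, majority class, majority share).
    """
    suspects = []
    for i, (votes, label) in enumerate(zip(votes_per_sample, curated_labels)):
        shares = vote_shares(votes)
        top_label, top_share = max(shares.items(), key=lambda kv: kv[1])
        if top_label != label and top_share >= min_share:
            suspects.append((i, label, top_label, top_share))
    return suspects

# Hypothetical votes from a 10-tree forest for three sequences.
votes = [
    ["A"] * 9 + ["B"],        # clear majority for A
    ["B"] * 8 + ["A"] * 2,    # clear majority for B
    ["A"] * 5 + ["B"] * 5,    # no clear majority
]
curated = ["A", "A", "B"]
suspects = flag_suspect_labels(votes, curated)
print(suspects)
```

Here only the second sequence is flagged: the forest votes strongly for a class other than the curated one, which marks it as a candidate for review by a curator rather than as a confirmed labeling error.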


“…Although assessing how complete or correct a large data set may be remains a challenge, examples have been reported. Examples include computational methods for identifying data updates and artifacts that may be of interest to downstream data consumers [1], machine learning methods to identify incorrectly classified G-protein coupled receptors [2], and to improve the quality of large data sets prior to quantitative structure-activity relationship modeling [3]. The completeness and quality of curated nanomaterial data has also been explored [4].…”

Section: Introduction (mentioning)

confidence: 99%

A statistical approach to identify, monitor, and manage incomplete curated data sets

Howe 2018

BMC Bioinformatics


Background: Many biological knowledge bases gather data through expert curation of published literature. High data volume, selective partial curation, delays in access, and publication of data prior to the ability to curate it can result in incomplete curation of published data. Knowing which data sets are incomplete and how incomplete they are remains a challenge. Awareness that a data set may be incomplete is important for proper interpretation and for avoiding flawed hypothesis generation, and can justify further exploration of published literature for additional relevant data. Computational methods to assess data set completeness are needed. One such method is presented here. Results: In this work, a multivariate linear regression model was used to identify genes in the Zebrafish Information Network (ZFIN) Database having incomplete curated gene expression data sets. Starting with 36,655 gene records from ZFIN, data aggregation, cleansing, and filtering reduced the set to 9870 gene records suitable for training and testing the model to predict the number of expression experiments per gene. Feature engineering and selection identified the following predictive variables: the number of journal publications; the number of journal publications already attributed for gene expression annotation; the percent of journal publications already attributed for expression data; the gene symbol; and the number of transgenic constructs associated with each gene. Twenty-five percent of the gene records (2483 genes) were used to train the model. The remaining 7387 genes were used to test the model. One hundred and twenty-two and 165 of the 7387 tested genes were identified as missing expression annotations based on their residuals being outside the model lower or upper 95% confidence interval, respectively. The model had a precision of 0.97 and recall of 0.71 at the negative 95% confidence interval and a precision of 0.76 and recall of 0.73 at the positive 95% confidence interval. Conclusions: This method can be used to identify data sets that are incompletely curated, as demonstrated using the gene expression data set from ZFIN. This information can help both database resources and data consumers gauge when it may be useful to look further for published data to augment the existing expertly curated information. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2121-6) contains supplementary material, which is available to authorized users.
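The residual-based flagging described in this abstract can be illustrated with a deliberately simplified sketch: a single-predictor least-squares fit on a training subset, followed by flagging test records whose residuals fall outside a band of ±1.96 residual standard deviations. The single predictor, the band construction, the function names and the toy numbers are all assumptions for illustration; the study itself used a multivariate model and its own interval definition.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def flag_outliers(train, test, z=1.96):
    """Fit on train, then return indices of test records below and above the band.

    The band is +/- z times the standard deviation of the training residuals,
    a rough stand-in for a 95% interval.
    """
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
    train_residuals = [y - (slope * x + intercept) for x, y in train]
    sd = statistics.stdev(train_residuals)
    below, above = [], []
    for i, (x, y) in enumerate(test):
        residual = y - (slope * x + intercept)
        if residual < -z * sd:
            below.append(i)   # fewer recorded experiments than predicted
        elif residual > z * sd:
            above.append(i)   # more recorded experiments than predicted
    return below, above

# Toy records: (number of publications, number of curated expression experiments).
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 11.9)]
test = [(2, 4.0), (4, 3.0), (5, 10.0), (3, 12.0)]
below, above = flag_outliers(train, test)
print(below, above)
```

Records in `below` have fewer curated experiments than the fitted trend predicts, which is the pattern one would expect for an incompletely curated gene in this toy setup.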
