SCORPION is a stacking-based ensemble learning framework for accurate prediction of phage virion proteins

Ahmad, Saeed; Charoenkwan, Phasit; Quinn, Julian M.W.; Moni, Mohammad Ali; Hasan, Md. Mehedi; Liò, Pietro; Shoombuatong, Watshara

doi:10.1038/s41598-022-08173-5

Cited by 26 publications

(12 citation statements)

References 55 publications

“…In a sense, some traditional models, such as regression models, also have a good explainability, as we can assess the coefficients of each attribute to measure how important a feature is. These models however do not measure up in terms of effectiveness when compared to modern tree-based algorithms in many scenarios, especially in cases with larger datasets [33]. Another key difference between these models is that, in the case of regression models, we have to explicitly remove collinear variables, but these variables, even though they might not improve classification performance, still yield valid model explanations.…”

Section: Results (mentioning)

confidence: 99%

Potential and limitations of machine meta-learning (ensemble) methods for predicting COVID-19 mortality in a large inhospital Brazilian dataset

Paiva

Pereira

Andrade

et al. 2023

Sci Rep

The majority of early prediction scores and methods to predict COVID-19 mortality are bound by methodological flaws and technological limitations (e.g., the use of a single prediction model). Our aim is to provide a thorough comparative study that tackles those methodological issues, considering multiple techniques to build mortality prediction models, including modern machine learning (neural) algorithms and traditional statistical techniques, as well as meta-learning (ensemble) approaches. This study used a dataset from a multicenter cohort of 10,897 adult Brazilian COVID-19 patients, admitted from March/2020 to November/2021, including patients [median age 60 (interquartile range 48–71), 46% women]. We also proposed new original population-based meta-features that have not been devised in the literature. Stacking has shown to achieve the best results reported in the literature for the death prediction task, improving over previous state-of-the-art by more than 46% in Recall for predicting death, with AUROC 0.826 and MacroF1 of 65.4%. The newly proposed meta-features were highly discriminative of death, but fell short in producing large improvements in final prediction performance, demonstrating that we are possibly on the limits of the prediction capabilities that can be achieved with the current set of ML techniques and (meta-)features. Finally, we investigated how the trained models perform on different hospitals, showing that there are indeed large differences in classifier performance between different hospitals, further making the case that errors are produced by factors that cannot be modeled with the current predictors.

“…The combination of models based on different classification premises potentially made Stacking more robust. If a single classifier makes a wrong prediction, the others can still make corrections, increasing the robustness of the final stacking model [32][33][34][35].…”

Section: Results (mentioning)

confidence: 99%

“…In order to enhance the performance of SVM classifiers, a grid search strategy was utilized to optimize the two important aspects of the RBF kernel, including C (controls the trade-off between the misclassification rate and margin) and γ (the kernel width parameter). Although SVM often yields satisfactory prediction performances, this method is known as a black-box computation method (Ahmad et al, 2022 [1]; Charoenkwan et al, 2021 [12]; Li et al, 2021 [33]; Wei et al, 2021 [54]).…”

Section: Methods (mentioning)

confidence: 99%
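
The grid search over C and γ mentioned in the statement above can be illustrated with a minimal sketch. It is not the cited authors' code: the candidate grids, the `grid_search` helper and the `toy_score` function are all assumptions made for illustration. In practice the score would be something like cross-validated accuracy of an SVM trained with each (C, γ) pair; a closed-form placeholder is used here so the example runs with the standard library alone.

```python
import itertools
import math


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2); a larger gamma gives a narrower kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def grid_search(score_fn, c_values, gamma_values):
    """Score every (C, gamma) pair and return the best pair with its score."""
    best_params, best_score = None, -math.inf
    for c, gamma in itertools.product(c_values, gamma_values):
        score = score_fn(c, gamma)
        if score > best_score:
            best_params, best_score = (c, gamma), score
    return best_params, best_score


# Log-spaced candidates; these ranges are illustrative, not taken from the cited study.
C_GRID = [2.0 ** e for e in range(-5, 16, 2)]
GAMMA_GRID = [2.0 ** e for e in range(-15, 4, 2)]


def toy_score(c, gamma):
    # Hypothetical stand-in for cross-validated accuracy; it peaks at C = 8, gamma = 2**-5.
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


best_params, best_score = grid_search(toy_score, C_GRID, GAMMA_GRID)
```

Searching on a logarithmic grid is a common choice because useful values of C and γ often span several orders of magnitude. The exhaustive loop costs one model fit per grid cell, so the grid resolution is a trade-off between search cost and how finely the optimum is located.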

Recent development of machine learning-based methods for the prediction of defensin family and subfamily

Charoenkwan

Schaduangrat

Mahmud

et al. 2022

EXCLI Journal; 21:Doc757; ISSN 1611-2156

Self Cite

Nearly all living species comprise of host defense peptides called defensins, that are crucial for innate immunity. These peptides work by activating the immune system which kills the microbes directly or indirectly, thus providing protection to the host. Thus far, numerous preclinical and clinical trials for peptide-based drugs are currently being evaluated. Although, experimental methods can help to precisely identify the defensin peptide family and subfamily, these approaches are often time-consuming and cost-ineffective. On the other hand, machine learning (ML) methods are able to effectively employ protein sequence information without the knowledge of a protein's three-dimensional structure, thus highlighting their predictive ability for the large-scale identification. To date, several ML methods have been developed for the in silico identification of the defensin peptide family and subfamily. Therefore, summarizing the advantages and disadvantages of the existing methods is urgently needed in order to provide useful suggestions for the development and improvement of new computational models for the identification of the defensin peptide family and subfamily. With this goal in mind, we first provide a comprehensive survey on a collection of six state-of-the-art computational approaches for predicting the defensin peptide family and subfamily. Herein, we cover different important aspects, including the dataset quality, feature encoding methods, feature selection schemes, ML algorithms, cross-validation methods and web server availability/usability. Moreover, we provide our thoughts on the limitations of existing methods and future perspectives for improving the prediction performance and model interpretability. The insights and suggestions gained from this review are anticipated to serve as a valuable guidance for researchers for the development of more robust and useful predictors.

“…A common solution is only to store k-mers that appear two or more times and to use a probabilistic data structure like a Bloom Filter or Counting Quotient Filter to remove singleton k-mers [53]. Other valuable numerical features include physicochemical properties, such as isoelectric point, aromaticity, molar extinction coefficient, instability index, molecular weight, polarity and hydrophobicity [54][55][56]. Each of these features can be calculated from the sequence alone using readily available bioinformatics tools [57].…”

Section: Homology-free Annotation: Extracting Protein Sequence Features (mentioning)

confidence: 99%
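
The singleton-removal idea in the statement above (store only k-mers seen two or more times, using a Bloom filter to absorb first sightings) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used by any cited tool: the `BloomFilter` class, its bit-array size and hash count, and the `count_repeated_kmers` helper are all hypothetical.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter; the size and hash count are illustrative defaults."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def count_repeated_kmers(sequence, k):
    """Count k-mers, keeping exact counts only for those seen at least twice."""
    seen_once = BloomFilter()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
        elif kmer in seen_once:
            # Second sighting: the first one was recorded only in the filter.
            counts[kmer] = 2
        else:
            seen_once.add(kmer)
    return counts


repeated = count_repeated_kmers("MKTAMKTAQ", 3)
```

Because a Bloom filter can report false positives, a true singleton may occasionally be stored with a count of 2; it never produces false negatives, so no repeated k-mer is lost. The memory saving comes from the filter holding first sightings in a fixed-size bit array instead of one dictionary entry per distinct k-mer.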

“…Once a set of features has been curated, machine learning techniques are applied to differentiate between protein functions. These techniques include Support Vector Machines [58][59][60] and Artificial Neural Networks [54], as well as ensembles of multiple machine learning methods [55,56]. Several function predictors have been released as bioinformatic tools, including PhANNs (Phage Artificial Neural Networks) [54], which can classify phage proteins into one of ten structural classes with 86.2% accuracy.…”

Section: Homology-free Annotation: Extracting Protein Sequence Features (mentioning)

confidence: 99%

What the protein!? Computational methods for predicting microbial protein functions

Grigson

Edwards

2023

Preprint

The identification of protein functions is crucial for understanding microbial life at a molecular scale. While computational methods for annotating protein sequences have greatly advanced in recent years, 30% of all bacterial and 65% of all viral protein sequences cannot be attributed a known biological function. As a result, protein function inference remains a fundamental challenge in computational biology. This paper reviews various bioinformatics methods for annotating microbial and viral proteins, categorised into homology-based and homology-free approaches. Widely used homology-based methods encompass sequence similarity searches such as BLAST and profile hidden Markov models, both of which compare novel protein sequences to databases of protein sequences with known functions. These homology-based methods have limitations, particularly for viral sequences which are severely underrepresented in protein sequence databases. As a result, homology-free methods, including numerical feature extraction, language-based models, guilt-by-association, and protein structure prediction software, offer potential alternatives. In addition, it is also important to critically consider the functional labels used to describe protein functions, and the hierarchical organisation of functional labels, regardless of the annotation method implemented. This review highlights that a combination of multiple functional prediction strategies, including machine learning, may provide the best improvements for microbial protein annotation and alleviate the ever-expanding sequence-function gap affecting microbial proteins. Overall, we provide experimental biologists with a comprehensive overview of annotation methods and inform computational scientists of open challenges and future research avenues.
