Ten quick tips for machine learning in computational biology

Chicco, Davide

doi:10.1186/s13040-017-0155-3

Cited by 695 publications (519 citation statements)

References 52 publications

Citation types: Supporting, Mentioning (466), Contrasting, Unclassified

“…We measured the performance of the proposed methods with the area under the Precision-Recall (PR) curves (Davis & Goadrich, 2006; Chicco, 2017) and the logistic loss (also known as cross-entropy loss) function.…”

Section: Results (mentioning)

confidence: 99%
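The two metrics named in the statement above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the citing authors' code: the function names are hypothetical, the PR-curve area is approximated here by average precision (one common step-wise summary of that curve), and tied scores are not given any special treatment.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean logistic (cross-entropy) loss for binary labels in {0, 1}."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def average_precision(y_true, y_score):
    """Average of the precision values measured at each true positive,
    walking the examples from highest to lowest score."""
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(log_loss(labels, scores))
print(average_precision(labels, scores))
```

Lower logistic loss is better, while a higher average precision is better, so the two numbers are read in opposite directions.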

“…High dimensional data can lead to several problems: in addition to high computational costs (in memory and time), it often leads to overfitting (Van Der Maaten, Postma & Van den Herik, 2009; Chicco, 2017; Moore, 2004). Dimensionality reduction can limit these problems and, additionally, can improve the visualization and interpretation of the dataset, because it allows researchers to focus on a reduced number of features.…”

Section: Methods (mentioning)

confidence: 99%

“…The hospital anonymized all the records before releasing the dataset. The dataset is now publicly available on the Machine Learning Repository website of the University of California Irvine (UCI ML) (University of California Irvine, 1987). To avoid problems of the algorithm behavior related to different value ranges of each feature, we scaled all the features in our experiments using [0,1] normalization, and we input missing data using the average value (Chicco, 2017). While more complex pre-processing schemes could be introduced, such as inferring the missing value with a k-nearest neighbor model (Santos et al, 2015), we decided to use this methodology to avoid additional complexity that would make it difficult to fairly compare the explored techniques.…”

Section: Dataset (mentioning)

confidence: 99%
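The pre-processing described in the statement above (filling missing values with the feature average, then rescaling each feature to [0, 1]) might look like the following minimal sketch. The helper names and the toy column are hypothetical, missing values are assumed to be encoded as None, and the order of the two steps is an assumption, since the excerpt does not state it.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_scale(column):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no range; map every value to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Toy feature column with one missing entry (illustrative values only).
ages = [20, None, 40, 60]
filled = impute_mean(ages)
scaled = min_max_scale(filled)
print(filled)
print(scaled)
```

In a train/test setting, the mean, minimum, and maximum would normally be computed on the training split only and then reused on the test split, so that no information leaks from the held-out data.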


Supervised deep learning embeddings for the prediction of cervical cancer diagnosis

Fernandes, Chicco, Cardoso, et al. 2018

PeerJ Computer Science

Self Cite


Cervical cancer remains a significant cause of mortality all around the world, even if it can be prevented and cured by removing affected tissues in early stages. Providing universal and efficient access to cervical screening programs is a challenge that requires identifying vulnerable individuals in the population, among other steps. In this work, we present a computationally automated strategy for predicting the outcome of the patient biopsy, given risk patterns from individual medical records. We propose a machine learning technique that allows a joint and fully supervised optimization of dimensionality reduction and classification models. We also build a model able to highlight relevant properties in the low dimensional space, to ease the classification of patients. We instantiated the proposed approach with deep learning architectures, and achieved accurate prediction results (top area under the curve AUC = 0.6875) which outperform previously developed methods, such as denoising autoencoders. Additionally, we explored some clinical findings from the embedding spaces, and we validated them through the medical literature, making them reliable for physicians and biomedical researchers.

“…MCC is a correlation coefficient that describes the quality of a binary correlation between the observed and predicted classification, and is unaffected by large differences in population size. It can take a value between −1 and 1, where 0 means a completely random prediction, whereas −1 and 1 means a perfectly wrong and perfect prediction, respectively.…”

Section: Methods (mentioning)

confidence: 99%
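The range described in the statement above follows directly from the standard confusion-matrix form of the Matthews correlation coefficient. The sketch below uses that standard formula; the function name is hypothetical, and returning 0 when the denominator is zero is a common convention assumed here rather than something the excerpt specifies.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined ratio (empty row or column) is scored 0.
    return numerator / denominator if denominator else 0.0

print(mcc(tp=5, tn=5, fp=0, fn=0))  # every prediction correct
print(mcc(tp=0, tn=0, fp=5, fn=5))  # every prediction wrong
print(mcc(tp=5, tn=5, fp=5, fn=5))  # no better than chance
```

Because all four cells of the confusion matrix enter the formula, a classifier cannot reach a high score by predicting only the majority class, which is why the metric is often preferred on imbalanced data.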

Identification of phenobarbital and other barbiturates in forensic drug screening using positive electrospray ionization liquid chromatography−high resolution mass spectrometry

Høj, Mollerup, Rasmussen, et al. 2019

Drug Testing and Analysis


Comprehensive drug‐screening performed by liquid chromatography−high resolution mass spectrometry (LC−HRMS) enables identification of hundreds to thousands of drug compounds in a single analysis. Forensic drug screening is generally performed with positive electrospray ionization (ESI+), targeting basic drugs; however, a few toxicologically important drugs such as barbiturates, may require analysis by negative ESI. In this work, screening targets for barbiturates were determined using our LC−HRMS screening with ESI+. For several years, our forensic whole blood samples have been analyzed using the LC−HRMS−ESI+ screening in parallel with a multi‐target LC–MS/MS−ESI− method. From 2014 to 2018, 23 samples were positive for phenobarbital (0.5−81 mg/kg). Retrospective data analysis of 4816 blood samples (15 positive) revealed several potential screening targets for phenobarbital. The targets were tentatively identified by exact mass and isotopic pattern as uncommon adducts of phenobarbital and as a decomposition product of phenobarbital N‐glucoside (C17H24N2O7). Analysis of a test set containing eight positive (0.5–65 mg/kg phenobarbital) and 31 negative samples supported the use of the observed target m/z 323.0614 at 5.14 minutes, corresponding to the [M + HCOONa+Na]+ adduct of phenobarbital. The [M + HCOONa+Na]+ adduct was confirmed as a screening target for common barbiturates, by analysis of barbiturate reference standards in ESI+/ESI−. The [M + HCOONa+Na]+ adduct allowed retrospective analysis with 91% sensitivity (n = 23) and 100% specificity (n = 4855) for phenobarbital in our existing LC−HRMS−ESI+ screening. The two negative results were the two whole‐blood samples with the lowest phenobarbital concentration (<1.8 mg/kg). Thus, a specialized screening is not necessary and use of this adduct likely enables screening for other barbiturates.


“…Grid search [57,58], Random search [59,60], Bayesian optimization [61][62][63][64], and Gradient-based optimization [65] are four existing methods of tuning parameters. In practice, Bayesian optimization has been shown to obtain better results in fewer evaluations compared to grid search and random search due to the ability to reason with respect to the quality of experiments before they are run [61][62][63][64].…”

Section: Parameter Tuning (mentioning)

confidence: 99%
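The first two tuning strategies listed in the statement above differ only in how candidate settings are generated, which a toy example can show. Everything in this sketch is hypothetical: the `score` function is a stand-in for a real validation score, and the parameter names `depth` and `lr` and their ranges are illustrative assumptions. Bayesian and gradient-based optimization are not covered here.

```python
import itertools
import random

def score(params):
    """Hypothetical stand-in for a validation score; peaks at depth=6, lr=0.1."""
    return -((params["depth"] - 6) ** 2) - 100 * (params["lr"] - 0.1) ** 2

def grid_search(grid):
    """Evaluate every combination of the listed values and keep the best."""
    names = list(grid)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*grid.values())]
    return max(candidates, key=score)

def random_search(space, n_trials, seed=0):
    """Evaluate n_trials independently sampled settings and keep the best."""
    rng = random.Random(seed)
    candidates = [{name: sample(rng) for name, sample in space.items()}
                  for _ in range(n_trials)]
    return max(candidates, key=score)

grid = {"depth": [2, 4, 6, 8], "lr": [0.01, 0.1, 1.0]}
best_grid = grid_search(grid)

# Each entry draws one value; the learning rate is sampled on a log scale.
space = {
    "depth": lambda rng: rng.randint(2, 8),
    "lr": lambda rng: 10 ** rng.uniform(-2, 0),
}
best_random = random_search(space, n_trials=12)

print(best_grid)
print(best_random)
```

Both searches here spend 12 evaluations. Grid search is limited to the listed values, whereas random search can land anywhere in the range, at the price of a result that depends on the seed.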

Identifying Modes of Driving Railway Trains from GPS Trajectory Data: An Ensemble Classifier-Based Approach

Zheng, Cui, Zhang 2018

IJGI


Recognizing Modes of Driving Railway Trains (MDRT) can help to solve railway freight transportation problems in driver behavior research, auto-driving system design and capacity utilization optimization. Previous studies have focused on analyses and applications of MDRT, but there is currently no approach to automatically and effectively identify MDRT in the context of big data. In this study, we propose an integrated approach including data preprocessing, feature extraction, classifiers modeling, training and parameter tuning, and model evaluation to infer MDRT using GPS data. The highlights of this study are as follows: First, we propose methods for extracting Driving Segmented Standard Deviation Features (DSSDF) combined with classical features for the purpose of improving identification performances. Second, we find the most suitable classifier for identifying MDRT based on a comparison of performances of K-Nearest Neighbor, Support Vector Machines, AdaBoost, Random Forest, Gradient Boosting Decision Tree, and XGBoost. From the real-data experiment, we conclude that: (i) The ensemble classifier XGBoost produces the best performance with an accuracy of 92.70%; (ii) The group of DSSDF plays an important role in identifying MDRT with an accuracy improvement of 11.2% (using XGBoost). The proposed approach has been applied in capacity utilization optimization and new driver training for the Baoshen Railway.

