2014
DOI: 10.1016/j.aca.2013.10.050

An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data

Abstract: It is common that imbalanced datasets are often generated from high-throughput screening (HTS). For a given dataset without taking into account the imbalanced nature, most classification methods tend to produce high predictive accuracy for the majority class, but significantly poor performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with Synthetic Minority Over-sampling TEchnique (SMOTE) is developed and utilized to overcome the problem for several imbalanced datasets fr…

Citations: cited by 56 publications (44 citation statements)
References: 54 publications (63 reference statements)
“…The first one is obtained by using the spectrum kernel (denoted by SK, the hyper-parameter kmers is set to 3 in this work) [22], and the other one is obtained by using Clustal Omega (denoted by CO) [23], for which Clustal Omega gives the distant matrix (denoted by distM), and (1 – distM) is calculated to obtain the similarity matrix. For the drug compounds, the similarity matrix is obtained by using the PubChem fingerprint (denoted by PCFP, ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt), which has been successfully applied in other research [24, 25]. As a result, we obtained two sets of combined matrices: SK-PCFP and CO-PCFP.…”
Section: Results (mentioning)
confidence: 99%
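
The two sequence-similarity constructions described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the citing authors' code: the function names and the toy inputs are hypothetical, the spectrum kernel is shown unnormalized (the statement does not say whether normalization was applied), and the distance matrix is assumed to be already scaled to [0, 1] so that (1 – distM) behaves as a similarity.

```python
from collections import Counter

import numpy as np


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence (k=3 as in the statement)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Unnormalized spectrum kernel: dot product of the two k-mer count vectors."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return sum(ca[m] * cb[m] for m in ca.keys() & cb.keys())


def similarity_from_distance(dist_m):
    """Turn a distance matrix into a similarity matrix as (1 - distM).

    Assumes the distances are already scaled to [0, 1].
    """
    return 1.0 - np.asarray(dist_m, dtype=float)


# Hypothetical toy inputs, for illustration only.
k_value = spectrum_kernel("ABCABC", "ABC")
sim = similarity_from_distance([[0.0, 0.25], [0.25, 0.0]])
```

In the first call, the 3-mer "ABC" occurs twice in the first string and once in the second, and no other 3-mer is shared, so the kernel value is 2.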
“…This review indicates sections among the PubChem resources that have not been fully explored, and highlights fields that are worthwhile for further research investigation or future improvement of PubChem: (i) the chemical probes available in PubChem, which were generated by the Molecular Libraries Initiative as small molecule tools, are to be exploited for unraveling complex biological and disease related systems (http://www.ncbi.nlm.nih.gov/books/NBK47352/); (ii) the RNAi screening data in PubChem remained largely unnoticed, which together with small molecule bioassays can provide useful insights to the biological systems under investigation, as well as to understand the genetic basis of diseases [54,55]; (iii) integration of PubChem assay targets including proteins, genes and pathways with genomic data and disease information represents other interesting but less explored research areas such as polypharmacology, drug repurposing and personalized medicine [56]; (iv) text mining on bioassay data with rich descriptions on disease and targets, and recently added patent information toward data integration for exploring drug-target-disease relationships is currently scarce; (v) the HTS data in nature are often highly imbalanced and noisy, making it challenging for data mining and modeling. Despite a number of previous attempts [57–59], it still demands efforts from researchers and PubChem for developing methods to handle these issues.…”
Section: Discussion (mentioning)
confidence: 99%
“…More recently, Hao et al. 93 applied the synthetic minority oversampling technique (SMOTE) 96 to tackle the HTS data set imbalance issue. Unlike the traditional oversampling method, SMOTE oversamples the minority class by creating “synthetic” samples along the line segments connecting the original minority-class samples with their k-nearest neighbors (kNN).…”
Section: Dealing With Data Imbalance Issues In Pubchem Data (mentioning)
confidence: 99%
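
The interpolation step described in the statement above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the implementation used in the cited paper: the function name `smote_sketch`, its parameters, the use of Euclidean distance, and the brute-force neighbor search are all illustrative choices, and the coupling with GLMBoost is not shown.

```python
import numpy as np


def smote_sketch(x_min, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating toward k-nearest neighbors.

    x_min: array-like of shape (n_samples, n_features), minority class only.
    """
    rng = np.random.default_rng(seed)
    x_min = np.asarray(x_min, dtype=float)
    n = x_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Brute-force pairwise Euclidean distances among minority samples.
    diff = x_min[:, None, :] - x_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample, then one of its k neighbors, at random.
    base = rng.integers(0, n, size=n_synthetic)
    chosen = neighbors[base, rng.integers(0, k, size=n_synthetic)]

    # Place each synthetic point at a random position on the connecting segment.
    gap = rng.random((n_synthetic, 1))
    return x_min[base] + gap * (x_min[chosen] - x_min[base])


# Hypothetical toy data: two minority points, so every synthetic sample
# lies on the segment between them.
synthetic = smote_sketch([[0.0, 0.0], [1.0, 1.0]], n_synthetic=20, k=1)
```

Because each synthetic point is a convex combination of two real minority samples, it always falls between them rather than duplicating either one, which is the contrast with traditional oversampling that the statement draws.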