2003
DOI: 10.1093/bioinformatics/btg182
Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions

Abstract: Using very simple classifiers, we show for several publicly available microarray and proteomics datasets how these curses influence classification outcomes. In particular, even if the sample per feature ratio is increased to the recommended 5-10 by feature extraction/reduction methods, dataset sparsity can render any classification result statistically suspect. In addition, several 'optimal' feature sets are typically identifiable for sparse datasets, all producing perfect classification results, both for the …


Cited by 289 publications (223 citation statements)
References 44 publications
“…For mass spectra in a biomedical context, SFRs are typically in the range 1/20 to 1/500. In machine-learning approaches such as neural networks, the conventional solution is to reduce the dimensionality of the feature space by variable selection (Somorjai et al., 2003). The multivariate methods described and used herein rely on linear-algebraic operations and are transparent, in contrast to neural nets (for example), which are often seen as black boxes.…”
Section: Figure
confidence: 99%
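The variable-selection step this excerpt alludes to can be sketched as a simple univariate filter: rank each feature by how well it separates the two classes, then keep only the top few. The t-statistic ranking below is an illustrative assumption for demonstration, not the specific method of Somorjai et al. or the citing work.

```python
import numpy as np

def select_top_k_features(X, y, k):
    """Rank features by absolute two-class t-statistic and keep the top k.

    A minimal univariate variable-selection filter: for each feature,
    compare the class means relative to the within-class spread.
    """
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
    n0, n1 = len(X0), len(X1)
    t = (m1 - m0) / np.sqrt(v0 / n0 + v1 / n1 + 1e-12)
    idx = np.argsort(-np.abs(t))[:k]   # indices of the k largest |t|
    return np.sort(idx)

# Synthetic, spectra-like regime: 40 samples, 500 features (SFR far below 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))
y = np.repeat([0, 1], 20)
X[y == 1, 3] += 3.0   # plant class signal in feature 3
X[y == 1, 7] -= 3.0   # and in feature 7
print(select_top_k_features(X, y, 2))
```

Note that on sparse data like this, many different feature subsets can score almost equally well, which is exactly the caveat the cited paper raises.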
“…6 For this reason, the features containing little discriminative power must be discarded, which means that we will need a large number of cases (about three to five times more cases than the number of features to extract) from which to extract those features. 7,8 However, in real life, the number of available cases will be limited by epidemiology and budget. This problem is called the curse of dimensionality.…”
Section: How Many Data?
confidence: 99%
“…This problem is called the curse of dimensionality. 7 It is common to try to compensate for this restriction with large multicenter studies 8 (see Clinical Trials of MRS Methods), which brings an added problem: data compatibility in the face of slightly different acquisition conditions (field strength, localization pulse sequence, echo time (TE), and recycling time, among others). These factors will introduce variability or noise into the classifier training process.…”
Section: How Many Data?
confidence: 99%
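The heuristic quoted above, that one needs roughly three to five times more cases than features, implies a hard cap on how many features a study of a given size can support. A minimal sketch of that back-of-the-envelope check (the function name and the 60-patient example are illustrative assumptions):

```python
def max_features(n_samples, samples_per_feature=5):
    """Upper bound on the number of features to extract, under the
    heuristic that one needs roughly 3-5 times more cases than
    features (default: the conservative 5x rule)."""
    return n_samples // samples_per_feature

# A hypothetical 60-patient study:
print(max_features(60))      # 5x rule -> at most 12 features
print(max_features(60, 3))   # looser 3x rule -> at most 20 features
```

This is why the quoted excerpt points to large multicenter studies: the only way to raise the feature budget is to raise the case count, at the cost of added acquisition variability.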
“…Recently, use of this approach for predictive toxicology using gene expression profiles has been reported by several groups. (6)(7)(8)(9)(10)(11)(12)(13)(14)(15)(16)(17) In the present study, we aimed to provide a basis for a rapid and easy method to predict carcinogenicity of chemicals based on microarray technology with cultured MH1C1 rat hepatoma cells, selected as a model system to minimize complicating factors such as cell type heterogeneity and interindividual differences of animals. For this purpose, 39 chemicals that have been well characterized for carcinogenicity were first tested for cytotoxicity in cells using reductase activity at 3 days and non-toxic doses were determined for measurement of gene expression.…”
confidence: 99%