Modern biomedical data mining requires feature selection methods that can (1) be applied to large-scale feature spaces (e.g. 'omics' data), (2) function in noisy problems, (3) detect complex patterns of association (e.g. gene-gene interactions), (4) be flexibly adapted to various problem domains and data types (e.g. genetic variants, gene expression, and clinical data) and (5) be computationally tractable. To that end, this work examines a set of filter-style feature selection algorithms inspired by the 'Relief' algorithm, i.e. Relief-Based algorithms (RBAs). We implement and expand these RBAs in an open-source framework called ReBATE (Relief-Based Algorithm Training Environment). We conduct a comprehensive genetic simulation study comparing existing RBAs, a proposed RBA called MultiSURF, and other established feature selection methods, over a variety of problems. The results of this study (1) support the assertion that RBAs are particularly flexible, efficient, and powerful feature selection methods that differentiate relevant features having univariate, multivariate, epistatic, or heterogeneous associations, (2) confirm the efficacy of expansions for classification vs. regression, discrete vs. continuous features, missing data, multiple classes, or class imbalance, (3) identify previously unknown limitations of specific RBAs, and (4) suggest that while MultiSURF* performs best for explicitly identifying pure 2-way interactions, MultiSURF yields the most reliable feature selection performance across a wide range of problem types.
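As a rough illustration of the idea the RBAs build on, the following is a minimal sketch of the basic Relief weight update for discrete features: each instance lowers the score of features that differ from its nearest same-class neighbor (hit) and raises the score of features that differ from its nearest other-class neighbor (miss). It is not ReBATE's API and not MultiSURF; the function names, the use of every instance rather than a random sample, and the index-order tie-breaking are assumptions made for this example.

```python
def diff(a, b):
    """Per-feature difference for discrete values: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def distance(x, y):
    """Number of features on which two instances disagree."""
    return sum(diff(a, b) for a, b in zip(x, y))


def relief_scores(X, y):
    """Basic Relief-style feature scores (hypothetical helper, toy scale).

    Assumptions: discrete features, every instance is used once as the
    target, and ties between neighbors are broken by index order.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for i in range(n):
        hit = miss = None  # (distance, index) of nearest hit / nearest miss
        for j in range(n):
            if j == i:
                continue
            d = distance(X[i], X[j])
            if y[j] == y[i]:
                if hit is None or d < hit[0]:
                    hit = (d, j)
            elif miss is None or d < miss[0]:
                miss = (d, j)
        if hit is None or miss is None:
            continue  # no same-class or no other-class neighbor exists
        for f in range(p):
            w[f] -= diff(X[i][f], X[hit[1]][f]) / n   # penalize hit differences
            w[f] += diff(X[i][f], X[miss[1]][f]) / n  # reward miss differences
    return w


# Toy data: class = f0 XOR f1 (a pure 2-way interaction); f2 is noise.
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, _ in X]
print(relief_scores(X, y))  # -> [0.5, 0.5, -1.0]
```

On this toy problem neither f0 nor f1 is associated with the class on its own, yet both score above the noise feature, which is the kind of interaction sensitivity the abstract attributes to RBAs.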
Background: A combination of biomarkers in a multivariate model may predict disease with greater accuracy than a single biomarker employed alone. We developed a non-linear method of multivariate analysis, weighted digital analysis (WDA), and evaluated its ability to predict lung cancer employing volatile biomarkers in the breath.
We sought biomarkers of breast cancer in the breath because the disease is accompanied by increased oxidative stress and induction of cytochrome P450 enzymes, both of which generate volatile organic compounds (VOCs) that are excreted in breath. We analyzed breath VOCs in 54 women with biopsy-proven breast cancer and 204 cancer-free controls, using gas chromatography/mass spectroscopy. Chromatograms were converted into a series of data points by segmenting them into 900 time slices (8 s duration, 4 s overlap) and determining their alveolar gradients (abundance in breath minus abundance in ambient room air). Monte Carlo simulations identified time slices with better than random accuracy as biomarkers of breast cancer by excluding random identifiers. Patients were randomly allocated to training sets or test sets in 2:1 data splits. In the training sets, time slices were ranked according to their C-statistic values (area under the receiver operating characteristic curve), and the top ten time slices were combined in multivariate algorithms that were cross-validated in the test sets. Monte Carlo simulations identified an excess of correct over random time slices, consistent with non-random biomarkers of breast cancer in the breath. The outcomes of ten random data splits (mean (standard deviation)) in the training sets were sensitivity = 78.5% (6.14), specificity = 88.3% (5.47), C-statistic = 0.89 (0.03) and in the test sets, sensitivity = 75.3% (7.22), specificity = 84.8% (9.97), C-statistic = 0.83 (0.06). A breath test identified women with breast cancer, employing a combination of volatile biomarkers in a multivariate algorithm.
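The slicing and ranking steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names are hypothetical, slices are assumed to start at t = 0 with a step equal to duration minus overlap, the C-statistic is computed by direct pairwise comparison (ties count one half), and slices are ranked by the raw C-statistic. The toy numbers are invented for the example.

```python
def slice_bounds(n_slices=900, duration=8.0, overlap=4.0):
    """(start, end) times in seconds of overlapping chromatogram slices."""
    step = duration - overlap  # 8 s slices with 4 s overlap advance by 4 s
    return [(k * step, k * step + duration) for k in range(n_slices)]


def alveolar_gradient(breath, room_air):
    """Abundance in breath minus abundance in room air, slice by slice."""
    return [b - r for b, r in zip(breath, room_air)]


def c_statistic(cases, controls):
    """Area under the ROC curve: P(case value > control value)."""
    wins = 0.0
    for x in cases:
        for v in controls:
            if x > v:
                wins += 1.0
            elif x == v:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def top_slices(case_rows, control_rows, k=10):
    """Rank slices by C-statistic; return the k best as (slice index, C)."""
    n_features = len(case_rows[0])
    scored = []
    for j in range(n_features):
        c = c_statistic([row[j] for row in case_rows],
                        [row[j] for row in control_rows])
        scored.append((j, c))
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


bounds = slice_bounds()
print(len(bounds), bounds[0], bounds[-1])  # -> 900 (0.0, 8.0) (3596.0, 3604.0)

# Toy alveolar gradients: rows are subjects, columns are three time slices.
case_rows = [[3.0, 1.0, 0.5], [2.5, 0.8, 0.7]]
control_rows = [[1.0, 1.1, 0.6], [1.5, 0.9, 0.4]]
print(top_slices(case_rows, control_rows, k=2))  # -> [(0, 1.0), (2, 0.75)]
```

In a 2:1 split as described, the ranking would be computed on the training subjects only, so that the test subjects stay untouched until the combined model is evaluated.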
Background: Normal metabolism generates several volatile organic compounds (VOCs) that are excreted in the breath (e.g. alkanes). In patients with lung cancer, induction of high-risk cytochrome P450 genotypes may accelerate catabolism of these VOCs, so that their altered abundance in breath may provide biomarkers of lung cancer. Methods: VOCs in 1.0 L alveolar breath were analyzed in 193 subjects with primary lung cancer and 211 controls with a negative chest CT. Subjects were randomly assigned to a training set or to a prediction set in a 2:1 split. A fuzzy logic model of breath biomarkers of lung cancer was constructed in the training set and then tested in subjects in the prediction set by generating their typicality scores for lung cancer. Results: Mean typicality scores employing a 16 VOC model were significantly higher in lung cancer patients than in the control group (p < 0.0001 in all TNM stages). The model predicted primary lung cancer with 84.6% sensitivity, 80.0% specificity, and 0.88 area under curve (AUC) of the receiver operating characteristic (ROC) curve. Predictive accuracy was similar in TNM stages 1 through 4, and was not affected by current or former tobacco smoking. The predictive model achieved near-maximal performance with six breath VOCs, and was progressively degraded by random classifiers. Predictions with fuzzy logic were consistently superior to multilinear analysis. If applied to a population with 2% prevalence of lung cancer, a screening breath test would have a negative predictive value of 0.985 and a positive predictive value of 0.163 (true positive rate = 0.277, false positive rate = 0.029).
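The predictive values quoted for the 2% prevalence scenario follow from Bayes' rule applied to the stated true positive and false positive rates. The short sketch below reproduces that arithmetic; the function name is hypothetical, and only the three input numbers come from the abstract.

```python
def predictive_values(tpr, fpr, prevalence):
    """Return (PPV, NPV) for a test applied to a population.

    tpr: true positive rate (sensitivity)
    fpr: false positive rate (1 - specificity)
    prevalence: fraction of the population with the disease
    """
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    false_neg = (1.0 - tpr) * prevalence
    true_neg = (1.0 - fpr) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


ppv, npv = predictive_values(tpr=0.277, fpr=0.029, prevalence=0.02)
print(round(ppv, 3), round(npv, 3))  # -> 0.163 0.985
```

Note that these values correspond to the operating point given in parentheses (true positive rate 0.277, false positive rate 0.029), not to the 84.6% sensitivity and 80.0% specificity reported for the prediction set.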
Viral infections cause increased oxidative stress, so a breath test for oxidative stress biomarkers (alkanes and alkane derivatives) might provide a new tool for early diagnosis. We studied 33 normal healthy human subjects receiving scheduled treatment with live attenuated influenza vaccine (LAIV). Each subject served as his or her own control: subjects were studied on day 0 prior to vaccination, and then on days 2, 7 and 14 following vaccination. Breath volatile organic compounds (VOCs) were collected with a breath collection apparatus, then analyzed by automated thermal desorption with gas chromatography and mass spectroscopy. A Monte Carlo simulation technique identified non-random VOC biomarkers of infection based on their C-statistic values (area under the receiver operating characteristic curve). Treatment with LAIV was followed by non-random changes in the abundance of breath VOCs. 2,8-Dimethylundecane and other alkane derivatives were observed on all days. Conservative multivariate models identified vaccinated subjects on day 2 (C-statistic = 0.82, sensitivity = 63.6% and specificity = 88.5%); day 7 (C-statistic = 0.94, sensitivity = 88.5% and specificity = 92.3%); and day 14 (C-statistic = 0.95, sensitivity = 92.3% and specificity = 92.3%). The altered breath VOCs were not detected in live attenuated influenza vaccine, excluding artifactual contamination. LAIV vaccination in healthy humans elicited a prompt and sustained increase in breath biomarkers of oxidative stress. A breath test for these VOCs could potentially identify humans who are acutely infected with influenza, but who have not yet developed clinical symptoms or signs of disease.
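The abstract does not spell out how the Monte Carlo step works. One common way to ask whether a candidate biomarker's C-statistic is better than chance is a label-permutation test, sketched below. This is an assumed stand-in, not the authors' procedure: the helper names, the number of permutations, the fixed seed, and the treatment of the two groups as independent samples (ignoring the within-subject pairing of the study design) are all choices made for this example.

```python
import random


def c_statistic(cases, controls):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for x in cases:
        for v in controls:
            if x > v:
                wins += 1.0
            elif x == v:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def permutation_p_value(values, labels, n_perm=200, seed=0):
    """Observed C-statistic and a Monte Carlo p-value from shuffled labels.

    labels: 1 for the exposed group (e.g. post-vaccination), 0 otherwise.
    The p-value is the share of shuffles whose C-statistic is at least as
    large as the observed one, with the usual add-one correction.
    """
    rng = random.Random(seed)

    def score(current_labels):
        cases = [v for v, lab in zip(values, current_labels) if lab == 1]
        controls = [v for v, lab in zip(values, current_labels) if lab == 0]
        return c_statistic(cases, controls)

    observed = score(labels)
    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between value and label
        if score(shuffled) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Toy data: ten low values labelled 0 and ten high values labelled 1.
values = [float(v) for v in range(20)]
labels = [0] * 10 + [1] * 10
observed, p_value = permutation_p_value(values, labels)
print(observed, p_value)
```

A VOC whose observed C-statistic is rarely matched by shuffled labels would be kept as a non-random candidate; one that is matched often would be treated as a random identifier and excluded.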