We describe a method for sparse feature selection for a class of problems motivated by our work in Computer-Aided Detection (CAD) systems for identifying structures of interest in medical images. Typical CAD data sets for classification are large (several thousand candidates) and unbalanced (significantly fewer than 1% of the candidates are "positive"). To be accepted by physicians, CAD systems must generalize well, with extremely high sensitivity and very few false positives. To find the features that lead to superior generalization, researchers typically generate a large number of experimental features for each candidate. So many features are needed because there are no definitive methods for capturing the shape and image-based characteristics that correspond to the diagnostic features physicians use to identify structures of interest in the image, such as cancerous polyps in a CT (computed tomography) volume of a patient's colon. Thus, 100 or more shape, texture, and intensity-based features may be generated for each candidate at various levels of resolution. We propose a sparse formulation for Fisher Linear Discriminant (FLD) that scales well to large datasets; our method inherits all the desirable properties of FLD while handling large numbers of irrelevant and redundant features better. We demonstrate that our sparse FLD formulation outperforms conventional FLD and two other feature selection methods from the literature on both an artificial dataset and a real-world Colon CAD dataset.
Existing patient records are a valuable resource for automated outcomes analysis and knowledge discovery. However, key clinical data in these records is typically recorded in unstructured form as free text and images, and most structured clinical information is poorly organized. Time-consuming interpretation and analysis are required to convert these records into structured clinical data. Thus, only a tiny fraction of this resource is utilized. We present REMIND, a Bayesian Framework for Reliable Extraction and Meaningful Inference from Nonstructured Data. REMIND integrates and blends the structured and unstructured clinical data in patient records to automatically create high-quality structured clinical data. This structuring allows existing patient records to be mined for quality assurance and regulatory compliance, and to relate financial and clinical factors. We demonstrate REMIND on two medical applications: (a) extracting "recurrence", the key outcome for measuring treatment effectiveness, for colon cancer patients; and (b) extracting key diagnoses and complications for acute myocardial infarction (heart attack) patients, and demonstrating the impact of these clinical factors on financial outcomes.