A New Feature Selection Method for Text Classification Based on Independent Feature Space Search

Liu, Yong; Ju, Shenggen; Wang, Junfeng; Su, Chong

doi:10.1155/2020/6076272

Cited by 22 publications

(14 citation statements)

References 31 publications

“…The latest advances in feature selection are a combination of feature selection with deep learning especially the Convolutional Neural Networks (CNN) for classification tasks, such as applications in bioinformatics neurodegenerative disorders classification using the Principal Components Analysis (PCA) algorithm [112,113], brain tumor segmentation [114] using three planar super pixel based statistical and textural features extraction. Next, remote sensing imagery classification using a fusion of CNN and RF [115], and software fault prediction [116] using enhanced binary moth flame optimization as a feature selection, and text classification based on independent feature space search [117].…”

Section: Evaluation Performance and Discussion (mentioning)

confidence: 99%

Selecting critical features for data classification based on machine learning methods

et al. 2020

In machine learning problems, high-dimensional data, especially data with many features, is increasingly common [1]. Many researchers run experiments to address the resulting problems and to extract the important features from high-dimensional variables and data. Statistical techniques have been used to minimize noise and redundant data. Nevertheless, we do not use all the features to train a model. We may improve our model by using correlated, non-redundant features, so feature selection plays an important role.

“…We selected words based on their ability to discriminate between T and C, via a combination of filter and embedded methods [56-60] resulting in 41,664 words (Figure 2). The T document vectors were reduced to include only the 41,664 words (see Figure 1.e for a truncated example).…”

Section: Methods (mentioning)

confidence: 99%

Potential Blood Transfusion Adverse Events Can be Found in Unstructured Text in Electronic Health Records using the “Shakespeare Method”

Bright, Rankin, Dowdy et al. 2021

Preprint

Background: Text in electronic health records (EHRs) and big data tools offer the opportunity for surveillance of adverse events (patient harm associated with medical care) (AEs) in the unstructured notes. Writers may explicitly state an apparent association between treatment and adverse outcome (“attributed”) or state the simple treatment and outcome without an association (“unattributed”). We chose the case of transfusion adverse events (TAEs) and potential TAEs (PTAEs) because real dates were obscured in the study data, and new TAE types were becoming recognized during the study data period.

Objective: Develop a new method to identify attributed and unattributed potential adverse events using the unstructured text of EHRs.

Methods: We used EHRs for adult critical care admissions at a major teaching hospital, 2001-2012. We formed a transfusion (T) group (21,443 admissions treated with packed red blood cells, platelets, or plasma), excluded 2,373 ambiguous admissions, and formed a comparison (C) group of 25,468 admissions. We concatenated the text notes for each admission, sorted by date, into one document, and deleted replicate sentences and lists. We identified statistically significant words in T vs. C. T documents were filtered to those words, followed by topic modeling on the T filtered documents to produce 45 topics. For each topic, the three documents with the maximum topic scores were manually reviewed to identify events that occurred shortly after the first transfusion; documents with clear alternative explanations for heart, lung, and volume overload problems (e.g., advanced cancer, lung infection) were excluded. We also reviewed documents with the most topics, as well as 20 randomly selected T documents without alternate explanations.

Results: Topics centered around medical conditions. The average number of significant topics was 6.1. Most PTAEs were not attributed to transfusion in the notes. Admissions with a top-scoring cardiovascular topic (heart valve repair, tapped pericardial effusion, coronary artery bypass graft, heart attack, or vascular repair) were more likely than random T admissions to have at least one heart PTAE (heart rhythm changes or hypotension, proportion difference = 0.47, p = 0.022). Admissions with a top-scoring pulmonary topic (mechanical ventilation, acute respiratory distress syndrome, inhaled nitric oxide) were more likely than random T admissions (proportion difference = 0.37, p = 0.049) to have at least one lung PTAE (hypoxia, mechanical ventilation, bilateral pulmonary effusion, or pulmonary edema).

Conclusions: The “Shakespeare Method” could be a useful supplement to AE reporting and surveillance of structured EHR data. Future improvements should include automation of the manual review process.
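The word-filtering step in the Methods (keep only words that differ significantly between T and C, then reduce the T documents to those words) could look roughly like the sketch below. The abstract does not say which statistical test was used, so the Pearson chi-square test on document frequencies, the `alpha` threshold, and the names `chi2_p_value`, `significant_words`, and `filter_docs` are all assumptions made for illustration. No multiple-testing correction is applied here.

```python
import math


def chi2_p_value(a, b, c, d):
    """P-value of a Pearson chi-square test (1 degree of freedom, no
    continuity correction) on a 2x2 table.

    a: T docs containing the word    b: T docs without it
    c: C docs containing the word    d: C docs without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # The word is in every document or in none: no evidence of a difference.
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def significant_words(t_docs, c_docs, alpha=0.05):
    """Words whose document frequency differs between T and C (assumed test)."""
    vocab = set().union(*t_docs, *c_docs)
    selected = set()
    for word in vocab:
        a = sum(1 for doc in t_docs if word in doc)
        c = sum(1 for doc in c_docs if word in doc)
        if chi2_p_value(a, len(t_docs) - a, c, len(c_docs) - c) < alpha:
            selected.add(word)
    return selected


def filter_docs(docs, keep):
    """Reduce each tokenized document to the selected words, preserving order."""
    return [[w for w in doc if w in keep] for doc in docs]


# Toy data (hypothetical): "transfusion" appears only in T documents.
t_docs = [["transfusion", "patient", "fever"]] * 5 + [["transfusion", "patient"]] * 5
c_docs = [["patient", "fever"]] * 5 + [["patient"]] * 5

keep = significant_words(t_docs, c_docs)
t_filtered = filter_docs(t_docs, keep)
```

On this toy data only "transfusion" is kept, because "patient" and "fever" occur equally often in both groups. The filtered documents would then be the input to topic modeling.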

“…Peng & Fan [52] | 2017 | By optimizing lower bound of conditional mutual information
SFR [54] | 2018 | Uses subspace feature clustering to identify feature clusters
CFS [55] | 2018 | Similar to MRMR and uses composition of feature relevancy
Wang et al [59] | 2019 | Uses rough set theory based relative neighborhood self-information on both lower and upper approximations.
PRFS [60] | 2020 | Proportional Rough Feature Selection based on rough set for regional distinction
Liu et al [61] | 2020 | Independent feature space search using relative doc-term frequency difference for class correlation and redundancy
Hossny et al [62] | 2020 | Uses text mining specifics e.g., word count, word forms such as n-gram, skip-gram, etc.
Gao et al [65] | 2020 | min-redundancy and max-dependency (MRMD) using relevancy with a class given selected features…”

Section: Selection Methods | Year | Key Idea/advantage/application (mentioning)

confidence: 99%

A Fast Non-Redundant Feature Selection Technique for Text Data

et al. 2020

Feature selection is critical in reducing the size of data and improving classifier accuracy by selecting an optimum subset of the overall features. Traditionally, each feature is given a score against a particular category (such as using Mutual Information) and the task of feature selection comes down to choosing the top k ranked features with the best average score across all categories. However, this approach has two major drawbacks. Firstly, the maximum or average score of a feature with a class might not necessarily determine its discriminating strength among samples of other classes. Secondly, most feature selection methods only use the scores to select the discriminating features from the corpus without taking into account the redundancy of information provided by the selected features. In this paper, we propose a new feature ranking score measure called the Discriminative Mutual Information (DMI) score. This score helps to select features that distinguish samples of one category against all other categories. Moreover, a Non-Redundant Feature Selection (NRFS) heuristic is also proposed that explicitly takes the problem of feature redundancy into account when selecting the feature set. The performance of our approach is investigated and compared with other feature selection techniques on datasets derived from high-dimensional text corpora using multiple classification algorithms. The results show that the proposed method leads to a better classification micro-F1 score compared with other state-of-the-art methods. In particular, the proposed method shows great improvement when the number of selected features is small, as well as an overall higher robustness to label noise.
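The traditional baseline described in this abstract (score each term against each category, average over categories, keep the top k) can be sketched as follows. This is only an illustration of that baseline, not the DMI score or the NRFS heuristic, whose definitions are not given here. Documents are assumed to be sets of tokens, the score is mutual information between term presence and one-vs-rest class membership, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(n11, n10, n01, n00):
    """Mutual information (bits) between term presence and class membership.

    n11: docs in the class with the term     n10: docs outside the class with it
    n01: docs in the class without the term  n00: docs outside the class without it
    """
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, term marginal, class marginal).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # empty cells contribute nothing
            mi += (joint / n) * math.log2(joint * n / (term_marginal * class_marginal))
    return mi


def top_k_by_average_mi(docs, labels, k):
    """Rank terms by MI averaged over all classes and return the best k."""
    classes = sorted(set(labels))
    vocab = sorted({term for doc in docs for term in doc})
    class_sizes = Counter(labels)
    n = len(docs)
    scores = {}
    for term in vocab:
        total = 0.0
        for cls in classes:
            n11 = sum(1 for d, y in zip(docs, labels) if term in d and y == cls)
            n10 = sum(1 for d, y in zip(docs, labels) if term in d and y != cls)
            n01 = class_sizes[cls] - n11
            n00 = n - class_sizes[cls] - n10
            total += mutual_information(n11, n10, n01, n00)
        scores[term] = total / len(classes)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


# Toy corpus (hypothetical data).
docs = [
    {"goal", "match", "the"},
    {"goal", "team", "the"},
    {"vote", "party", "the"},
    {"vote", "election", "the"},
]
labels = ["sport", "sport", "politics", "politics"]

top, scores = top_k_by_average_mi(docs, labels, k=2)
```

On this toy corpus the two selected terms are "goal" and "vote", while "the" scores zero because it appears in every document. Ranking each term independently never checks whether two selected terms carry the same information, which is the redundancy drawback the abstract raises.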
