A Novel Feature Selection Method for Classification of Medical Documents from Pubmed

Imambi, S. Sagar; Sudha, T.

doi:10.5120/3131-4315

Cited by 4 publications

(2 citation statements)

References 6 publications

“…Preprocessing comprises unstructured data and it is not feasible to classify the documents directly using data mining techniques. Preprocessing is essential in text mining and in performing, the documents are converted into a list of features or keywords by performing stop words removal and word stemming process are performed [11]. Stop words are the words which occur repeatedly in the documents and they do not provide any meaning within the document. Stop words are "and", "are", "this" and so on.…”

Section: B Preprocessing (mentioning)

confidence: 99%

Discriminant Pearson Correlative Feature Selection based Gentle Adaboost Classification for Medical Document Mining

2019

IJRTE

This paper examines Discriminant Pearson Correlative Analysis Based Multivariate Gentle Adaboost Classification (DPCA-MGAC) and it is used to improve the performance of medical document mining with minimum time complexity. A large number of documents are collected from PubMed databases through the semantic-based search. Processes such as removing stop words, stemming, features identification, selection of features i.e., relevant keywords for document classification are carried out. The significant feature selection is carried out using DPCA, and with the selected features the documents are categorized into different classes using MGAC. This classification process combines the results of all weak learners and makes a strong classification in order to improve the precision of medical data mining and minimizes the false positive rate. Experimental evaluation has been performed using PubMed database.
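
The stop-word removal and stemming steps mentioned in the citation statement and abstract above can be sketched roughly as follows. This is a minimal illustration, not the cited paper's implementation: the stop-word list, the suffix list, and the function names `naive_stem` and `preprocess` are assumptions made for the example, and the suffix stripping is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Illustrative stop-word list (an assumption; real systems use much larger lists).
STOP_WORDS = {"and", "are", "this", "the", "of", "in", "a", "is", "to"}

# Suffixes tried in order; a crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


features = preprocess("This and the documents are classified")
print(features)
```

The resulting list of stemmed keywords is the kind of feature list that a later feature-selection step would operate on.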

“…As the word count increases, TF-IDF value also increases in a direct proportion, but is offset by the occurrence of the word in the set of documents to control for the fact that some words are generally more common than others (Sitaula et al, 2012; Sagar et al, 2011). But an ideal document consistently makes use of synonyms for a single word so that same words generally do not repeat.…”

Section: Similarity and Performance Measures (mentioning)

confidence: 99%

Multi-Step Iterative Algorithm for Feature Selection on Dynamic Documents

Bafna, Shirwaikar, Pramod

2016

International Journal of Information Retrieval Research

The authors propose a clustering-based multistep iterative algorithm. The important step is where terms are grouped by synonyms. It takes advantage of a semantic relativity measure between the terms. Term frequency is computed for each group of synonyms by considering the relativity measure of the terms appearing in the document from the parent term in the group. This increases the importance of terms that individually appear less frequently but together show a strong presence. The authors ran experiments on different real and artificial datasets such as NEWS 20, Reuters, emails, and research papers on different topics. The resulting entropy shows that their algorithm gives improved results on certain sets of documents that are well articulated, such as research papers. The results are marginal on documents where the message is emphasized by repetition of terms, specifically documents that are rapidly generated, such as emails. The authors also observed that newly arrived documents get appropriately mapped based on proximity to the semantic group.
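
The TF-IDF behaviour described in the citation statement above (a weight that grows with a term's count in a document but is offset by how many documents contain the term) can be sketched as below. This assumes one common variant, relative term frequency multiplied by `log(N / df)`; the cited papers may use a different formulation, and the function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical tokens, not taken from any dataset in the source).
docs = [["gene", "gene", "cell"], ["cell", "drug"]]
weights = tf_idf(docs)
print(weights)
```

In this toy corpus, "cell" appears in every document, so its weight is zero, while "gene" is weighted up because it repeats within one document and appears in no other.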
