2014
DOI: 10.1155/2014/717092

A Novel Feature Selection Technique for Text Classification Using Naïve Bayes

Abstract: With the proliferation of unstructured data, text classification or text categorization has found many applications in topic classification, sentiment analysis, authorship identification, spam detection, and so on. Many classification algorithms are available; naïve Bayes remains one of the oldest and most popular. On the one hand, naïve Bayes is simple to implement; on the other hand, it requires only a small amount of training data. From the literature review, it is found that naïve B…
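Since the abstract centers on naïve Bayes for text classification, a minimal sketch may help fix ideas. It assumes scikit-learn is available; the toy spam/ham corpus and labels are invented for illustration and are not from the paper.

```python
# Minimal naive Bayes text classification sketch (assumes scikit-learn).
# The corpus, labels, and spam/ham task are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = [
    "cheap pills buy now",       # spam
    "meeting agenda attached",   # ham
    "win a free prize today",    # spam
    "quarterly report draft",    # ham
]
labels = ["spam", "ham", "spam", "ham"]

# Bag-of-words counts feed the multinomial model, which estimates
# P(term | class) from training counts and applies Bayes' rule at prediction.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["free pills today"]))  # expected: ['spam']
```

The small training requirement mentioned in the abstract follows from the model's independence assumption: only per-class term frequencies are estimated, never term interactions.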

Cited by 54 publications (24 citation statements)
References 8 publications (11 reference statements)
“…Otherwise, the MI value reaches its maximum when the feature distribution is intra-class only. The work presented in [26] suggested that features may convey similar information in the feature space. In line with that, features conveying similar information are grouped to select the most representative features from each group.…”
Section: Feature Subset Selection (FSS)
Citation type: mentioning
confidence: 99%
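To make the quoted mutual-information criterion concrete, here is a hedged sketch, assuming scikit-learn, of scoring each term by its mutual information with the class label; MI peaks for terms concentrated in a single class, matching the intra-class observation above. The tiny corpus and labels are invented for illustration.

```python
# Hedged sketch: rank terms by mutual information with the class label
# (assumes scikit-learn; corpus and labels are illustrative only).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

docs = ["buy cheap pills", "meeting agenda", "cheap prize inside", "agenda attached"]
y = np.array([1, 0, 1, 0])  # 1 = spam, 0 = ham

vec = CountVectorizer()
X = vec.fit_transform(docs)

# MI between each term's occurrence counts and the class label; a term that
# appears in only one class gets the highest score.
scores = mutual_info_classif(X, y, discrete_features=True)
for term, s in sorted(zip(vec.get_feature_names_out(), scores), key=lambda p: -p[1]):
    print(f"{term}: {s:.3f}")
```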
“…The k-means clustering algorithm works iteratively to assign each feature to one of the k clusters based on the similarity of the information the features convey. To determine the optimal number of clusters (k), as mentioned in [26]…”
Section: Feature Subset Selection (FSS)
Citation type: mentioning
confidence: 99%
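The quoted feature-clustering step can be sketched as follows, assuming features are represented by their document-occurrence vectors and clustered with scikit-learn's KMeans; the choice k=2 and the nearest-to-centroid representative rule are illustrative stand-ins, not the exact procedure of [26].

```python
# Hedged sketch: cluster terms with k-means and keep one representative per
# cluster (assumes scikit-learn; k=2 and the corpus are illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

docs = ["buy cheap pills now", "meeting agenda attached",
        "cheap pills prize", "agenda for the meeting"]

vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()
terms = vec.get_feature_names_out()

# Each feature (term) is described by its column of the term-document
# matrix, i.e. its occurrence pattern across documents.
term_vectors = X.T

k = 2
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(term_vectors)

# Keep the term nearest each centroid as that group's representative.
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(term_vectors[members] - km.cluster_centers_[c], axis=1)
    print(f"cluster {c}: representative = {terms[members[np.argmin(dists)]]}")
```

Keeping one representative per cluster shrinks the vocabulary from |V| terms to k while retaining a spokesman for each group of similarly distributed features.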
“…A number of works addressing the problem of text classification through feature selection can be traced in recent years. Although feature selection algorithms such as chi-square, information gain, and mutual information (Yang and Pedersen, 1997) seem to be powerful techniques for text data, a number of novel feature selection algorithms have been proposed, based on genetic algorithms (Bharti and Singh, 2016; Ghareb et al., 2016), ant colony optimization (Dadaneh et al., 2016; Moradi and Gholampour, 2016; Uysal, 2016; Meena et al., 2012), the Bayesian principle (Zhang et al., 2016; Feng et al., 2012; Fenga et al., 2015; Sarkar et al., 2014), clustering of features (Bharti and Singh, 2015), global information gain (Shang et al., 2013), adaptive keywords (Tasci and Gungor, 2013), and global ranking (Pinheiro et al., 2012; Pinheiro et al., 2015).…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…To achieve a high classification result in a Web Page Classification (WPC) system, an excellent representation of the textual data (Preprocessing/DR) should retain as much information as possible from the original document [8]. Also, the accuracy of most classification algorithms depends on the quality and size of the training data, which is inherently dependent on the document representation technique [9]. Several researchers have contributed to the document representation stage of web page classification systems because irrelevant and redundant features often degrade the performance of classification algorithms in both speed and classification accuracy, and removing them also tends to reduce overfitting [10].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
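The excerpt's point that irrelevant and redundant features slow classifiers and hurt accuracy is commonly handled by a filter step inside the document representation stage. Below is a minimal sketch, assuming scikit-learn, that uses chi-square selection (one of the filters named in the Related Work excerpt above); the corpus, labels, and k=4 are illustrative only.

```python
# Hedged sketch: filter weakly relevant terms with chi-square before training
# (assumes scikit-learn; corpus, labels, and k=4 are illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap pills buy now", "meeting agenda attached",
        "free pills prize today", "agenda quarterly report"]
labels = ["spam", "ham", "spam", "ham"]

pipe = make_pipeline(
    CountVectorizer(),        # document representation stage
    SelectKBest(chi2, k=4),   # keep the 4 terms most associated with labels
    MultinomialNB(),
)
pipe.fit(docs, labels)
print(pipe.predict(["free prize pills"]))  # expected: ['spam']
```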