Sentiment classification is increasingly used to automatically identify positive or negative sentiment in a text review. In classification, feature selection has always been a critical and challenging problem. Most existing feature selection techniques for sentiment classification struggle to evaluate which features are significant, and this reduces classification performance. This paper proposes an enhanced hybrid feature selection technique to improve sentiment classification based on machine learning approaches. First, two customer review datasets, namely Sentiment Labelled and large IMDB, are retrieved and pre-processed. Next, the proposed feature selection technique, which is a hybridization of Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE), is developed and tested on these two datasets. TF-IDF measures the importance of each feature. SVM-RFE iteratively evaluates and ranks the features. For sentiment classification, only the top-k features from the ranked list are used. Finally, a Support Vector Machine (SVM) classifier is deployed to observe the performance of the proposed technique. Performance is measured using accuracy, precision, recall, and F-measure. The experimental results show promising performance, with 84.54% to 89.56% across these measurements, especially on the large IMDB dataset. The results also outperform other related techniques on certain datasets. In addition, the proposed technique is able to reduce the number of features to be classified by 19.25% to 70.5%. This reduction rate is significant for making good use of computational resources while maintaining classification performance.
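The four measures named above, together with a feature reduction rate, can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the function names, the toy labels, and the feature counts (2000 and 590) are hypothetical, and the reduction rate is assumed to mean the percentage of original features that were removed.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-measure for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


def reduction_rate(n_original, n_selected):
    """Assumed definition: percentage of the original features that were dropped."""
    return 100.0 * (n_original - n_selected) / n_original


# Hypothetical toy labels (1 = positive review, 0 = negative review).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)

# Hypothetical feature counts, chosen only to illustrate the formula.
rate = reduction_rate(2000, 590)
print(rate)
```

Reporting all four measures alongside the reduction rate makes it visible whether dropping features costs any classification quality.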
Text documents are unstructured and high dimensional. Effective feature selection is required to select the most important and significant features from the sparse feature space. Thus, this paper proposes an embedded feature selection technique based on Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for unstructured and high-dimensional text classification. This technique is able to measure feature importance in a high-dimensional text document. In addition, it aims to increase the efficiency of feature selection and thereby obtain promising text classification accuracy. In the first stage, TF-IDF acts as a filter approach that measures the importance of each feature in the text documents. In the second stage, SVM-RFE uses a backward feature elimination scheme to recursively remove insignificant features from the filtered feature subset. This research carries out a set of experiments using a text dataset retrieved from a benchmark repository, comprising a collection of Twitter posts. Pre-processing is applied to extract relevant features. After that, the pre-processed features are divided into training and testing datasets. Next, feature selection is performed on the training dataset by calculating the TF-IDF score for each feature. SVM-RFE is then applied for feature ranking as the next feature selection step. Only the top-ranked features are selected for text classification using the SVM classifier. The experiments show that the proposed technique is able to achieve 98% accuracy, outperforming other existing techniques. In conclusion, the proposed technique is able to select the significant features in unstructured and high-dimensional text documents.
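The two-stage scheme described above (a TF-IDF filter followed by SVM-RFE backward elimination) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not give the filter criterion, so stage 1 is assumed to keep the terms with the highest mean TF-IDF weight; the IDF variant, the subgradient-descent SVM trainer, the squared-weight ranking criterion, all function names, and the toy documents are likewise assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    n_docs = len(docs)
    # Assumption: idf = ln(N / df) + 1, one of several common variants.
    idf = {tok: math.log(n_docs / doc_freq[tok]) + 1.0 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        length = max(len(toks), 1)
        rows.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, rows


def tfidf_filter(vocab, rows, keep):
    """Stage 1 (assumed criterion): keep the `keep` terms with the highest mean TF-IDF."""
    means = [sum(row[j] for row in rows) / len(rows) for j in range(len(vocab))]
    ranked = sorted(range(len(vocab)), key=lambda j: means[j], reverse=True)
    kept = sorted(ranked[:keep])
    return [vocab[j] for j in kept], [[row[j] for j in kept] for row in rows]


def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularised hinge loss; labels must be -1 or +1."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            violated = margin < 1
            for j in range(n_features):
                grad = lam * w[j] - (yi * xi[j] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * yi
    return w, b


def svm_rfe(X, y, names, k):
    """Stage 2: retrain, drop the feature with the smallest squared weight, repeat."""
    active = list(range(len(names)))
    while len(active) > k:
        X_active = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_active, y)
        worst = min(range(len(active)), key=lambda i: w[i] ** 2)
        del active[worst]
    return [names[j] for j in active]


# Hypothetical toy corpus; +1 = positive, -1 = negative.
docs = ["good great movie", "great fun film", "bad boring movie", "boring awful film"]
labels = [1, 1, -1, -1]

vocab, rows = tfidf_matrix(docs)
filtered_vocab, filtered_rows = tfidf_filter(vocab, rows, keep=6)
selected = svm_rfe(filtered_rows, labels, filtered_vocab, k=3)
print(selected)
```

Removing one feature per round keeps the ranking fine-grained at the cost of retraining the SVM many times; on a real vocabulary, removing a batch of features per round is a common way to trade ranking resolution for speed.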