2021
DOI: 10.1007/s40009-021-01043-0

Feature Selection for Text Classification Using Machine Learning Approaches

Cited by 20 publications (12 citation statements)
References 12 publications
“…The paired-input nonlinear knockoff filter has been combined with an MLP in [27]; random forest [28] and Naive Bayes and SVM classifiers [29] are other algorithms used in recent studies of FS problems.…”
Section: Literature Review (mentioning; confidence: 99%)
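The statement above surveys classifier-driven feature selection (FS) for text. As a hedged illustration of the kind of filter-style FS these studies pair with Naive Bayes and SVM classifiers, here is a minimal sketch using scikit-learn's chi-squared filter; the dataset, the k=1000 feature budget, and the pipeline layout are illustrative assumptions, not the setup of any cited paper.

```python
# Illustrative sketch: chi-squared feature selection before Naive Bayes / SVM.
# Dataset choice, k=1000, and the 80/20 split are assumptions for demonstration.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(sublinear_tf=True)),  # raw text -> sparse term weights
        ("select", SelectKBest(chi2, k=1000)),          # keep 1000 most class-dependent terms
        ("clf", clf),
    ])
    pipe.fit(X_train, y_train)
    print(name, accuracy_score(y_test, pipe.predict(X_test)))
```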
“…As early as the 1970s, Salton et al. [13] proposed the vector space model (VSM), which was successfully applied in the famous SMART system. Over the following 50 years, text classification was based mainly on shallow learning models, for example naive Bayes, K-nearest neighbor, and support vector machine methods [14][15][16]. Although these methods improved accuracy, they all rely on complex feature engineering and do not take the semantic information of the text into account.…”
Section: Related Work (mentioning; confidence: 99%)
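As a hedged illustration of the shallow, VSM-based pipeline this excerpt describes, the sketch below represents documents as TF-IDF vectors (a standard VSM weighting) and classifies with k-nearest neighbors over cosine distance; the toy corpus, labels, and k=3 are assumptions for demonstration, not data from the cited works.

```python
# Minimal VSM sketch: documents become TF-IDF term-weight vectors, then a
# shallow k-nearest-neighbor classifier votes over cosine distance.
# The toy corpus, labels, and k=3 are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = [
    "stock market rallies on earnings",
    "central bank raises interest rates",
    "team wins championship final",
    "star striker scores winning goal",
]
labels = ["finance", "finance", "sports", "sports"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # documents as term-weight vectors (the VSM)

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X, labels)

print(knn.predict(vec.transform(["goalkeeper saves penalty in final"])))
```

Note how nothing here models word meaning: two documents with disjoint vocabularies score zero similarity, which is exactly the missing-semantics limitation the excerpt points out.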
“…As Camastra and Vinciarelli note [35], using more features than strictly necessary leads to several problems, one of the main ones being the space needed to store the data. As the amount of available information increases, compression for storage becomes even more critical [12, 36, 37]. Additionally, for the scope of this work, it cannot be ignored that applying dimensionality reduction to pre-computed embedding dimensions improves neither the runtime nor the memory requirements of running the models.…”
Section: Related Work (mentioning; confidence: 99%)
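To make the storage argument concrete, here is a minimal sketch of compressing pre-computed embeddings with PCA, one common dimensionality reduction choice (the excerpt does not name a specific technique). The 768-to-128 dimensions and the random matrix standing in for real embeddings are assumptions; note that, per the excerpt, this shrinks what is stored on disk but not the cost of running the model that produced the embeddings.

```python
# Sketch: compressing pre-computed embeddings with PCA for cheaper storage.
# The 768 -> 128 dimensions and the random "embeddings" are assumptions;
# in practice the matrix would come from a sentence-embedding model.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 768)).astype(np.float32)

pca = PCA(n_components=128)
reduced = pca.fit_transform(embeddings).astype(np.float32)

print(f"original: {embeddings.nbytes / 1e6:.1f} MB")  # ~30.7 MB
print(f"reduced:  {reduced.nbytes / 1e6:.1f} MB")     # ~5.1 MB
print(f"variance kept: {pca.explained_variance_ratio_.sum():.2%}")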
“…Similarly, dimensionality reduction techniques are also of interest in Semantic Similarity [27, 37]. As discussed previously, in the Semantic Similarity task the linear O(N) complexity of cosine similarity is one of the reasons why this distance metric is widely used in the community and in this study.…”
Section: Related Work (mentioning; confidence: 99%)
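The O(N) claim in this excerpt refers to the vector dimension N: computing cosine similarity takes a single pass over the components of each vector. A minimal sketch, assuming 384-dimensional vectors purely for illustration:

```python
# Cosine similarity between two N-dimensional vectors: one pass over the
# components, hence the linear O(N) cost the excerpt refers to.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # the dot product and both norms are each O(N)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
u, v = rng.normal(size=384), rng.normal(size=384)  # 384 dims is an assumption
print(cosine_similarity(u, v))
```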