2020
DOI: 10.1371/journal.pone.0232525
The influence of preprocessing on text classification using a bag-of-words representation

Abstract: Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications. Many of these applications perform preprocessing. There are different types of text preprocessing, e.g., conversion of uppercase letters into lowercase letters, HTML tag removal, stopword removal, punctuation mark removal, lemmatization, correction of common misspelled words, and reduction of replicated characters. We hypothesize that the application…
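The preprocessing steps enumerated in the abstract can be sketched as follows. The stopword list, regular expressions, and function name here are simplified illustrative stand-ins, not the paper's actual pipeline:

```python
import re
import string

def preprocess(text, remove_stopwords=True):
    """Sketch of common text-preprocessing steps from the abstract:
    HTML tag removal, lowercasing, punctuation removal, reduction of
    replicated characters, and optional stopword removal."""
    stopwords = {"the", "a", "an", "is", "of", "and", "to", "in"}  # toy list
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tag removal
    text = text.lower()                        # uppercase -> lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # "soooo" -> "soo" (replicated characters)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

print(preprocess("<b>Soooo good!!!</b> The movie is GREAT and fun."))
# -> ['soo', 'good', 'movie', 'great', 'fun']
```

The `remove_stopwords` flag reflects the paper's central question: whether applying or skipping a given preprocessing step helps or hurts downstream classification.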

Cited by 162 publications (103 citation statements)
References 34 publications
“…In this context, the authors in [40] observed a decrease in the performance of SVM classification models from 70.76% to 55.26% for the task of automatic annotation of clinical text fragments based on codebooks with a large number of categories. Similarly, the authors in [41], [42] also reported underperformance of the employed classification models for text classification as a consequence of removing stopwords. In the case of the DL models, we used BiLSTM layers, which handle long-term dependencies and can store information for a long duration.…”
Section: Stopwords and Their Impact in Text Preprocessing
confidence: 88%
“…Consequently, the tweets need to be preprocessed before analysis so that all irrelevant attributes are removed from the datasets to avoid contradictory results. In this research, we preprocessed all the datasets uniformly at multiple stages, as described in the literature (HaCohen-Kerner, Miller & Yigal, 2020), and obtained improved results. Text preprocessing includes data cleansing by removing unrelated data, including URLs, stop words, smilies, slang, redundant data, and all other irrelevant material.…”
Section: Data Scrubbing and Transformation
confidence: 99%
“…To model a mass spectrum using LLDA, it is necessary to represent a mass spectrum as a bag-of-words "document" [23]. First, any fragment having a mass-to-charge ratio (m/z) below 30 is discarded to remove structurally uninformative fragments.…”
Section: Data Preprocessing
confidence: 99%
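The m/z filtering and bag-of-words tokenization described in the excerpt above can be sketched as follows. The peak list, the `mz_` token format, and the function name are hypothetical illustrations, not the cited work's actual encoding:

```python
# Hypothetical peak list: (m/z, intensity) pairs, purely for illustration.
spectrum = [(18.0, 120.0), (29.0, 45.0), (43.0, 900.0), (57.0, 300.0), (91.0, 650.0)]

def spectrum_to_words(peaks, min_mz=30.0):
    """Discard structurally uninformative fragments below min_mz, then turn
    the remaining peaks into bag-of-words tokens (here: integer m/z bins)."""
    kept = [(mz, inten) for mz, inten in peaks if mz >= min_mz]
    return [f"mz_{round(mz)}" for mz, _ in kept]

print(spectrum_to_words(spectrum))
# -> ['mz_43', 'mz_57', 'mz_91']
```

Binning m/z values into discrete "words" is one simple way to make a continuous spectrum compatible with bag-of-words topic models such as LLDA.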