Sentiment Classification of Crowdsourcing Participants’ Reviews Text Based on LDA Topic Model

Huang, Yanrong; Wang, Rui; Huang, Bin; Wei, Bo; Zheng, Shu Li; Chen, Min

doi:10.1109/access.2021.3101565

Cited by 24 publications

(16 citation statements)

References 14 publications

“…Proposed by Blei et al [ 19 ], LDA is a typical “bag of words” model that treats each text as a vocabulary frequency vector and as a collection of multiple sets of vocabularies. In addition, each group of vocabularies represents a topic, and text topics are extracted without considering the order of and relevance between the vocabularies [ 37 , 38 ]. Normally, an LDA builds its topic generation model through the following steps: (1) a topic is selected from the various topics in a text; (2) a vocabulary is chosen from the list of vocabularies corresponding to the topic selected; and (3) the process is repeated until all of the vocabulary in the text has been selected.…”
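The three generative steps described in this statement can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' implementation: the topic names, vocabularies, and probabilities are invented for the example, and the document's topic mixture is fixed by hand, whereas full LDA draws it from a Dirichlet prior and infers the distributions from data.

```python
import random


def generate_document(topic_word_probs, doc_topic_probs, length, rng):
    """Generate one bag-of-words document with the three steps above."""
    topics = list(doc_topic_probs)
    topic_weights = [doc_topic_probs[t] for t in topics]
    words = []
    # Step 3: repeat until every word slot in the document is filled.
    for _ in range(length):
        # Step 1: select a topic from the document's topic mixture.
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        # Step 2: select a word from that topic's word distribution.
        vocab = list(topic_word_probs[topic])
        word_weights = [topic_word_probs[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


# Hypothetical topics and probabilities, chosen only for illustration.
TOPIC_WORD_PROBS = {
    "payment": {"reward": 0.5, "fee": 0.3, "delay": 0.2},
    "task": {"deadline": 0.4, "design": 0.4, "delay": 0.2},
}
DOC_TOPIC_PROBS = {"payment": 0.7, "task": 0.3}

doc = generate_document(TOPIC_WORD_PROBS, DOC_TOPIC_PROBS, 8, random.Random(0))
print(doc)
```

Word order in the output carries no meaning, which is the "bag of words" assumption the statement refers to. Topic extraction runs this process in reverse, estimating the two probability tables from observed word counts.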

Section: Methods; citation type: mentioning

confidence: 99%

“…MARS is a multivariate, nonparametric regression technique and a tool that accumulates several basis functions to explain nonlinear states [ 57 ]. Once objective variables are set and a set that contains selectable predictor variables is given, MARS can automate the entire model construction process, including separating meaningful and less appropriate variables, determining the interactions between predictor variables, dealing with the missing value problem by using variable clustering techniques, and avoiding overfitting by using numerous self-tests [ 38 , 58 ].…”

Section: Methods; citation type: mentioning

confidence: 99%

Predicting the Mortality of ICU Patients by Topic Model with Machine-Learning Techniques

Chiu, Chien, et al. 2022. Healthcare

Predicting clinical patients’ vital signs is a critical issue in intensive care unit (ICU) studies. Early prediction of the mortality of ICU patients can reduce the overall mortality and cost of complication treatment. Some studies have predicted mortality based on electronic health record (EHR) data by using machine learning models. However, the semi-structured data (i.e., patients’ diagnosis data and inspection reports) is rarely used in these models. This study utilized data from the Medical Information Mart for Intensive Care III. We used a Latent Dirichlet Allocation (LDA) model to classify text in the semi-structured data into particular topics, and we built and compared five models: classification and regression trees (CART), logistic regression (LR), multivariate adaptive regression splines (MARS), random forest (RF), and gradient boosting (GB). A total of 46,520 ICU patients were included, with 11.5% mortality in the Medical Information Mart for Intensive Care III group. Our results revealed that the semi-structured data (diagnosis data and inspection reports) of ICU patients contain useful information that can assist clinical doctors in making critical clinical decisions. In addition, in our comparison of the five machine learning models, the GB model showed the best performance, with the highest area under the receiver operating characteristic curve (AUROC) (0.9280), specificity (93.16%), and sensitivity (83.25%). The RF, LR, and MARS models (AUROCs of 0.9096, 0.8987, and 0.8935, respectively) performed better than CART (0.8511). The GB model therefore outperformed the other machine learning models (CART, LR, MARS, and RF) in predicting the mortality of patients in the intensive care unit. The analysis results could be used to develop a clinically useful decision support system.

“…The OCR technique is used for preprocessing to extract the text contained in images [21]. In addition, the latent Dirichlet allocation technique, LDA is used to extract topic words from the text [22][23][24][25]. Each extracted topic word is converted into an embedding vector using a pretrained word-embedding model.…”

Section: Sub-model Based On Topic (Topic Sub-model); citation type: mentioning

confidence: 99%

Hybrid Features by Combining Visual and Text Information to Improve Spam Filtering Performance

et al. 2022

The development of information and communication technology has created many positive outcomes, including convenience for people; however, cases of unsolicited communication, such as spam, also occur frequently. Spam is the indiscriminate transmission of unwanted information by anonymous users, called spammers. Spam content is indiscriminately transmitted to users in various forms, such as SMS, e-mail, and social network service posts, causing negative experiences for users of the service, while also creating costs, such as unnecessarily large amounts of network traffic. In addition, spam content includes phishing, hype or false advertising, and illegal content. Recently, spammers have also used images that contain stimulating content to effectively attract users’ curiosity and attention. Image spam contains more complex information than text, making it more difficult to analyze and to generalize its properties compared to text. Therefore, existing text-based spam detectors are vulnerable to spam image attacks, resulting in a decline in service quality. In this paper, a “hybrid features by combining visual and text information to improve spam filtering performance” method is proposed to reduce the occurrence of misclassification. The proposed method employs three sub-models to extract features from spam images and a classifier model to output the results using the features. Each sub-model extracts topic-, word-, and image-embedding-based features from spam images. In addition, the sub-models use optical character recognition, latent Dirichlet allocation, and word2Vec techniques to extract features from images. To evaluate spam image classification performance, the spam classifiers were trained using the extracted features and the results were measured using a confusion matrix. Our model achieved an accuracy of 0.9814 and a macro-F1 score of 0.9813. In addition, the application of OCR evasion techniques resulted in a decrease in recognition performance. Using the proposed model, a mean macro-F1 score of 0.9607 was obtained.
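For readers unfamiliar with the two metrics reported in this abstract, the sketch below shows how accuracy and macro-F1 are conventionally derived from a confusion matrix. It is a generic illustration, not the cited authors' evaluation code, and the counts in `EXAMPLE_CONFUSION` are made up for the example rather than taken from the paper.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    return correct / total


def macro_f1(confusion):
    """Unweighted mean of per-class F1 scores.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    f1_scores = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, true other
        fn = sum(confusion[k]) - tp                       # true k, predicted other
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n


# Hypothetical counts for a two-class (normal vs. spam) problem.
EXAMPLE_CONFUSION = [
    [90, 10],
    [5, 95],
]

print(accuracy(EXAMPLE_CONFUSION))
print(macro_f1(EXAMPLE_CONFUSION))
```

Macro-F1 weights every class equally regardless of how many samples it has, which is why it is often reported alongside accuracy when class sizes differ.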

“…The conclusion shows that the combination of text encoder TF-IDF and support vector machine classifier with linear kernel achieved the best performance results. Huang et al [11] constructed a text classifier based on a support vector machine (SVM), random forest (RF), XGBoost, and GBDT algorithm and analyzed the comment text of crowdsourcing platform participants. The results showed that the accuracy of the GBDT text emotion classifier was better than the method.…”

Section: Related Research; citation type: mentioning

confidence: 99%

“…A large number of consumer reviews are generated on the online ordering platform. Text reviews contain rich semantic content, such as consumers' experiences, feelings, and preferences, which are important data for feedback on the food safety of online ordering [10,11]. At present, review text mining has been widely used in the commercial field to improve the quality of products and services.…”

Section: Introduction; citation type: mentioning

confidence: 99%

Analysis and Recognition of Food Safety Problems in Online Ordering Based on Reviews Text Mining

Huang, Wang, et al. 2022. Wireless Communications and Mobile Computing

Self Cite

In the era of big data, the online ordering form of “Internet + traditional catering” has adapted to the needs of consumers with a fast pace of life and personalized consumption mode and is booming all over the world. However, due to consumer information asymmetry and the lack of effective supervision, potential food safety problems are becoming increasingly prominent. This paper uses social network analysis and the Latent Dirichlet Allocation method to mine the text data of consumer comments on the online ordering platform and identifies five food safety problems on the platform. Then, text features are extracted by using BERT, TF-IDF, Word2vec, and N-gram algorithms, and classifiers based on GBDT, XGBoost, LSTM, BiLSTM, CNN, RNN, and CRNN algorithms are cross-constructed to identify text reviews with potential food safety hazards. The classifiers’ performance is compared and evaluated through ten-fold cross-validation, the Friedman test, and confusion matrices. The research results show that the BERT-GBDT classifier has the best accuracy, precision, specificity, and F1 measure, and is the most stable. It is also the most effective at distinguishing review text with potential food safety hazards.
