A Case Study in Text Mining of Discussion Forum Posts: Classification with Bag of Words and Global Vectors

Cichosz, Paweł

doi:10.2478/amcs-2018-0060

Cited by 18 publications

(22 citation statements)

References 42 publications


“…RF is a popular ensemble modelling algorithm that achieves excellent predictive performance by combining multiple models from the same domain [26]. An RF is represented by a set of unpruned DTs that are grown based on multiple bootstrap samples that are drawn (with replacements) from the training set via randomised split selection. RF is a rapid and accurate technique employed for document categorisation and text classification.…”

Section: Random Forest (mentioning)

confidence: 99%

Classification of a COVID-19 dataset by using labels created from clustering algorithms

Rafea

Ahmed

Abdullah

2021

IJEECS


Novel coronavirus (COVID-19) is a newly discovered infectious disease that has received much attention in the literature because of its rapid spread and daily global deaths attributable to such disease. The White House, together with a coalition of leading research groups, has published the freely available COVID-19 Open Research Dataset to help the global research community apply the recent advances in natural language processing and other AI techniques in generating novel insights that can support the ongoing fight against this disease. In this paper, the hierarchical and k-means clustering techniques are used to create a tool for identifying similar articles on COVID-19 and filtering them based on their titles. These articles are classified by applying three data mining techniques, namely, random forest (RF), decision tree (DT) and bagging. By using this tool, specialists can limit the number of articles they need to study and pre-process these articles via data framing, tokenisation, normalisation and term frequency-inverse document frequency. Given its 2D nature, the dimensionality of this dataset is reduced by applying t-SNE. The aforementioned data mining techniques are then cross validated to test the accuracy, precision and recall performance of the proposed tool. Results show that the proposed tool effectively extracts the keywords for each cluster, with RF, DT and bagging achieving optimal accuracies of 98.267%, 97.633% and 97.833%, respectively.


“…Many machine learning algorithms, including logistic regression (LR), naïve Bayes (NB), support vector machine (SVM), K-nearest neighbor (KNN) and ensemble classifiers (such as bagging and random forest (RF)), have been widely used in text classification studies (Sebastiani, 2002; Liu et al., 2017; Sharmin and Zaman, 2017; Cichosz, 2018; Gravanis et al., 2019). For example, a total of 2000 teachers' posts were collected and coded for constructing six-class classification models based on NB and SVM to classify the teachers' reflective thinking in the online learning environment (Liu et al., 2017).…”

Section: Literature Review (mentioning)

confidence: 99%

Analyzing online discussion data for understanding the student's critical thinking

Yang

Hung

et al. 2021

DTA


Purpose: Critical thinking is considered important in psychological science because it enables students to make effective decisions and optimizes their performance. To address the challenges of understanding students' critical thinking, this study analyzes online discussion data through an advanced multi-feature fusion modeling (MFFM) approach that automatically and accurately identifies students' critical thinking levels.

Design/methodology/approach: An advanced MFFM approach is proposed in this study. Specifically, considering the time-series characteristic of discussion contents and the high correlations between adjacent words, a long short-term memory–convolutional neural network (LSTM-CNN) architecture is proposed to extract deep semantic features. These semantic features are then combined with linguistic and psychological knowledge generated by the LIWC2015 tool as the inputs of fully connected layers, which predict the students' critical thinking levels that are hidden in online discussion data.

Findings: A series of experiments with 94 students' 7,691 posts was conducted to verify the effectiveness of the proposed approach. The experimental results show that the proposed MFFM approach, which combines two types of textual features, outperforms baseline methods, and that semantic-based padding can further improve the prediction performance of MFFM. It achieves 0.8205 overall accuracy and a 0.6172 F1 score for the “high” category on the validation dataset. Furthermore, the semantic features extracted by LSTM-CNN are more powerful for identifying self-introduction or off-topic discussions, while the linguistic and psychological features better distinguish the discussion posts with the highest critical thinking level.

Originality/value: With the support of the proposed MFFM approach, online teachers can conveniently and effectively understand the interaction quality of online discussions, which can support instructional decision-making to better promote the student's knowledge construction process and improve learning performance.


“…The third type is known as content analysis approach where tweet text is used to detect spam content. The analysis of text start by Bag-of-Words analysis, a popular approach to identify the k-top words in user groups [8]. Alternatively, studies use n-gram character features, unsupervised learning such as LDA and ensemble approach [9].…”

Section: Related Work (mentioning)

confidence: 99%

Clustering as feature selection method in spam classification: uncovering sick-leave sellers

Elhussein

Brahimi

2021

ACI


Purpose: This paper aims to propose a novel way of using textual clustering as a feature selection method. It is applied to identify the most important keywords in profile classification. The method is demonstrated through the problem of sick-leave promoters on Twitter.

Design/methodology/approach: Four machine learning classifiers were used on a total of 35,578 tweets posted on Twitter. The data were manually labeled into two categories: promoter and nonpromoter. Classification performance was compared when the proposed clustering feature selection approach and the standard feature selection were applied.

Findings: Random forest achieved the highest accuracy, 95.91%, which is higher than that of the similar work compared. Furthermore, using clustering as a feature selection method improved the sensitivity of the model from 73.83% to 98.79%. Sensitivity (recall) is the most important measure of classifier performance when detecting promoters' accounts that have spam-like behavior.

Research limitations/implications: The method applied is novel; more testing is needed on other datasets before generalizing its results.

Practical implications: The model applied can be used by Saudi authorities to report on the accounts that sell sick leaves online.

Originality/value: The research proposes a new way in which textual clustering can be used in feature selection.
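The citation statement for this entry describes bag-of-words analysis as a way to identify the top-k words in user groups. A minimal sketch of that idea follows, using only the Python standard library. The function name `top_k_words`, the simple regular-expression tokenizer, and the toy posts are all assumptions made for illustration; none of them come from the cited studies.

```python
import re
from collections import Counter


def top_k_words(texts, k=3):
    """Return the k most frequent lowercase word tokens across texts.

    Word order inside each text is ignored, which is the defining
    property of a bag-of-words representation.
    """
    counts = Counter()
    for text in texts:
        # Hypothetical tokenizer: runs of ASCII letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [word for word, _ in counts.most_common(k)]


# Toy posts grouped by label (invented data, for illustration only).
groups = {
    "promoter": ["sick leave for sale", "buy sick leave today", "sick leave cheap"],
    "nonpromoter": ["feeling sick today", "long day at work today"],
}

# One top-k word list per user group.
top_words = {label: top_k_words(posts, k=2) for label, posts in groups.items()}
```

In practice the tokenizer would usually be replaced by one that handles stop words and the language of the posts, and the per-group word lists would feed a later feature selection or classification step.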

