A Two-Stepped Feature Engineering Process for Topic Modeling using Batchwise LDA with Stochastic Variational Inference Model

Kokatnoor, Sujatha Arun; Christ,; Balachandran, K.

doi:10.22266/ijies2020.0831.29

Cited by 3 publications

(5 citation statements)

References 0 publications


“…Each * provided by the medical professionals specified an extra weightage for the consolidated point mentioned, where * has least weightage and ***** has highest weightage, respectively. Feature Engineering through TF-IDF+forward scan trigrams [5] and removal of weak features through Feature Hashing has helped improve the model's performance by 12% in terms of coherence scores. The coherence score was used in the experimentation process for assessing the quality of the identified topics.…”

Section: And Discussion (mentioning)

confidence: 99%

“…After extracting the tweets from Twitter, natural language tool kit (NLTK) 3.1 version is used for initial data preprocessing. Then the first level of improvised feature engineering (weighted TF-IDF in combination with Forward Scan Trigrams [5]) is applied to create an efficient VSM. This VSM is input to an enhanced K-means clustering algorithm to yield clusters based on the similarity of the data elements.…”

Section: Methods (mentioning)

confidence: 99%

“…These features can be used to enhance machine learning algorithms. In this proposed work, the tweets extracted from users on how there was an increase in new coronavirus cases between 18th June and 29th June 2020, specific to India, were converted into an efficient VSM using a two-step feature engineering process [5]. Weighted TF-IDF with forward scan trigrams approach was used in the first step [5] and weak features were removed using improvised feature hashing in the second step.…”

Section: Feature Engineering (mentioning)

confidence: 99%

“…Twitter was chosen for the people's opinions as it is one of the popular OSM as per Data Never Sleeps 7.0 report, where people post more than 500000 posts per minute. The collected tweets were pre-processed and improvised feature engineering was applied to create an efficient vector space model (VSM) [5]. The tweets were then clustered using the proposed enhanced K-means clustering algorithm to group the dataset into five different clusters.…”

Section: Introduction (mentioning)

confidence: 99%

“…Feature engineering was applied in two steps. In the first step, forward scan trigrams (FST)-based term frequency and inverse document frequency (TF-IDF) was applied to reduce the high dimensional feature vector into an efficient VSM [5]. In the second step, all the weak features present in the text dataset were removed using the proposed feature hashing method.…”

Section: Introduction (mentioning)

confidence: 99%
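The two steps described in the citation statements above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "forward scan trigrams" means contiguous word trigrams read left to right, it uses plain (unweighted) TF-IDF, and it swaps in the standard hashing trick in place of the paper's improvised feature hashing. The function names, the bucket count, and the toy sentences are all hypothetical.

```python
import hashlib
import math
from collections import Counter


def trigrams(text):
    """Return contiguous word trigrams, scanning left to right (assumed FST)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def tfidf(docs):
    """Plain TF-IDF over trigram features: tf * log(N / df)."""
    grams = [Counter(trigrams(d)) for d in docs]
    # Document frequency: number of documents containing each trigram.
    df = Counter(g for counts in grams for g in counts)
    n = len(docs)
    vectors = []
    for counts in grams:
        total = sum(counts.values()) or 1  # guard documents with < 3 tokens
        vectors.append(
            {g: (c / total) * math.log(n / df[g]) for g, c in counts.items()}
        )
    return vectors


def hash_features(vector, n_buckets=16):
    """Standard hashing trick: fold named features into a fixed-size vector."""
    out = [0.0] * n_buckets
    for gram, weight in vector.items():
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        sign = 1.0 if digest[4] % 2 == 0 else -1.0  # signed hash limits collision bias
        out[index] += sign * weight
    return out


# Toy documents, invented for illustration only.
docs = [
    "people ignored social distancing rules",
    "people ignored mask rules in markets",
    "markets reopened without social distancing rules",
]
weighted = tfidf(docs)
hashed = [hash_features(v) for v in weighted]
```

Each entry of `hashed` is a fixed-length vector regardless of vocabulary size, which is the dimensionality reduction the second step aims at. How the cited work decides which features count as weak is not stated in these excerpts, so that part is left out.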


Root cause analysis of COVID-19 cases by enhanced text mining process

Kokatnoor, Balachandran

2022

IJECE

Self Cite


The main focus of this research is to find the reasons behind the fresh cases of COVID-19 from the public’s perception for data specific to India. The analysis is done using machine learning approaches and validating the inferences with medical professionals. The data processing and analysis is accomplished in three steps. First, the dimensionality of the vector space model (VSM) is reduced with an improvised feature engineering (FE) process by using a weighted term frequency-inverse document frequency (TF-IDF) and forward scan trigrams (FST), followed by removal of weak features using a feature hashing technique. In the second step, an enhanced K-means clustering algorithm is used for grouping, based on the public posts from Twitter®. In the last step, latent Dirichlet allocation (LDA) is applied for discovering the trigram topics relevant to the reasons behind the increase of fresh COVID-19 cases. The enhanced K-means clustering improved the Dunn index value by 18.11% when compared with the traditional K-means method. By incorporating the improvised two-step FE process, the LDA model improved by 14% in terms of coherence score and by 19% and 15% when compared with latent semantic analysis (LSA) and hierarchical Dirichlet process (HDP) respectively, thereby resulting in 14 root causes for the spike in the disease.
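The abstract reports clustering quality through the Dunn index. As a hedged illustration of that metric, the sketch below computes one common variant: the smallest distance between points in different clusters divided by the largest within-cluster diameter, using Euclidean distance. The abstract does not say which variant or distance the authors used, so both are assumptions, and the function name and the toy points are hypothetical.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    # Diameter: largest pairwise distance inside any single cluster.
    max_diameter = max(
        (math.dist(a, b) for pts in clusters for a, b in combinations(pts, 2)),
        default=0.0,
    )
    # Separation: smallest distance between points of two different clusters.
    min_separation = min(
        math.dist(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )
    # Note: undefined (division by zero) if every cluster is a single point.
    return min_separation / max_diameter


# Toy 2-D points, invented for illustration only.
tight = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
loose = [[(0.0, 0.0), (0.0, 3.0)], [(4.0, 0.0), (4.0, 3.0)]]
```

Here `dunn_index(tight)` is larger than `dunn_index(loose)`: compact, well-separated clusters score higher, which is why an increase in the index is read as an improvement in clustering.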


Online English Education Web Page Analysis System on Account of SVM+LDA

2022

2022 International Conference on Knowledge Engineering and Communication Systems (ICKES)


Text Mining - A Comparative Review of Twitter Sentiments Analysis

Patil, Subil, Nasar et al.

2024

RACSC


Background: Text mining derives information and patterns from textual data. Online social media platforms, which have recently acquired great interest, generate vast text data about human behaviors based on their interactions. This data is generally ambiguous and unstructured. The data includes typing errors and errors in grammar that cause lexical, syntactic, and semantic uncertainties. This results in incorrect pattern detection and analysis. Researchers are employing various text mining techniques that can aid in Topic Modeling, the detection of Trending Topics, the identification of Hate Speeches, and the growth of communities in online social media networks.

Objective: This review paper compares the performance of ten machine learning classification techniques on a Twitter data set for analyzing users' sentiments on posts related to airline usage.

Methods: Review and comparative analysis of Gaussian Naive Bayes, Random Forest, Multinomial Naive Bayes, Multinomial Naive Bayes with Bagging, Adaptive Boosting (AdaBoost), Optimized AdaBoost, Support Vector Machine (SVM), Optimized SVM, Logistic Regression, and Long-Short Term Memory (LSTM) for sentiment analysis.

Results: The results of the experimental study showed that the Optimized SVM performed better than the other classifiers, with a training accuracy of 99.73% and testing accuracy of 89.74%.

Conclusion: Optimized SVM uses the RBF kernel function and nonlinear hyperplanes to split the dataset into classes, correctly classifying the dataset into distinct polarity. This, together with Feature Engineering utilizing Forward Trigrams and Weighted TF-IDF, has improved Optimized SVM classifier performance regarding train and test accuracy. Therefore, the train and test accuracy of Optimized SVM are 99.73% and 89.74% respectively. When compared to Random Forest, a marginal performance enhancement of 0.09% and 1.73% is observed in terms of train and test accuracy, and an improvement of 1.29% (train accuracy) and 3.63% (test accuracy) is observed when compared with LSTM. Likewise, Optimized SVM gave more than 10% enhanced performance in terms of train accuracy when compared with Gaussian Naïve Bayes, Multinomial Naïve Bayes, Multinomial Naïve Bayes with Bagging, and Logistic Regression, and a similar enhancement is observed with AdaBoost and Optimized AdaBoost, which are ensemble models, during the experimental process. Optimized SVM also outperformed all the classification models in terms of AUC-ROC train and test scores.
