2019
DOI: 10.1371/journal.pone.0220976

Word2vec convolutional neural networks for classification of news articles and tweets

Abstract: Big web data from sources such as online news and Twitter are good resources for investigating deep learning. However, collected news articles and tweets almost certainly contain data that is unnecessary for learning and that hinders accurate learning. This paper explores the performance of word2vec Convolutional Neural Networks (CNNs) in classifying news articles and tweets as related or unrelated. Using the two word-embedding algorithms of word2vec, Continuous Bag-of-Words (CBOW) and Skip-gram, we constructed CN…
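The abstract outlines the general word2vec-CNN pipeline: embed tokens with CBOW or Skip-gram, then convolve over the embedded sequence to classify documents as related or unrelated. Below is a minimal sketch of that pipeline, assuming gensim and PyTorch; the toy documents, labels, and hyperparameters are hypothetical and not the authors' setup.

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# Hypothetical toy corpus standing in for crawled news/tweets;
# label 1 = "related", 0 = "unrelated".
docs = [
    ["fed", "raises", "interest", "rates", "again"],
    ["team", "wins", "championship", "game"],
    ["stocks", "fall", "after", "rate", "decision"],
    ["new", "movie", "breaks", "box", "office", "record"],
]
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

EMBED_DIM, MAX_LEN = 50, 8

# Train word2vec embeddings; sg=1 selects Skip-gram, sg=0 would be CBOW.
w2v = Word2Vec(docs, vector_size=EMBED_DIM, window=2, min_count=1, sg=1)

vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}  # 0 = padding
emb = np.zeros((len(vocab) + 1, EMBED_DIM), dtype=np.float32)
for w, i in vocab.items():
    emb[i] = w2v.wv[w]

def encode(doc):
    ids = [vocab[w] for w in doc][:MAX_LEN]
    return ids + [0] * (MAX_LEN - len(ids))  # right-pad to MAX_LEN

X = torch.tensor([encode(d) for d in docs])

class W2VCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Frozen pretrained word2vec embeddings feed a 1-D convolution.
        self.embed = nn.Embedding.from_pretrained(torch.from_numpy(emb),
                                                  freeze=True)
        self.conv = nn.Conv1d(EMBED_DIM, 32, kernel_size=3)
        self.fc = nn.Linear(32, 1)

    def forward(self, x):
        e = self.embed(x).permute(0, 2, 1)               # (batch, dim, seq)
        h = torch.relu(self.conv(e)).max(dim=2).values   # global max-pool
        return self.fc(h).squeeze(1)                     # one logit per doc

model = W2VCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(20):  # tiny illustrative training loop
    opt.zero_grad()
    loss = loss_fn(model(X), labels)
    loss.backward()
    opt.step()
```

Freezing the pretrained embeddings (freeze=True) keeps the word2vec geometry fixed while the convolutional filters learn; fine-tuning them instead is the usual alternative.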

Cited by 139 publications (70 citation statements)
References 48 publications
“…The SG model was considered in this study based on its suitability for small- to medium-sized datasets. Jang et al [36] stated that the SG model is advantageous over CBOW when the data size is not too large.…”
Section: Methods
confidence: 99%
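For readers who want to see the CBOW/Skip-gram choice this citing paper refers to, gensim exposes it as a single flag; the sentences below are made-up stand-ins, so this only illustrates the mechanics, not a result.

```python
from gensim.models import Word2Vec

sentences = [["breaking", "news", "today"],
             ["tweet", "about", "news"],
             ["sports", "score", "update"]]

# sg=1 selects Skip-gram, the variant the citing paper favors for
# small- to medium-sized corpora; sg=0 (the default) selects CBOW.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv.most_similar("news", topn=2))
```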
“…On the other hand, many Natural Language Processing (NLP) studies with deep learning models have included learning word vector representations. The word vectors are represented in a dense form known as word embedding, in which words that are semantically and syntactically related are close to each other in the embedding space [13, 38–40]. Word embedding has been used efficiently in many NLP tasks [41].…”
Section: Related Work
confidence: 99%
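The "close in the embedding space" property this statement describes is usually measured with cosine similarity. A small sketch follows, again with gensim and an invented corpus (far too tiny for the similarities to be meaningful, but enough to show the mechanics):

```python
import numpy as np
from gensim.models import Word2Vec

corpus = [["king", "rules", "the", "kingdom"],
          ["queen", "rules", "the", "kingdom"],
          ["dog", "chases", "the", "ball"]]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, seed=1)

def cosine(a, b):
    # Cosine similarity: near 1.0 means nearby directions in embedding space.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words sharing contexts ("king"/"queen") should end up closer than
# unrelated pairs ("king"/"ball") once trained on a realistic corpus.
print(cosine(model.wv["king"], model.wv["queen"]))
print(cosine(model.wv["king"], model.wv["ball"]))
```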
“…As previously mentioned, it was proven by Mikolov et al [63] that the performance of supervised methods, which rely on gold annotated treebanks, decays dramatically when applied to other domains or other languages. Mikolov et al [63] also mentioned that the generation of distributed representations of textual units would be adopted in NLP owing to the improvements they provide to different NLP tasks (e.g., the text classification task by Jang et al [64]).…”
Section: PLOS ONE
confidence: 99%