A tweet grouping methodology utilizing inter and intra cosine similarity

Kaur, Navneet; Gelowitz, Craig M.

doi:10.1109/ccece.2015.7129370

Cited by 7 publications

(4 citation statements)

References 4 publications


“…Other clustering algorithms, such as hierarchical or agglomerative, were tested for similar tasks by Kaur (2015) [24] and Miyamoto et al (2012) [25]. However, these approaches counted occurrences of specific nouns or word sequences.…”

Section: Related Work (mentioning)

confidence: 99%

Natural Language Processing-based Method for Clustering and Analysis of Movie Reviews and Classification by Genre

González, Torres-Ruiz, Rivera-Torruco et al. 2023

Preprint


The large quantity of information retrieved from communities, public data repositories, web pages, or data mining can be sparse and poorly classified. This work shows how to employ unsupervised classification algorithms such as K-means to classify user reviews into their closest category, forming a balanced data set. Moreover, comparing TF-IDF and Word2Vec, we found that the text vectorization technique significantly impacts cluster formation. A cluster could be mapped to a movie genre in 81.34% ± 20.48 of the cases when TF-IDF was applied, whereas Word2Vec yielded only 53.51% ± 24.1. In addition, we highlight the impact of stop-word removal. We found that pre-compiled lists are not the best method to remove stop-words before clustering because there is much ambiguity, centroids are poorly separated, and only 57% of clusters could be matched to a movie genre. Our proposed approach achieved 94% accuracy. After analyzing the classifiers' results, we observed a similar effect when the results were divided by stop-word removal method. Statistically significant changes were observed, especially in precision and Jaccard scores in both classifiers, when using custom-generated stop lists rather than pre-compiled ones. Reclassifying sparse data is strongly recommended, as is using custom-generated stop lists.



“…tweet dataset to see how the choice of distance function affects the behavior of hierarchical clustering algorithms. An integrated hierarchical approach combining agglomerative and divisive clustering has been proposed for dynamically creating broad categories of similar tweets based on the occurrence of nouns [16].…”

Section: (477) JEL: L86 Tumanov O. O. Cluster-Analyzing the Use and S… (unclassified)

Cluster-Analyzing the Use and Spread of Internet Technologies in the Regions of Ukraine

Tumanov 2020


Tumanov O. O. Cluster-Analyzing the Use and Spread of Internet Technologies in the Regions of Ukraine. Over recent decades, the development and spread of Internet technologies have gained tremendous momentum. The use of mobile Internet has significantly accelerated this process. People no longer need to stay at home or in the office to be online, and some have even moved their work entirely into the online environment. Social networks, blogs, and other media are among the important elements of this environment. Social media quickly gained popularity because they let people communicate and share opinions. Automated data analysis is of great importance for obtaining meaningful information needed by potential businesses, users, and consumers. To better study the use of social media, one first needs to focus on the general approach and find reliable indicators. These indicators can be data on information and communication technologies (ICT), which now affect every aspect of human life. They play a significant role in the workplace, business, education, and entertainment. This article includes an overview of the algorithms of common clustering methods and references to studies from recent years that used the corresponding algorithms: 1) partition-based; 2) hierarchy-based; 3) hybrid; and 4) density-based. The use and spread of Internet technologies in the regions of Ukraine are investigated. The information base of the study consists of indicators of the ICT infrastructure available in the regions of Ukraine in 2018. Based on data on Internet use in the regions of Ukraine, a cluster analysis was carried out and a visualization of the distribution into the resulting groups is provided.


“…These fields have numerous publications, however there is little research on summarization and representation of data on Twitter [3] [4] [5]. The publications about tweet clustering either miss semantic relations between tweets or are not suitable for big data.…”

Section: Motivation (mentioning)

confidence: 99%

Tweets on a tree: Index-based clustering of tweets

Erpam 2017


Computer-mediated communication (CMC) is a type of communication that occurs through the use of two or more electronic devices. With the advancement of technology, CMC has become a more preferred type of communication between humans. Through computer-mediated technologies, news portals, search engines, and social media platforms such as Facebook, Twitter, Reddit, and many others have been created. On social media platforms, a user can post and discuss his/her own opinion and also read and share other users' opinions. This generates a significant amount of data which, if filtered and analyzed, can give researchers important insights about public opinion and culture. Twitter is a social networking service that was founded in 2006 and became widespread throughout the world in a very short time frame. As of 2016, the service has more than 310 million monthly active users, who generate more than 500 million tweets daily. Due to the volume, velocity, and variety of Twitter data, it cannot be analyzed using conventional methods. A clustering or sampling method is necessary to reduce the amount of data for analysis. To cluster documents, in a very broad sense two similarity measures can be used: lexical similarity and semantic similarity. Lexical similarity looks for syntactic similarity between documents. It is usually computationally light to compute; however, for clustering purposes it may not be very accurate, as it disregards the semantic value of words. Semantic similarity, on the other hand, uses the semantic value of words and the relations between them to calculate similarity. While it is generally more accurate than lexical similarity, it is computationally difficult to calculate. In our work we aim to create computationally light and accurate clustering of short documents which have the characteristics of big data. We propose a hybrid approach to clustering in which lexical and semantic similarity are combined. In our approach, we use string similarity to create clusters and semantic vector representations of words to interactively merge clusters.
Keywords: clustering, twitter, summarization, suffix tree, semantic relatedness, data mining
This technical report is based on the thesis "Tweets on a tree: Index-based clustering of tweets".
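The lexical side of the distinction drawn in this abstract can be illustrated with a minimal sketch. It is not the method of any of the papers listed here: the whitespace tokenizer, the greedy single-pass grouping, the 0.5 threshold, and the names `tokenize`, `cosine_similarity`, and `greedy_cluster` are all assumptions made for illustration. The sketch computes cosine similarity over raw token counts, which is cheap but ignores word meaning, as the abstract notes for lexical measures in general.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def greedy_cluster(texts, threshold=0.5):
    """Put each text in the first cluster whose seed text is similar enough,
    otherwise start a new cluster. The threshold is an illustrative guess."""
    clusters = []  # list of (seed_vector, member_texts) pairs
    for text in texts:
        vec = Counter(tokenize(text))
        for seed, members in clusters:
            if cosine_similarity(seed, vec) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((vec, [text]))
    return [members for _, members in clusters]


# Hypothetical example tweets, not taken from any dataset in the papers.
tweets = [
    "new phone battery lasts all day",
    "phone battery lasts all day long",
    "rain expected in the city tonight",
]
groups = greedy_cluster(tweets)
```

Here the first two tweets share five of their six tokens and end up in one group, while the third shares none and forms its own. A semantic measure would instead compare word meanings, for example through word vectors, which this sketch does not attempt.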

