Document classification using term frequency-inverse document frequency and K-means clustering

Al-Obaydy, Wasseem N. Ibrahem; Hashim, Hala A.; Najm, Yassen AbdelKhaleq; Jalal, Ahmed Adeeb

doi:10.11591/ijeecs.v27.i3.pp1517-1524

Cited by 8 publications

(6 citation statements)

References 31 publications

“…The Porter Stemming algorithm, for instance, employs a series of about 60 rules applied sequentially. Each rule is of the form: (1) where S1 is a suffix to be replaced by S2 if a condition (usually related to the measure of the stem, m) is satisfied. The measure m is calculated as:…”

Section: Analysis and Comparison Of The Existing Text Processing Tech... (mentioning)

confidence: 99%
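
The statement above refers to Porter rules and the measure m without showing them. As a rough sketch, assuming Porter's published definition (a stem is written as [C](VC)^m[V], and a rule has the form "(condition) S1 -> S2"), the measure and a single rule could look like this. The names `is_consonant`, `measure`, and `apply_rule` are hypothetical helpers chosen for illustration, not code from the citing paper.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's convention: a, e, i, o, u are vowels; 'y' is a vowel
    only when the preceding letter is a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    stem = stem.lower()
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_was_vowel:
            m += 1
        prev_was_vowel = not consonant
    return m


def apply_rule(word: str, s1: str, s2: str, min_measure: int = 0) -> str:
    """Apply one rule '(m > min_measure) S1 -> S2' (hypothetical helper)."""
    if word.endswith(s1):
        stem = word[: len(word) - len(s1)]
        if measure(stem) > min_measure:
            return stem + s2
    return word


print(measure("tree"), measure("trouble"), measure("troubles"))  # 0 1 2
print(apply_rule("agreed", "eed", "ee"))  # agree
print(apply_rule("feed", "eed", "ee"))    # feed (stem "f" has m = 0)
```

A full stemmer chains many such rules in ordered steps; this sketch shows only how one condition on m gates one suffix replacement.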

“…Al-Obaydy, Hashim, Najm and Jalal propose an innovative approach for categorizing research articles into thematic groups, leveraging Term Frequency-Inverse Document Frequency (TF-IDF) and K-means clustering. The methodology is designed to address the challenges researchers face in navigating the vast corpus of scientific literature, aiming to cluster text documents into meaningful groups that represent similar scientific fields [1]. Shetty and Kallimani introduce an innovative approach leveraging K-Means clustering for extractive text summarization, focusing on preserving semantic richness while eliminating redundancy [2].…”

Section: Introduction (Literary Review) (mentioning)

confidence: 99%
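
To make the TF-IDF plus K-means pipeline mentioned in this statement concrete, here is a minimal pure-Python sketch. It is not the cited authors' implementation: the whitespace tokenizer, the unsmoothed log(N/df) weighting, the farthest-point centroid seeding, and the four toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors (tf = count / doc length,
    idf = log(N / df)); the tokenizer is a plain whitespace split."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] / len(toks) * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-point seeding
    (an assumption here; random seeding is also common)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "neural network training deep learning",
    "deep learning neural model",
    "soil erosion river sediment",
    "river sediment transport soil",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)  # the first two documents share one label, the last two the other
```

On real abstracts one would normally add stop-word removal and stemming before weighting, and choose k with a validity measure rather than fixing it by hand.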

Exploration of the Thematic Clustering and Collaboration Opportunities in Kazakhstani Research

Biloshchytskyi, Shamgunova, Biloshchytska

2024

sjaitu

In today's academic environment, the rapid growth of research publications calls for advanced methods to organize and understand the extensive collections of academic work. This study aims to systematically categorize a substantial number of research paper abstracts from Kazakhstani institutions, focusing on identifying key themes and potential interdisciplinary collaboration opportunities. The dataset includes 13,356 abstracts from the Scopus database, covering a wide range of academic fields. The methodology of this research goes beyond traditional hand-done analysis by using advanced text analysis tools to organize the text data efficiently. This initial phase is crucial for summarizing each abstract's core content. The next steps of the analysis use this organized data to find and group similar thematic areas, considering the complex and multi-dimensional nature of academic research topics. The results reveal a diverse array of research themes, highlighting the dynamic academic contributions from Kazakhstan. Significant areas such as environmental science, technological advancements, linguistics, and cultural studies are among the prominent clusters identified. These insights not only provide an overview of current research directions but also highlight the potential for cross-disciplinary partnerships. Moreover, the findings have important implications for decision-makers, scholars, and educational institutions by illuminating key research areas and collaborative possibilities. This thematic overview acts as a guide for shaping research policies, fostering academic connections, and efficiently distributing resources within the scholarly community. Ultimately, this study adds to the academic conversation by offering a way to navigate and utilize the wealth of information in scientific literature, promoting a more collaborative and integrated research environment.

“…The K-means algorithm provides a simple method for carrying out an approximate solution [11]. It is an exclusive clustering method and one of the most widely used clustering algorithms [12]. This algorithm has been used extensively in previous research [13].…”

Section: Introduction (unclassified)

Algoritma K-Means Clustering Penggunaan Bandwidth Internet (Studi Kasus di Pemerintah Daerah Kabupaten Padang Pariaman)

Mubarak, Defit, Nurcahyo

2023

Explore. jurnal. sistem. inf. dan. telematika

To support government activities, a fast and precise network connection is needed, which requires ample network bandwidth. Bandwidth must be managed so that network speed remains stable. This study examines the pattern of bandwidth usage in the Regional Government of Padang Pariaman Regency using K-Means clustering. The data is taken from Cacti, an open-source, web-based network monitoring application. The extracted dataset covers 32 OPD (Regional Apparatus Organizations) in the Regional Government of Padang Pariaman Regency in 2022. The data is then processed with the K-Means clustering method to obtain cluster targets. Using the attributes OPD Name, Inbound Average, Inbound Maximum, Outbound Average, and Outbound Maximum, the data is divided into 3 clusters representing high, medium, and low bandwidth usage. Calculations are done manually and then tested with RapidMiner software. The manual calculations yield the same number of cluster members as the calculations with RapidMiner.

“…The Okapi BM25 model [10] is a popular probabilistic model that uses criteria such as term frequency, document length, and document frequency to compute the relevance score of a document [11], [12]. Reinforcement learning: Reinforcement learning is a type of machine learning that involves training an agent to learn by trial and error.…”

Section: Knowledge Graphs (mentioning)

confidence: 99%
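
The statement above names the three ingredients of Okapi BM25: term frequency, document length, and document frequency. A small sketch of how they combine follows, assuming the commonly used scoring formula with the non-negative IDF variant log((N - n + 0.5) / (n + 0.5) + 1) and the conventional defaults k1 = 1.5 and b = 0.75. The function name `bm25_scores` and the toy documents are illustrative choices, not taken from the citing paper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Longer-than-average documents are penalised through b.
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "okapi bm25 ranks documents by term frequency",
    "k-means groups documents into clusters",
    "reinforcement learning trains an agent",
]
scores = bm25_scores("bm25 term frequency", docs)
print(scores)  # only the first document has a non-zero score
```

Raising b strengthens the document-length penalty, while k1 controls how quickly repeated occurrences of a term stop adding to the score.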

Efficient information retrieval model: overcoming challenges in search engines-an overview

Sharma, Panda

2023

IJEECS

Search engines play a vital role in information retrieval (IR), indexing and processing vast and diverse data, which now encompasses the ever-expanding wealth of multimedia content. However, search engine performance relies on the efficiency and effectiveness of their information retrieval systems (IRS). To enhance search engine performance, there is a need to develop more efficient and accurate IRS that retrieve relevant information quickly. To address this challenge, various approaches, including inverted indexing, query expansion, and relevance feedback, have been proposed for IR. Although these approaches have shown promising results, their effectiveness and limitations require a comprehensive examination. This research aims to investigate the challenges and opportunities in designing an efficient IRS for search engines and identify key areas for improvement and future research. The study involves a comprehensive literature review on information retrieval impacting academia, industry, healthcare, e-commerce, and other domains. Researchers rely on search engines to access relevant scientific papers, professionals use them to gather market intelligence, and consumers utilize them for product research and decision-making. The findings of this study will contribute to the development of more efficient and effective information retrieval systems, leading to improved search engine performance and user satisfaction.
