Text documents clustering using data mining techniques

Jalal, Ahmed Adeeb; Ali, Basheer Husham

doi:10.11591/ijece.v11i1.pp664-670

Cited by 33 publications

(14 citation statements)

References 25 publications

“…The second step is to remove stop words that don't make sense. The final step separates the roots of the word lemmatization, allowing the processing of words that appear different but have the same root as a single form [43]. After this stage, the texts are ready for the feature extraction process.…”

Section: Pre-processing (mentioning)

confidence: 99%

A Multiclass Approach to Estimating Software Vulnerability Severity Rating with Statistical and Word Embedding Methods

Kekül

Ergen

Arslan

2022

IJCNIS

The analysis and grading of software vulnerabilities is an important process that is done manually by experts today. For this reason, there are time delays, human errors, and excessive costs involved with the process. The final result of these software vulnerability reports created by experts is the calculation of a severity score and a severity rating. The severity rating is the first and foremost value of the software’s vulnerability. The vulnerabilities that can be exploited are only 20% of the total vulnerabilities. The vast majority of exploitations take place within the first two weeks. It is therefore imperative to determine the severity rating without time delays. Our proposed model uses statistical methods and deep learning-based word embedding methods from natural language processing techniques, and machine learning algorithms that perform multi-class classification. Bag of Words, Term Frequency Inverse Document Frequency and Ngram methods, which are statistical methods, were used for feature extraction. Word2Vec, Doc2Vec and Fasttext algorithms are included in the study for deep learning based Word embedding. In the classification stage, Naive Bayes, Decision Tree, K-Nearest Neighbors, Multi-Layer Perceptron, and Random Forest algorithms that can make multi-class classification were preferred. With this aspect, our model proposes a hybrid method. The database used is open to the public and is the most reliable data set in the field. The results obtained in our study are quite promising. By helping experts in this field, procedures will speed up. In addition, our study is one of the first studies containing the latest version of the data size and scoring systems it covers.
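The pre-processing steps in the citation statement (stop-word removal, lemmatization) and the TF-IDF feature extraction named in the abstract can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the stop-word set, the `LEMMAS` lookup table, the sample documents, and the function names are all assumptions made for the example, and the weighting uses the basic `tf * log(N / df)` form without smoothing.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny resources; a real system would use a full
# stop-word list and a proper lemmatizer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and"}
LEMMAS = {"overflows": "overflow", "attackers": "attacker", "allows": "allow"}


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and map words to a base form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Made-up example documents, loosely in the style of vulnerability reports.
docs = [
    "The buffer overflows in the parser",
    "A buffer overflow allows attackers to run code",
    "The parser is slow",
]
vectors = tfidf(docs)
```

Terms that occur in fewer documents receive a larger weight, so in the third sample document "slow" outweighs "parser". The resulting vectors are the kind of input a multi-class classifier would consume.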

“…Data mining is the process by which useful information is collected from large amounts of data. Data mining techniques have been used to solve a variety of real-life problems like clustering [1]. In clustering categorizing a population N data points into K subgroups so that data points in one group are more similar to data points in other groups.…”

Section: Introduction (mentioning)

confidence: 99%

A Novel Hybrid Clustering Approach Based on Black Hole Algorithm for Document Clustering

Malik

Khan

2022

IEEE Access

In information retrieval and text mining, document clustering is a big challenge because the amount of document collection has been increasing, day by day. The problem of clustering is NP-hard, use of meta-heuristic algorithms to solve these problems could be an effective method. When the solution space is large, traditional methods are unable to find a solution in a reasonable amount of time. K-means is a heuristic clustering algorithm, two main issues with heuristic algorithms are the early convergence and trapping in local optima. Moreover, finding the right number of clusters is one of the main drawbacks of the k-means algorithm. The correct value of k is always confusing, different researchers used different methods to solve this problem. To overcome these mentioned problems, this study presents a novel Hybrid approach for document clustering. One of the challenges in existing BH algorithm is the input data type. Recently, the algorithm was only accepting textual data. Another flaw in the existing model is that it doesn't choose how many clusters k to form automatically, and the centroids are chosen at random in it. In this paper, we have constructed a Hybrid cluster identification approach which consists of the Elbow method and Silhouette score for cluster k identification. This paper mainly offers three novel combination of model to represent text documents, namely i) K-mean++ -BH + TF-IDF with fix k ii) K-mean++ -BH + W2V with fix k iii) Hybrid Black Hole with automated k. The proposed improvements have validated on the document clustering problem. Cluster analysis based on two evaluation measures, external (Purity) and internal measures (Silhouette score) are used to report the findings. Experiments have been carried out on the four alphanumeric datasets (Doc50, Reuters, WebKB and News20) as well as on two numeric datasets (Iris and Wine) respectively. The complete result analysis is reported in detail with respect to each research contribution to compare the performance of the proposed algorithm with existing clustering methods. Result shows that the proposed Hybrid BH algorithm outperforms better than the existing clustering methods for all datasets. The clustering of data with and without stop words is examined; additionally, the two alternative word embedding used for data exploration in conjunction with proposed model are also evaluated. In the present study, proposed Hybrid BH algorithm handles the optimal value of k efficiently. This is one of the major contributions of the paper, concluded that Hybrid Black Hole is an effective algorithm for cluster analysis.
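The abstract above mentions using the Silhouette score to pick the number of clusters k. The sketch below illustrates only that idea, under clear assumptions: it uses a plain k-means with a deterministic farthest-point initialisation as a stand-in (it is not the Black Hole algorithm or the authors' hybrid method), the sample points are made up, and all function names are hypothetical. The Elbow method is not shown.

```python
import math


def kmeans(points, k, iters=20):
    """Plain k-means with farthest-point initialisation (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters and
    assumes points are distinct (so max(a, b) is never zero)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def choose_k(points, candidates):
    """Return the candidate k with the highest silhouette, plus all scores."""
    scores = {k: silhouette_score(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Made-up data with two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_k, scores = choose_k(points, [2, 3])
```

On data with two well-separated groups, the silhouette for k = 2 is higher than for k = 3, so `choose_k` selects 2. For document clustering, the points would be document vectors such as TF-IDF or word-embedding representations rather than 2-D coordinates.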

“…Subeno et al [7] aimed to determine the optimal number of corpus topics in the LDA method. The proposed approach in [8] can cluster the text documents of research papers into meaningful categories which contain a similar scientific field using a title, abstract, and keywords of the paper to the categories topics. Chauhan and Shah [9] introduced the preliminaries of the topic modeling techniques and reviewed its extensions and variations.…”

Section: Introduction (mentioning)

confidence: 99%

A text mining and topic modeling based bibliometric exploration of information science research

Silwattananusarn

Kulkanjanapiban

2022

IJ-AI

This study investigates the evolution of information science research based on bibliometric analysis and semantic mining. The study discusses the value and application of metadata tagging and topic modeling. Forty-two thousand seven hundred thirty-eight articles were extracted from Clarivate Analytics' Web of Science Core Collection 2010-2020. This study was divided into two phases. Firstly, bibliometric analyses were performed with VOSviewer. Secondly, the topic identification and evolution trends of information science research were conducted through the topic modeling approach. Latent Dirichlet allocation (LDA) is often used to extract themes from a corpus, and the topic model was a representation of a collection of documents that is simplified using topic-modeling-toolkit (TMT). The top 10 core topics (tags) were information research design, information health-based, model data public, study information studies, analysis effect implications, knowledge support web, data research, social research study, study media information, and research impact time for the studied period. Not only does topic modeling assist in identifying popular topics or related areas within a researcher's area, but it may be used to discover emerging topics or areas of study throughout time.
