Abstract: In this paper, we describe an essential problem in data clustering and present some solutions for it. We investigated distance measures other than the Euclidean type for improving the performance of clustering. We also developed an improved point symmetry-based distance measure and proved its efficiency. We developed a k-means algorithm with a novel distance measure that improves the performance of the classical k-means algorithm. The proposed algorithm does not have the worst-case bound on running …
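The abstract does not spell out its improved measure, but the classical point-symmetry distance it builds on (in the style of Su and Chou) can be sketched as follows; the function name and the toy data are illustrative, not taken from the paper:

```python
import numpy as np

def point_symmetry_distance(x, center, points):
    """Point-symmetry distance of x to `center` relative to `points`:
    small when some other point is (nearly) the mirror image of x
    across the candidate cluster center."""
    diffs = (x - center) + (points - center)            # (x - c) + (x_i - c)
    num = np.linalg.norm(diffs, axis=1)
    den = np.linalg.norm(x - center) + np.linalg.norm(points - center, axis=1)
    den = np.where(den == 0, np.finfo(float).eps, den)  # guard x_i == x == c
    ratios = num / den
    mask = ~np.all(points == x, axis=1)                 # exclude x itself
    return ratios[mask].min()

# Four points placed symmetrically about the origin:
pts = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
d_sym = point_symmetry_distance(np.array([1.0, 0.0]), np.zeros(2), pts)
# d_sym is 0.0: [-1, 0] is the exact mirror of [1, 0] about the origin
```

A symmetry-based measure like this lets k-means-style algorithms recover clusters that are symmetric about their centers even when they are not compact in the Euclidean sense.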
“…Word2vec [34] then mapped the candidate words into the vector space. The K-means algorithm [35] was used to cluster the candidate vectors to obtain positive and negative clustering centers. The candidate words were individually determined positive or negative when the corresponding candidate vectors were close to the positive and negative clustering centers.…”
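The pipeline in the snippet (embed candidate words, cluster the vectors, then read polarity off the nearest center) can be sketched with a minimal Lloyd's k-means; the toy vectors stand in for real Word2vec embeddings, and the seed-vector orientation step is an assumption about how the positive center is identified:

```python
import numpy as np

def kmeans(X, init_idx, iters=20):
    """Minimal Lloyd's k-means, standing in for the cited K-means [35].
    `init_idx`: indices of the initial centers (deterministic for the sketch)."""
    centers = X[init_idx].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                       # nearest-center assignment
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return centers, labels

# Toy stand-ins for Word2vec embeddings of candidate sentiment words
rng = np.random.default_rng(1)
X = np.vstack([np.array([1.0, 1.0]) + 0.1 * rng.normal(size=(5, 2)),     # positive-ish
               np.array([-1.0, -1.0]) + 0.1 * rng.normal(size=(5, 2))])  # negative-ish
centers, labels = kmeans(X, init_idx=[0, len(X) - 1])

# Orient the two centers with a known-positive seed vector, then tag each
# candidate word by the cluster its vector fell into.
pos_idx = int(np.linalg.norm(centers - np.array([1.0, 1.0]), axis=1).argmin())
polarity = ["positive" if l == pos_idx else "negative" for l in labels]
```

In practice the embeddings would be 100- to 300-dimensional Word2vec vectors, but the nearest-center polarity decision works the same way.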
Users' comments after online shopping are critical to product reputation and business improvement. These comments, sometimes known as e-commerce reviews, influence other customers' purchasing decisions. To cope with the large volume of e-commerce reviews, automatic analysis based on machine learning and deep learning has drawn increasing attention. A core task therein is sentiment analysis. However, e-commerce reviews exhibit the following characteristics: (1) inconsistency between comment content and the star rating; (2) a large amount of unlabeled data, i.e., comments without a star rating; and (3) data imbalance caused by sparse negative comments. This paper employs Bidirectional Encoder Representations from Transformers (BERT), one of the best-performing natural language processing models, as the base model. Based on the above data characteristics, we propose the F_MixBERT framework to make more effective use of inconsistent, low-quality, and unlabeled data and to resolve the problem of data imbalance. In the framework, the proposed MixBERT incorporates the MixMatch approach into BERT's high-dimensional vectors to train on unlabeled and low-quality data with generated pseudo-labels. Meanwhile, data imbalance is addressed by Focal loss, which penalizes the contribution of large-scale and easily identifiable data to the total loss. Comparative experiments demonstrate that the proposed framework outperforms BERT and MixBERT for sentiment analysis of e-commerce comments.
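The abstract's imbalance remedy is the standard Focal loss (Lin et al.); a minimal numpy sketch for the binary case follows, with `gamma` and `alpha` set to common defaults rather than values taken from the paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    The (1 - p_t)**gamma factor down-weights easy, well-classified examples,
    so sparse negative reviews are not drowned out by the majority class.
    p: predicted probability of the positive class; y: 0/1 label."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(np.clip(p_t, 1e-12, 1.0))

# An easy example (p_t = 0.95) contributes far less than a hard one (p_t = 0.30)
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.30]), np.array([1]))
```

With `gamma = 0` and `alpha = 1` the expression reduces to plain cross-entropy, which makes the down-weighting effect of the focusing term easy to verify.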
“…(A4) At present, various distance measures are available for clustering, and these measures group under the Minkowski, L(1), L(2), inner-product, Shannon's-entropy, combination, intersection, and fidelity families [4][14][35]. In this section, the paper describes various distance measures under these families [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][32][33][34][35].…”
Section: Distance Measures Taxonomy
“…Distance measures are not only essential to solving the clustering problem; they also apply to pattern recognition, classification, and retrieval problems [4], the derivation of new distance measures [5], text classification and clustering [6], document content comparison [7], time-series data management [8], uncertain data classification [9] and clustering [10], bio-cryptic authentication in cloud databases [11], spatial concentration [12], location fingerprinting [13], author profiling [14], combining density [15], heavy aggregation operators [16], analyzing inconsistent information [17], and network intrusion anomaly detection [18] under high volume, variety, and velocity. The objective of this paper is to identify the best cluster distance measure for cluster creation in big data mining, and this objective is pursued across its six sections.…”
The distance measure is a core idea in data mining techniques such as classification, clustering, and statistical analysis. All clustering taxonomies, including partitioning, hierarchical, density-based, grid-based, model-based, fuzzy, and graph-based methods, use distance measures to categorize data points into different clusters and to construct and validate clusters. Big data mining extends data mining to the dimensions of big data. When a traditional clustering algorithm is applied in big data mining, the distance measure must scale to huge datasets and support heterogeneous data and sources as well as the velocity characteristic of big data. From theoretical, practical, and existing-research perspectives, this paper focuses on the volume, variety, and velocity criteria of big data to identify a suitable distance measure for big data mining and to show how distance measures work within the clustering taxonomy. The study also analyzes the accuracy of all distance measures with the help of a confusion matrix through clustering.
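To make the family grouping concrete, here is a small sketch of one representative from three of the families the survey names (Minkowski, inner-product, and Shannon's-entropy); the function names are illustrative:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski family: p = 1 gives the L1 (Manhattan) distance,
    p = 2 gives the L2 (Euclidean) distance."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_distance(x, y):
    """Inner-product family: 1 minus the cosine similarity."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jensen_shannon(p, q):
    """Shannon's-entropy family: symmetric divergence between
    two probability distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
d1, d2 = minkowski(x, y, 1), minkowski(x, y, 2)   # L1 = 6.0, L2 = sqrt(14)
```

Note that the three measures disagree on this pair: `x` and `y` are far apart under L1/L2 but parallel, so their cosine distance is 0 — which is exactly why the choice of family matters for clustering quality.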
“…It separates a data set into subsets or clusters so that data values in the same cluster share some common characteristics or attributes [2]. It aims to divide the data into groups (clusters) of similar objects [3]. The objects in the same cluster are more similar to each other than to those in other clusters.…”
Density-based spatial clustering of applications with noise (DBSCAN) is a fundamental algorithm for density-based clustering. It can discover clusters of arbitrary shapes and sizes in large amounts of data containing noise and outliers. However, it fails to handle very large datasets, to perform well when new objects are inserted into an existing database, to remove noise points or outliers completely, and to handle the local density variation that exists within a cluster. A good clustering method should therefore allow significant density variation within a cluster and should handle dynamic and large databases. In this paper, an enhancement of the DBSCAN algorithm based on incremental clustering, called AMF-IDBSCAN, is proposed; it incrementally builds clusters of different shapes and sizes in large datasets and eliminates noise and outliers. The proposed AMF-IDBSCAN algorithm uses a canopy clustering algorithm to pre-cluster the data sets and decrease the volume of data, applies an incremental DBSCAN to cluster the data points, and uses an Adaptive Median Filtering (AMF) technique for post-clustering to reduce the number of outliers by replacing noise points with chosen medians. Experiments with AMF-IDBSCAN are performed on University of California Irvine (UCI) repository data sets. The results show that our algorithm performs better than DBSCAN, IDBSCAN, and DMDBSCAN. Povzetek (translated from Slovenian): The article presents a new algorithm, AMF-IDBSCAN, an improved version of DBSCAN, which uses canopy clustering to reduce the data volume and AMF techniques to eliminate noise.
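The first stage of the AMF-IDBSCAN pipeline, canopy pre-clustering, can be sketched with the standard two-threshold algorithm; `t1`/`t2` are the usual loose and tight canopy thresholds, and the values in the demo are illustrative rather than taken from the paper:

```python
import numpy as np

def canopy(points, t1, t2):
    """Canopy pre-clustering (requires t1 > t2): a cheap single pass that
    groups points into (possibly overlapping) canopies, so a costlier
    algorithm such as incremental DBSCAN only runs within each canopy."""
    assert t1 > t2
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining.pop(0)                       # pick a canopy center
        d = np.linalg.norm(points[remaining] - points[center], axis=1)
        # everything within the loose threshold t1 joins this canopy
        canopies.append([center] + [remaining[i] for i in np.flatnonzero(d < t1)])
        # points within the tight threshold t2 are removed from further passes
        remaining = [remaining[i] for i in np.flatnonzero(d >= t2)]
    return canopies

# Two well-separated groups fall into two canopies
pts = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])
groups = canopy(pts, t1=3.0, t2=2.0)
```

Because the distance check is a single cheap pass, canopies cut the volume of pairwise work the subsequent density-based stage must do on large datasets.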