Abstract: In this paper, we describe an essential problem in data clustering and present some solutions for it. We investigated distance measures other than the Euclidean type for improving the performance of clustering. We also developed an improved point symmetry-based distance measure and proved its efficiency. We developed a k-means algorithm with a novel distance measure that improves the performance of the classical k-means algorithm. The proposed algorithm does not have the worst-case bound on running …
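The abstract does not spell out its improved measure, but the classical point-symmetry distance it builds on (in the style of Su and Chou) can be sketched as follows; the function name and the toy data are illustrative, not taken from the paper:

```python
import numpy as np

def point_symmetry_distance(x, center, points):
    """Point-symmetry distance of x to `center` relative to `points`:
    small when some other point is (nearly) the mirror image of x
    across the candidate cluster center."""
    diffs = (x - center) + (points - center)            # (x - c) + (x_i - c)
    num = np.linalg.norm(diffs, axis=1)
    den = np.linalg.norm(x - center) + np.linalg.norm(points - center, axis=1)
    den = np.where(den == 0, np.finfo(float).eps, den)  # guard x_i == x == c
    ratios = num / den
    mask = ~np.all(points == x, axis=1)                 # exclude x itself
    return ratios[mask].min()

# Four points placed symmetrically about the origin:
pts = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
d_sym = point_symmetry_distance(np.array([1.0, 0.0]), np.zeros(2), pts)
# d_sym is 0.0: [-1, 0] is the exact mirror of [1, 0] about the origin
```

A symmetry-based measure like this lets k-means-style algorithms recover clusters that are symmetric about their centers even when they are not compact in the Euclidean sense.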
“…Word2vec [34] then mapped the candidate words into the vector space. The K-means algorithm [35] was used to cluster the candidate vectors to obtain positive and negative clustering centers. The candidate words were individually determined positive or negative when the corresponding candidate vectors were close to the positive and negative clustering centers.…”
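The pipeline in the snippet (embed candidate words, cluster the vectors, then read polarity off the nearest center) can be sketched with a minimal Lloyd's k-means; the toy vectors stand in for real Word2vec embeddings, and the seed-vector orientation step is an assumption about how the positive center is identified:

```python
import numpy as np

def kmeans(X, init_idx, iters=20):
    """Minimal Lloyd's k-means, standing in for the cited K-means [35].
    `init_idx`: indices of the initial centers (deterministic for the sketch)."""
    centers = X[init_idx].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                       # nearest-center assignment
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return centers, labels

# Toy stand-ins for Word2vec embeddings of candidate sentiment words
rng = np.random.default_rng(1)
X = np.vstack([np.array([1.0, 1.0]) + 0.1 * rng.normal(size=(5, 2)),     # positive-ish
               np.array([-1.0, -1.0]) + 0.1 * rng.normal(size=(5, 2))])  # negative-ish
centers, labels = kmeans(X, init_idx=[0, len(X) - 1])

# Orient the two centers with a known-positive seed vector, then tag each
# candidate word by the cluster its vector fell into.
pos_idx = int(np.linalg.norm(centers - np.array([1.0, 1.0]), axis=1).argmin())
polarity = ["positive" if l == pos_idx else "negative" for l in labels]
```

In practice the embeddings would be 100- to 300-dimensional Word2vec vectors, but the nearest-center polarity decision works the same way.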
Users' comments after online shopping are critical to product reputation and business improvement. These comments, sometimes known as e-commerce reviews, influence other customers' purchasing decisions. To cope with the large volume of e-commerce reviews, automatic analysis based on machine learning and deep learning has drawn increasing attention. A core task therein is sentiment analysis. However, e-commerce reviews exhibit the following characteristics: (1) inconsistency between comment content and the star rating; (2) a large amount of unlabeled data, i.e., comments without a star rating; and (3) data imbalance caused by sparse negative comments. This paper employs Bidirectional Encoder Representations from Transformers (BERT), one of the best-performing natural language processing models, as the base model. Based on the above data characteristics, we propose the F_MixBERT framework to make more effective use of inconsistent, low-quality, and unlabeled data and to resolve the problem of data imbalance. In the framework, the proposed MixBERT incorporates the MixMatch approach into BERT's high-dimensional vectors to train on unlabeled and low-quality data with generated pseudo-labels. Meanwhile, data imbalance is addressed by Focal loss, which penalizes the contribution of large-scale and easily identifiable data to the total loss. Comparative experiments demonstrate that the proposed framework outperforms BERT and MixBERT for sentiment analysis of e-commerce comments.
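The abstract's imbalance remedy is the standard Focal loss (Lin et al.); a minimal numpy sketch for the binary case follows, with `gamma` and `alpha` set to common defaults rather than values taken from the paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    The (1 - p_t)**gamma factor down-weights easy, well-classified examples,
    so sparse negative reviews are not drowned out by the majority class.
    p: predicted probability of the positive class; y: 0/1 label."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(np.clip(p_t, 1e-12, 1.0))

# An easy example (p_t = 0.95) contributes far less than a hard one (p_t = 0.30)
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.30]), np.array([1]))
```

With `gamma = 0` and `alpha = 1` the expression reduces to plain cross-entropy, which makes the down-weighting effect of the focusing term easy to verify.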
“…(A4) At present, various distance measures are available for clustering, and these measures group under the Minkowski, L(1), L(2), inner-product, Shannon's-entropy, combination, intersection, and fidelity families [4][14][35]. In this section, the paper describes various distance measures under these families [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][32][33][34][35].…”
Section: Distance Measures Taxonomy
“…Distance measures are not only essential to solving the clustering problem; they also apply to pattern recognition, classification, and retrieval problems [4], the derivation of new distance measures [5], text classification and clustering [6], document content comparison [7], time-series data management [8], uncertain data classification [9] and clustering [10], bio-cryptic authentication in cloud databases [11], spatial concentration [12], location fingerprinting [13], author profiling [14], combining density [15], heavy aggregation operators [16], analyzing inconsistent information [17], and network intrusion anomaly detection [18] under high volume, variety, and velocity. The objective of this paper is to identify the best cluster distance measure for cluster creation in big data mining, and this objective is pursued across its six sections.…”
The distance measure is a core idea in data mining techniques such as classification, clustering, and statistical analysis. All clustering taxonomies, including partitioning, hierarchical, density-based, grid-based, model-based, fuzzy, and graph-based methods, use distance measures to categorize data points into different clusters and to construct and validate clusters. Big data mining extends data mining to the dimensions of big data. When a traditional clustering algorithm is applied in big data mining, the distance measure must scale to huge datasets and support heterogeneous data and sources as well as the velocity characteristic of big data. From theoretical, practical, and existing-research perspectives, this paper focuses on the volume, variety, and velocity criteria of big data to identify a suitable distance measure for big data mining and to show how distance measures work within the clustering taxonomy. The study also analyzes the accuracy of all distance measures with the help of a confusion matrix through clustering.
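To make the family grouping concrete, here is a small sketch of one representative from three of the families the survey names (Minkowski, inner-product, and Shannon's-entropy); the function names are illustrative:

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski family: p = 1 gives the L1 (Manhattan) distance,
    p = 2 gives the L2 (Euclidean) distance."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def cosine_distance(x, y):
    """Inner-product family: 1 minus the cosine similarity."""
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jensen_shannon(p, q):
    """Shannon's-entropy family: symmetric divergence between
    two probability distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log(a / b), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
d1, d2 = minkowski(x, y, 1), minkowski(x, y, 2)   # L1 = 6.0, L2 = sqrt(14)
```

Note that the three measures disagree on this pair: `x` and `y` are far apart under L1/L2 but parallel, so their cosine distance is 0 — which is exactly why the choice of family matters for clustering quality.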
“…It separates a data set into subsets or clusters so that data values in the same cluster share some common characteristics or attributes [2]. It aims to divide the data into groups (clusters) of similar objects [3]. The objects in the same cluster are more similar to each other than to those in other clusters.…”
Density-based spatial clustering of applications with noise (DBSCAN) is a fundamental algorithm for density-based clustering. It can discover clusters of arbitrary shapes and sizes in large amounts of data containing noise and outliers. However, it fails to handle very large datasets, to perform well when new objects are inserted into an existing database, to remove noise points or outliers completely, and to handle the local density variation that exists within a cluster. A good clustering method should therefore allow significant density variation within a cluster and should handle dynamic and large databases. In this paper, an enhancement of the DBSCAN algorithm based on incremental clustering, called AMF-IDBSCAN, is proposed; it incrementally builds clusters of different shapes and sizes in large datasets and eliminates noise and outliers. The proposed AMF-IDBSCAN algorithm uses a canopy clustering algorithm to pre-cluster the data sets and decrease the volume of data, applies an incremental DBSCAN to cluster the data points, and uses an Adaptive Median Filtering (AMF) technique for post-clustering to reduce the number of outliers by replacing noise points with chosen medians. Experiments with AMF-IDBSCAN are performed on University of California Irvine (UCI) repository data sets. The results show that our algorithm performs better than DBSCAN, IDBSCAN, and DMDBSCAN. Povzetek (translated from Slovenian): The article presents a new algorithm, AMF-IDBSCAN, an improved version of DBSCAN, which uses canopy clustering to reduce the data volume and AMF techniques to eliminate noise.
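The first stage of the AMF-IDBSCAN pipeline, canopy pre-clustering, can be sketched with the standard two-threshold algorithm; `t1`/`t2` are the usual loose and tight canopy thresholds, and the values in the demo are illustrative rather than taken from the paper:

```python
import numpy as np

def canopy(points, t1, t2):
    """Canopy pre-clustering (requires t1 > t2): a cheap single pass that
    groups points into (possibly overlapping) canopies, so a costlier
    algorithm such as incremental DBSCAN only runs within each canopy."""
    assert t1 > t2
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining.pop(0)                       # pick a canopy center
        d = np.linalg.norm(points[remaining] - points[center], axis=1)
        # everything within the loose threshold t1 joins this canopy
        canopies.append([center] + [remaining[i] for i in np.flatnonzero(d < t1)])
        # points within the tight threshold t2 are removed from further passes
        remaining = [remaining[i] for i in np.flatnonzero(d >= t2)]
    return canopies

# Two well-separated groups fall into two canopies
pts = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])
groups = canopy(pts, t1=3.0, t2=2.0)
```

Because the distance check is a single cheap pass, canopies cut the volume of pairwise work the subsequent density-based stage must do on large datasets.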