On the parallel complexity of hierarchical clustering and CC‐complete problems

Greenlaw, Raymond; Kantabutra, Sanpawat

doi:10.1002/cplx.20238

Cited by 11 publications

(15 citation statements)

References 29 publications


“…A prescribed flipping sequence is an ordering of edges in which each succeeding edge's labels may be flipped if and only if neither of its labels has already been flipped. This problem is NC-equivalent to the Lexicographically First Maximal Matching Problem, and so CC-complete; see [10] for a list of CC-complete problems.…”

Section: Results

mentioning

confidence: 99%

Fast Sequential and Parallel Vertex Relabelings of K_m,m

Kantabutra

2015

Int. J. Found. Comput. Sci.

Self Cite


Given an undirected, connected, simple graph G = (V, E), two vertex labelings L_V and L'_V of the vertices of G, and a label flip operation that interchanges a pair of labels on adjacent vertices, the Vertex Relabeling Problem is to transform G from L_V into L'_V using the flip operation. Agnarsson et al. showed solving the Vertex Relabeling Problem on arbitrary graphs can be done in Θ(n^2), where n is the number of vertices in G. In this article we study the Vertex Relabeling Problem on graphs K_{m,m} and introduce the concept of parity and precise labelings. We show that, when we consider the parity labeling, the problem on graphs K_{m,m} can be solved quickly in O(log m) time using m processors on an EREW PRAM. Additionally, we also show that the number of processors can be further reduced to m/log m in this case while the time complexity does not change. When the labeling is precise, the parallel time complexity increases by a factor of log m while the processor complexities remain m and m/log m. We also show that, when graphs are restricted to K_{m,m}, this problem can be solved optimally in O(m) time when the labeling is parity, and can be solved in O(m log m) time when the labeling is precise, thereby improving the result in Agnarsson et al. for this specific case. Moreover, we generalize the result in the case of precise labeling to the cases when L_V and L'_V can be any configuration. In the end we give a conclusion and a list of some interesting open problems.
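The flip operation and the prescribed flipping sequence quoted in the citation statement can be illustrated with a small sketch. This is not the algorithm from the cited paper: it is a sequential illustration under the assumption that "an edge's labels may be flipped if and only if neither has already been flipped" is applied greedily in the given edge order. The function name `apply_prescribed_flips` and the vertex names are hypothetical, and labels are assumed to be distinct.

```python
def apply_prescribed_flips(labels, edge_order):
    """Greedily apply label flips along a prescribed edge ordering.

    labels     -- dict mapping each vertex to its (distinct) label
    edge_order -- list of (u, v) edges in the prescribed order

    Assumption (not taken from the cited paper): an edge is flipped
    exactly when neither of the labels currently on its endpoints has
    been flipped before.
    """
    labels = dict(labels)      # work on a copy, leave the input intact
    flipped_labels = set()     # labels that have already been moved
    flipped_edges = []         # edges whose labels were interchanged

    for u, v in edge_order:
        if labels[u] in flipped_labels or labels[v] in flipped_labels:
            continue           # one of the labels was already flipped
        # The flip operation: interchange the labels on adjacent vertices.
        labels[u], labels[v] = labels[v], labels[u]
        flipped_labels.update((labels[u], labels[v]))
        flipped_edges.append((u, v))

    return labels, flipped_edges


# K_{2,2} with parts {a0, a1} and {b0, b1}
start = {"a0": 1, "a1": 2, "b0": 3, "b1": 4}
order = [("a0", "b0"), ("a0", "b1"), ("a1", "b1"), ("a1", "b0")]
final, used = apply_prescribed_flips(start, order)
```

Under this reading, a flipped label never moves again, so the flipped edges share no endpoints and form the greedy maximal matching for the given edge order. That is one way to see the connection to the Lexicographically First Maximal Matching Problem mentioned in the citation statement.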


“…A variety of techniques have been developed for proving lower bounds on complexity of clustering [2,22,3]. When we run our Hadoop cluster on Amazon Elastic MapReduce, we can easily expand or shrink the number of virtual servers in our cluster depending on our processing needs.…”

Section: Introduction

mentioning

confidence: 99%

How economical are Bounds on Inverted Index Summarization for Calculating Hadoop Channel?

Ravi, Kiran

2016

IJAIS


We develop a novel technique for resizable Hadoop cluster's lower bounds, the template matching rectangular array of inverted Index summarization expressions. Specifically, fix an arbitrary hybrid kernel function f : {0,1} → {0,1} and let A be the rectangular array of inverted Index summarization expressions whose columns are each an application of f to some subset of the variables x_1, x_2, …, x_4. We prove that A has bounded-capacity resizable Hadoop cluster's complexity Ω(d), where d is the approximate degree of f. This finding remains valid in the MapReduce programming model, regardless of prior measurement. In particular, it gives a new and simple proof of lower bounds for robustness and other symmetric conjunctive predicates. We further characterize the discrepancy, approximate PageRank, and approximate trace distance norm of A in terms of well-studied analytic properties of f, broadly generalizing several findings on small-bias resizable Hadoop cluster and agnostic inference. The method of this paper has also enabled important progress in multi-cloud resizable Hadoop cluster's complexity.


“…r ← get_radius_centroid(c, C_u) 14: 6 Complexity on the actual demand or the relationship between the data in the dataset. For a covering with fewer sample points, the single linkage method (using the Euclidean distance) in the hierarchical clustering algorithm [24,25] is adopted to merge them to form an ellipsoidal domain, which means combining the most similar pair of clusters into a new cluster. Then, the similarities between the new cluster and the other clusters are updated, and the two most similar clusters are again merged.…”

mentioning

confidence: 99%

Self‐Adaptive K‐Means Based on a Covering Algorithm

et al. 2018


The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number k in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. CA self-organizes and recognizes the number of clusters k based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature, that is, k is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the accuracy and efficiency of the C-K-means algorithm outperform the existing algorithms under both sequential and parallel conditions.
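The citation statement for this paper describes single-linkage merging: repeatedly combine the two most similar clusters and update the similarities. A minimal sketch of that step follows, assuming Euclidean distance and a naive pairwise search. It is not the covering algorithm or the C-K-means implementation from the paper, and the function name `single_linkage` and the `num_clusters` stopping rule are hypothetical choices made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def single_linkage(points, num_clusters):
    """Naive agglomerative clustering with the single-linkage criterion.

    points       -- list of coordinate tuples
    num_clusters -- stop merging once this many clusters remain
                    (an illustrative stopping rule, not from the paper)
    """
    clusters = [[p] for p in points]   # start with one cluster per point

    while len(clusters) > num_clusters:
        best = None                    # (distance, i, j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is the distance
                # between the closest pair of members.
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the most similar pair
        del clusters[j]                  # j > i, so index i stays valid

    return clusters


sample = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = single_linkage(sample, 3)
```

This version recomputes every pairwise distance after each merge, which keeps the sketch short but is far from efficient. It is meant only to make the merge-and-update loop concrete.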


On the parallel complexity of hierarchical clustering and CC‐complete problems

Abstract: Complex data sets are often unmanageable unless they can be subdivided and simplified in an intelligent manner. Clustering is a technique that is used in data mining
