Scalable Hierarchical Clustering Method for Sequences of Categorical Values

Morzy, Tadeusz; Wojciechowski, Marek; Zakrzewicz, Maciej

doi:10.1007/3-540-45357-1_31

Cited by 26 publications

(11 citation statements)

References 15 publications

“…POPC algorithm proposed in [19], which starts with a set of elementary sub-clusters and merges them iteratively until the pre-defined stop condition defined in advance is satisfied, is pretty similar to the method introduced in this paper. In [19], the author introduces two variants as POPC-J using the Jaccard coefficient [16] of the clusters' contents and POPC-GA using the group average of co-occurrences of patterns describing clustering.…”

Section: Definition 15 (Session P) (mentioning)

confidence: 95%

A hierarchical clustering algorithm based on fuzzy graph connectedness

Dong, Zhuang, Chen et al. 2006

Fuzzy Sets and Systems

“…In [19], the author introduces two variants as POPC-J using the Jaccard coefficient [16] of the clusters' contents and POPC-GA using the group average of co-occurrences of patterns describing clustering. POPC-GA is selected to compare with the proposed algorithm because POPC-GA is much more efficient than POPC-J.…”

Section: Definition 15 (Session P) (mentioning)

confidence: 99%

A hierarchical clustering algorithm based on fuzzy graph connectedness

Dong, Zhuang, Chen et al. 2006

Fuzzy Sets and Systems
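
The merging scheme described in the two statements above (start from elementary sub-clusters, merge iteratively until a stop condition holds, compare clusters by the Jaccard coefficient of their contents) can be illustrated with a minimal sketch. This is not the POPC implementation from the cited paper: the function names `jaccard` and `merge_until`, the representation of a cluster as a set of sequence identifiers, and both stop conditions (`target_count`, `min_similarity`) are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a & b| / |a | b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_until(clusters, target_count, min_similarity=0.0):
    """Repeatedly merge the most similar pair of clusters.

    Assumed stop conditions (hypothetical, not taken from the paper):
    stop when `target_count` clusters remain, or when the best pair is
    not more similar than `min_similarity`.
    """
    clusters = [set(c) for c in clusters]  # copy so the input is not mutated
    while len(clusters) > max(target_count, 1):
        # Pick the pair of cluster indices with the highest Jaccard coefficient.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: jaccard(clusters[p[0]], clusters[p[1]]),
        )
        if jaccard(clusters[i], clusters[j]) <= min_similarity:
            break  # nothing similar enough is left to merge
        clusters[i] |= clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


# Hypothetical elementary sub-clusters: each set holds sequence identifiers.
elementary = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {8, 9}]
result = merge_until(elementary, target_count=2)
```

The exhaustive pairwise search is quadratic per merge step, so the sketch only shows the control flow; a group-average variant would replace `jaccard` with a different similarity function.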

“…Clones may also form implicit links between components that share some functionality. All this contributes towards "software aging" [36].…”

Section: The Cloning Problem (mentioning)

confidence: 99%

Detecting higher-level similarity patterns in programs

Basit, Jarzabek 2005

SIGSOFT Softw. Eng. Notes

Cloning in software systems is known to create problems during software maintenance. Several techniques have been proposed to detect the same or similar code fragments in software, so-called simple clones. While the knowledge of simple clones is useful, detecting design-level similarities in software could ease maintenance even further, and also help us identify reuse opportunities. We observed that recurring patterns of simple clones (so-called structural clones) often indicate the presence of interesting design-level similarities. An example would be patterns of collaborating classes or components. Finding structural clones that signify potentially useful design information requires efficient techniques to analyze the bulk of simple clone data and making non-trivial inferences based on the abstracted information.

In this paper, we describe a practical solution to the problem of detecting some basic, but useful, types of design-level similarities such as groups of highly similar classes or files. First, we detect simple clones by applying conventional token-based techniques. Then we find the patterns of co-occurring clones in different files using the Frequent Itemset Mining (FIM) technique. Finally, we perform file clustering to detect those clusters of highly similar files that are likely to contribute to a design-level similarity pattern. The novelty of our approach is application of data mining techniques to detect design level similarities. Experiments confirmed that our method finds many useful structural clones and scales up to big programs. The paper describes our method for structural clone detection, a prototype tool called Clone Miner that implements the method and experimental results.

“…Morzy et al [8] assumed that sequential patterns were given and then started clustering with data that included more than one of these given sequential patterns. Hay et al [5] presented a clustering algorithm that used an edit distance method to measure the similarity between sequences, while Wang and Zaiane [9] proposed a clustering method based on a sequence alignment method to measure the similarity between sequences.…”

Section: Related Work (mentioning)

confidence: 99%
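
The statement above mentions edit distance as a similarity measure between sequences. As a minimal sketch, the classic Levenshtein distance with unit costs for insertion, deletion, and substitution is shown below; the unit costs and the function name `edit_distance` are assumptions, and the measure actually used by Hay et al. may differ.

```python
def edit_distance(s, t) -> int:
    """Levenshtein distance between two sequences of hashable items.

    Assumes unit cost for insertion, deletion, and substitution.
    Keeps only the previous row of the dynamic-programming table.
    """
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, a in enumerate(s, 1):
        cur = [i]  # deleting the first i items of s yields the empty prefix of t
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,             # delete a from s
                cur[j - 1] + 1,          # insert b into s
                prev[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


# Hypothetical sessions expressed as sequences of categorical values.
session_a = ["home", "cart", "pay"]
session_b = ["home", "search", "cart", "pay"]
distance = edit_distance(session_a, session_b)
```

Because the function works on any sequence of comparable items, it applies to strings as well as to lists of categorical values such as page categories.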