Hierarchical document clustering using local patterns

Malik, Hassan; Kender, John R.; Fradkin, Dmitriy; Moerchen, Fabian

doi:10.1007/s10618-010-0172-z

Cited by 24 publications

(13 citation statements)

References 21 publications

“…Malik et al proposed a pattern-based instance-driven hierarchical clustering algorithm (IDHC) that builds a cluster hierarchy without mining for globally significant patterns. 21 While Antonio et al presented a divisive hierarchical method which is based on the use of the k-means method embedded in a recursive algorithm to obtain a clustering at each node of the hierarchy. 4 Liao et al introduces a sample-based hierarchical adaptive K-means (SHAKM) clustering algorithm, which employs multilevel random sampling to handle large databases and utilizes the adaptive K-means clustering algorithm to determine the correct number of clusters.…”

Section: Reviews Of Clustering Methods (mentioning)

confidence: 99%

Parallel N-Path Quantification Hierarchical K-Means Clustering Algorithm for Video Retrieval

Liao, Zhao, Zheng, et al. 2017

Int. J. Patt. Recogn. Artif. Intell.

Using clustering method to detect useful patterns in large datasets has attracted considerable interest recently. The HKM clustering algorithm (Hierarchical K-means) is very efficient in large-scale data analysis. It has been widely used to build visual vocabulary for large scale video/image retrieval system. However, the speed and even the accuracy of hierarchical K-means clustering algorithm still have room to be improved. In this paper, we propose a Parallel N-path quantification hierarchical K-means clustering algorithm which improves on the hierarchical K-means clustering algorithm in the following ways. Firstly, we replace the Euclidean kernel with the Hellinger kernel to improve the accuracy. Secondly, the Greedy N-best Paths Labeling method is adopted to improve the clustering accuracy. Thirdly, the multi-core processors-based parallel clustering algorithm is proposed. Our results confirm that the proposed clustering algorithm is much faster and more effective.
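
The abstract above mentions replacing the Euclidean kernel with the Hellinger kernel but does not say how it is applied. As a rough illustration only, the sketch below uses the standard definition of the Hellinger kernel for non-negative, L1-normalized histograms, K(x, y) = sum_i sqrt(x_i * y_i). The function names are hypothetical, and nothing here is taken from the cited paper's implementation.

```python
import math


def l1_normalize(hist):
    """Scale a non-negative histogram so that its entries sum to 1."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [v / total for v in hist]


def hellinger_kernel(x, y):
    """Hellinger kernel: sum of sqrt(x_i * y_i) over L1-normalized inputs."""
    x, y = l1_normalize(x), l1_normalize(y)
    return sum(math.sqrt(a * b) for a, b in zip(x, y))


def hellinger_distance_sq(x, y):
    """Squared Euclidean distance between element-wise square roots.

    For non-zero L1-normalized histograms this equals 2 - 2 * K(x, y),
    so Euclidean k-means on sqrt-mapped vectors uses the Hellinger kernel.
    """
    x, y = l1_normalize(x), l1_normalize(y)
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(x, y))
```

One consequence of this identity is that an existing Euclidean k-means routine could be reused unchanged after L1-normalizing each vector and taking element-wise square roots, assuming the features are non-negative.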

“…More recently Fukumoto and Suzuki performed cluster labeling by relying on concepts in a machine readable dictionary (Fukumoto and Suzuki, 2011) with positive results. In another distinct recent work, Malik, et al, focused on finding patterns (i.e., labels) and clusters simultaneously as an alternative to explicitly identifying labels for existing clusters (Malik et al, 2010).…”

Section: Related Work (mentioning)

confidence: 99%

Bringing Order to Legal Documents - An Issue-based Recommendation System Via Cluster Association

Lü, Conrad 2012

Proceedings of the International Conference on Knowledge Engineering and Ontology Development

Abstract: The task of recommending content to professionals (such as attorneys or brokers) differs greatly from the task of recommending news to casual readers. A casual reader may be satisfied with a couple of good recommendations, whereas an attorney will demand precise and comprehensive recommendations from various content sources when conducting legal research. Legal documents are intrinsically complex and multi-topical, contain carefully crafted, professional, domain specific language, and possess a broad and unevenly distributed coverage of issues. Consequently, a high quality content recommendation system for legal documents requires the ability to detect significant topics from a document and recommend high quality content accordingly. Moreover, a litigation attorney preparing for a case needs to be thoroughly familiar not only with the principal arguments associated with various supporting opinions, but also with the secondary and tertiary arguments. This paper introduces an issue-based content recommendation system with a built-in topic detection/segmentation algorithm for the legal domain. The system leverages existing legal document metadata such as topical classifications, document citations, and click stream data from user behavior databases, to produce an accurate topic detection algorithm. It then links each individual topic to a comprehensive pre-defined topic (cluster) repository via an association process. A cluster labeling algorithm is designed and applied to provide a precise, meaningful label for each of the clusters in the repository, where each cluster is also populated with member documents from across different content types. This system has been applied successfully to very large collections of legal documents, O(100M), which include judicial opinions, statutes, regulations, court briefs, and analytical documents. Extensive evaluations were conducted to determine the efficiency and effectiveness of the algorithms in topic detection, cluster association, and cluster labeling. Subsequent evaluations conducted by legal domain experts have demonstrated that the quality of the resulting recommendations across different content types is close to those created by human experts.

“…The number of clusters k was set to obtain clusters that averaged 100 documents (i.e., if clustering 1000 documents, k = 10). We additionally used a more recent clustering algorithm, IDHC [9], which is different than k-means in that it does not take a parameter k and produces a variable number of clusters.…”

Section: A Experimental Setup (mentioning)

confidence: 99%
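
The citation statement above sets k so that clusters average 100 documents (1000 documents give k = 10). A minimal sketch of that rule follows. The function name, the rounding behavior for counts that are not multiples of 100, and the floor of one cluster are assumptions, since the quoted text only gives the single example.

```python
def choose_k(n_docs, target_cluster_size=100):
    """Pick k so that clusters average roughly target_cluster_size documents.

    Rounding and the minimum of one cluster are assumptions; the quoted
    statement only gives the example of 1000 documents and k = 10.
    """
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    if target_cluster_size <= 0:
        raise ValueError("target_cluster_size must be positive")
    return max(1, round(n_docs / target_cluster_size))
```

IDHC, by contrast, is described in the same statement as taking no parameter k, so no such rule applies to it.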

Automatic Training Data Cleaning for Text Classification

Malik, Bhardwaj 2011

2011 IEEE 11th International Conference on Data Mining Workshops

Self Cite

Abstract: Supervised text classification algorithms rely on the availability of large quantities of quality training data to achieve their optimal performance. However, not all training data is created equal and the quality of class-labels assigned by human experts may vary greatly with their levels of experience, domain knowledge, and the time available to label each document. In our experiments, focused label validation and correction by expert journalists improved the Micro and Macro-F1 scores achieved by Linear SVMs by as much as 14.5% and 30% respectively, on a corpus of professionally labeled news stories. Manual label correction is an expensive and time consuming process and the classification quality may not linearly improve with the amount of time spent, making it increasingly more expensive to achieve higher classification quality targets. We propose ATDC, a novel evidence-based training data cleaning method that uses training examples with high-quality class-labels to automatically validate and correct labels of noisy training data. A subset of these instances are then selected to augment the original training set. On a large noisy dataset with about two million news stories, our method improved the baseline Micro-F1 and Macro-F1 scores by 9% and 13% respectively, without requiring any further human intervention.
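
The abstract above reports both Micro-F1 and Macro-F1. The two differ in how per-class results are aggregated: micro-averaging pools the counts over all classes, while macro-averaging takes the unweighted mean of per-class F1 scores, so rare classes weigh more heavily in the macro figure. The sketch below illustrates this for the single-label case only, which is a simplifying assumption and not necessarily the setup of the cited paper; the function names are hypothetical.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0, 0.0
    # Micro: pool counts across classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then average with equal weight.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro
```

For example, with true labels `["a", "a", "a", "b"]` and every prediction equal to `"a"`, the micro score is 0.75 while the macro score is 3/7, because the missed minority class contributes an F1 of zero to the macro average.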
