Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2002
DOI: 10.1145/775047.775076
Enhanced word clustering for hierarchical text classification

Abstract: In this paper we propose a new information-theoretic divisive algorithm for word clustering applied to text classification. In previous work, such "distributional clustering" of features has been found to achieve improvements over feature selection in terms of classification accuracy, especially at lower numbers of features [2,28]. However, the existing clustering techniques are agglomerative in nature and result in (i) sub-optimal word clusters and (ii) high computational cost. In order to explicitly capture t…
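The divisive approach the abstract describes can be illustrated with a small sketch: a k-means-style loop that represents each word by its class distribution p(C|w) and repeatedly assigns it to the word cluster whose distribution is nearest in KL divergence. This is an illustrative reconstruction, not the paper's exact algorithm; the toy data, uniform word priors, and cluster count are invented for the example.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def divisive_kl_cluster(P, priors, k, iters=20, seed=0):
    """Cluster rows of P (row i = p(C | word i)) into k word clusters
    by k-means with KL divergence as the distortion measure."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    assign = rng.integers(0, k, size=n)
    for _ in range(iters):
        # Update step: each cluster's distribution is the prior-weighted
        # mean of its members' distributions (uniform fallback if empty).
        centers = np.full((k, P.shape[1]), 1.0 / P.shape[1])
        for j in range(k):
            members = assign == j
            if members.any():
                w = priors[members] / priors[members].sum()
                centers[j] = w @ P[members]
        # Assignment step: move each word to the KL-nearest cluster.
        new = np.array([min(range(k), key=lambda j: kl(P[i], centers[j]))
                        for i in range(n)])
        if np.array_equal(new, assign):
            break
        assign = new
    return assign

# Toy data: four words over two classes, forming two obvious groups.
P = np.array([[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.2, 0.8]])
priors = np.full(4, 0.25)
labels = divisive_kl_cluster(P, priors, k=2)
```

Because each pass reassigns all words at once against the current cluster distributions, the loop avoids the quadratic merge cost of agglomerative schemes.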





Cited by 95 publications (54 citation statements) | References 20 publications
“…The works of (Slonim & Tishby, 2000) and (Yaniv & Souroujon, 2001) use heuristic procedures to cluster documents and features independently using an agglomerative algorithm. (Dhillon et al., 2002, 2003b), on the other hand, propose an information-theoretic co-clustering algorithm that intertwines row (feature) and column (document) clustering. The algorithm starts with a random partition of rows, X, and columns, Y, and computes an approximation q(X,Y) to the original distribution P(X,Y), along with a corresponding compressed distribution, by clustering rows and columns in an intertwined fashion, i.e.…”
Section: Co-clustering (Clustering Features and Documents)
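The intertwined row/column procedure described in the quotation can be sketched as an alternating scheme: compress the columns into their current column clusters, re-cluster the rows against the compressed matrix, then repeat with the roles swapped. This is a simplified illustration of the co-clustering idea, not the exact update rule of Dhillon et al.; the block-structured toy matrix and the deterministic initial partitions are invented for the example.

```python
import numpy as np

def _kl_assign(D, centers, eps=1e-12):
    """Assign each row of D to the KL-nearest center.
    argmin_j KL(d || c_j) = argmin_j -(d . log c_j), because the
    entropy term of d is constant across centers."""
    D = np.clip(D, eps, None)
    centers = np.clip(centers, eps, None)
    return (-(D @ np.log(centers).T)).argmin(axis=1)

def _centers(D, assign, k):
    """Mean of each cluster's rows; uniform fallback for empty clusters."""
    C = np.full((k, D.shape[1]), 1.0 / D.shape[1])
    for j in range(k):
        if (assign == j).any():
            C[j] = D[assign == j].mean(axis=0)
    return C

def co_cluster(P, k_rows, k_cols, r, c, iters=10):
    """Alternately refine row clusters r and column clusters c of the
    joint distribution P (rows = features, columns = documents)."""
    P = P / P.sum()
    for _ in range(iters):
        # Compress columns into column clusters, then re-cluster rows
        # by their distributions over the column clusters.
        Pc = np.stack([P[:, c == j].sum(axis=1) for j in range(k_cols)], axis=1)
        rowdist = Pc / np.clip(Pc.sum(axis=1, keepdims=True), 1e-12, None)
        r = _kl_assign(rowdist, _centers(rowdist, r, k_rows))
        # Same step with the roles swapped: compress rows, re-cluster columns.
        Pr = np.stack([P[r == i].sum(axis=0) for i in range(k_rows)], axis=0)
        coldist = (Pr / np.clip(Pr.sum(axis=0, keepdims=True), 1e-12, None)).T
        c = _kl_assign(coldist, _centers(coldist, c, k_cols))
    return r, c

# Block-structured toy joint distribution: rows 0-1 co-occur with
# columns 0-1, and rows 2-3 with columns 2-3.
P = np.array([[4., 2., 0., 0.],
              [4., 2., 0., 0.],
              [0., 0., 1., 5.],
              [0., 0., 1., 5.]])
r, c = co_cluster(P, 2, 2, r=np.array([0, 0, 0, 0]), c=np.array([0, 1, 1, 1]))
```

Even from a poor starting partition, the alternating passes pull the row and column clusters toward the block structure of the matrix, mirroring how the intertwined updates reinforce each other.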
“…Experiments conducted demonstrate the efficiency of the algorithm, especially in the presence of sparsity. (Dai et al., 2007) extend the co-clustering algorithm of (Dhillon et al., 2002, 2003b) and present a co-clustering classification algorithm (CoCC) that focuses on classifying documents across different text domains. There is a labelled data set D_i from one domain, called in-domain, and an unlabelled data set D_o from a related but different domain, called out-of-domain, that is to be classified.…”
Section: Co-clustering (Clustering Features and Documents)
“…A connection between multinomial model-based clustering and the divisive Kullback-Leibler clustering (Dhillon et al., 2002; Dhillon & Guan, 2003) is worth mentioning here. It is briefly mentioned in Dhillon and Guan (2003), but they did not explicitly stress that divisive KL clustering is equivalent to multinomial model-based k-means, which maximizes the following objective function:…”
Section: Multinomial Models
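The equivalence noted in the quotation can be checked numerically: picking the cluster with maximal multinomial log-likelihood selects the same cluster as picking the distribution with minimal KL divergence from the document's empirical distribution, because the two objectives differ only by terms that do not depend on the cluster. The counts and cluster parameters below are invented for the illustration.

```python
import numpy as np

counts = np.array([6, 3, 1])             # word counts of one document
thetas = np.array([[0.6, 0.3, 0.1],      # two candidate cluster multinomials
                   [0.2, 0.2, 0.6]])

# Multinomial log-likelihood (dropping the count-only combinatorial term).
loglik = counts @ np.log(thetas).T

# KL divergence from the empirical distribution to each cluster model:
# KL(p || theta) = sum p log p - p . log theta.  The cluster-dependent
# part is -p . log theta, i.e. -loglik / N, so the rankings coincide.
p = counts / counts.sum()
kl = (p * np.log(p)).sum() - p @ np.log(thetas).T

best_by_loglik = int(loglik.argmax())
best_by_kl = int(kl.argmin())
```

Both criteria choose the first cluster here, whose parameters match the document's empirical word frequencies.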
“…24; Rasmussen 1992; Silverstein and Pedersen 1997; Jain et al. 1999; Manning and Schütze 2001: ch. 14; Everitt et al. 2001; Duda et al. 2001; Dhillon et al. 2002; Berkhin 2000; Yao and Choi 2003).…”
Section: Word Clustering