Abstract: Nichnan KITTIPHATTANABAWON†a), Student Member, Thanaruk THEERAMUNKONG†b), and Ekawit NANTAJEEWARAWAT†c), Members. SUMMARY: Recently, association rule mining has been applied to track and relate news documents from several sources because of its performance and scalability. This paper presents an empirical investigation of how term representation basis, term weighting, and association measure affect the quality of relations discovered among news documents. Twenty-four combinations initiated by two term represe…
“…A representative pattern is generated in either linear- or exponential scale. As the graphic relationship, the relation discovered among news documents [29] is presented on the measurement values with its confidence and the size ratio.…”
Section: Measure Characterization By Pairwise Comparison (mentioning)
Objective interestingness measures play a vital role in association rule mining of large-scale databases because they are used for extracting, filtering, and ranking patterns. Several measures have been proposed in the past, but their similarities and relations have not been sufficiently explored. This work investigates sixty-one objective interestingness measures on the pattern A → B to analyze their similarity and dissimilarity as well as their relationships. Three-probability patterns, P(A), P(B), and P(AB), are enumerated in both linear and exponential scales, and each measure's value under each of those conditions is calculated, forming synthetic data for the investigation. The behavior of each measure is explored by pairwise comparison based on these three-probability patterns. The relationships among the sixty-one interestingness measures are characterized with correlation analysis and association rule mining. In the experiment, the relationships are summarized using a heat map and the mined association rules. As a result, an appropriate interestingness measure can be selected using the generated heat map and association rules.
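The enumeration and pairwise comparison described in this abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the grid sizes, the three measures shown (confidence, lift, Jaccard), the use of Pearson correlation, and every function name are assumptions chosen for illustration, and the abstract does not say which sixty-one measures or which correlation coefficient were used.

```python
import itertools
import math

def linear_grid(n=10):
    """Linear scale: 1/n, 2/n, ..., 1 (grid size is an assumption)."""
    return [i / n for i in range(1, n + 1)]

def exponential_grid(n=5):
    """Exponential scale: 2^-n, ..., 2^-1, 1 (base 2 is an assumption)."""
    return [2.0 ** -k for k in range(n, -1, -1)]

def valid_patterns(grid, eps=1e-12):
    """Yield (P(A), P(B), P(AB)) triples that are consistent probabilities."""
    for pa, pb, pab in itertools.product(grid, repeat=3):
        # P(AB) cannot exceed either marginal, and P(A or B) cannot exceed 1.
        if pab <= min(pa, pb) + eps and pa + pb - pab <= 1.0 + eps:
            yield pa, pb, pab

# Three well-known measures for A -> B, used here only as stand-ins.
MEASURES = {
    "confidence": lambda pa, pb, pab: pab / pa,
    "lift": lambda pa, pb, pab: pab / (pa * pb),
    "jaccard": lambda pa, pb, pab: pab / (pa + pb - pab),
}

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def correlation_table(grid):
    """Correlate every pair of measures over the enumerated patterns."""
    patterns = list(valid_patterns(grid))
    values = {name: [f(*p) for p in patterns] for name, f in MEASURES.items()}
    return {
        (a, b): pearson(values[a], values[b])
        for a, b in itertools.combinations(sorted(MEASURES), 2)
    }

if __name__ == "__main__":
    for scale, grid in (("linear", linear_grid()), ("exponential", exponential_grid())):
        for pair, r in correlation_table(grid).items():
            print(scale, pair, round(r, 3))
```

The resulting pair-to-correlation table is the kind of data a heat map would visualize; with sixty-one measures it would have 1,830 entries per scale.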
“…However, most works on these approaches focused on classification, not clustering. Some researchers suggested a combined term weighting to avoid bias from a single source of information [17]. While most works on term weighting focused on classification, it is still an open question on which statistical approaches are suitable for unsupervised learning, like the clustering task.…”
The outcome of document clustering depends on the scheme used to assign a weight to each term in a document. While recent works have tried to use class-related distributions to enhance discrimination ability, it is worth exploring whether a deviation approach or an entropy approach is more effective. This paper presents a comparison between deviation-based distribution and entropy-based distribution as constraints in term weighting. In addition, their potential combinations are investigated to find optimal solutions for guiding the clustering process. In the experiments, the seeded k-means method is used for clustering, and the performance of the deviation-based, entropy-based, and hybrid approaches is analyzed using two English text datasets and one Thai text dataset. The results showed that the deviation-based distribution outperformed the entropy-based distribution, and that a suitable combination of these distributions increases the clustering accuracy by 10%.
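To make the contrast between the two kinds of distribution concrete, here is a minimal sketch of one deviation-style and one entropy-style score for how unevenly a term is spread across classes. The formulas and function names are illustrative assumptions, not the weighting schemes defined in the paper, which the abstract does not spell out.

```python
import math

def class_distribution(counts):
    """Turn per-class term counts into proportions (needs a positive total)."""
    total = sum(counts)
    return [c / total for c in counts]

def deviation_score(counts):
    """Assumed deviation-style score: standard deviation of the class
    proportions. It is 0 when the term is spread evenly across classes."""
    p = class_distribution(counts)
    mean = 1.0 / len(p)
    return math.sqrt(sum((x - mean) ** 2 for x in p) / len(p))

def entropy_score(counts):
    """Assumed entropy-style score: 1 minus normalized Shannon entropy.
    It is 0 for an even spread and 1 when one class holds every occurrence.
    Needs at least two classes so that log(len(p)) is nonzero."""
    p = class_distribution(counts)
    h = -sum(x * math.log(x) for x in p if x > 0)
    return 1.0 - h / math.log(len(p))

if __name__ == "__main__":
    # Hypothetical counts of one term in three classes.
    for counts in ([5, 5, 5], [10, 5, 0], [15, 0, 0]):
        print(counts, round(deviation_score(counts), 3), round(entropy_score(counts), 3))
```

Both scores grow as a term concentrates in fewer classes, but they scale differently, which is one reason a combination of the two could behave differently from either alone.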