Abstract. High-utility itemset mining (HUIM) is an important data mining task with wide applications. In this paper, we propose a novel algorithm named EFIM (EFficient high-utility Itemset Mining), which introduces several new ideas to discover high-utility itemsets more efficiently, both in terms of execution time and memory. EFIM relies on two upper bounds, named sub-tree utility and local utility, to prune the search space more effectively. It also introduces a novel array-based utility counting technique named Fast Utility Counting to calculate these upper bounds in linear time and space. Moreover, to reduce the cost of database scans, EFIM proposes efficient database projection and transaction merging techniques. An extensive experimental study on various datasets shows that EFIM is in general two to three orders of magnitude faster and consumes up to eight times less memory than the state-of-the-art algorithms d2HUP, HUI-Miner, HUP-Miner, FHM and UP-Growth+.
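To make the idea of array-based utility counting concrete, the following is a minimal sketch and not the EFIM implementation. It assumes the standard HUIM setting, in which each item has a unit profit and each transaction records purchase quantities. As a stand-in for the sub-tree utility and local utility bounds, which the abstract does not define, it computes the well-known transaction-weighted utilization (TWU) upper bound in a single database pass using one array slot per item. The data and all function names (`transaction_utility`, `itemset_utility`, `twu_by_array`) are hypothetical.

```python
from typing import Dict, List, Set

# Assumed toy data: unit profit per item, and transactions mapping item -> quantity.
profit: Dict[str, int] = {"a": 5, "b": 2, "c": 1}
db: List[Dict[str, int]] = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 3},
    {"b": 1, "c": 1},
]


def transaction_utility(t: Dict[str, int], profit: Dict[str, int]) -> int:
    """Sum of quantity * unit profit over every item in the transaction."""
    return sum(qty * profit[item] for item, qty in t.items())


def itemset_utility(itemset: Set[str], db: List[Dict[str, int]],
                    profit: Dict[str, int]) -> int:
    """Exact utility: summed over the transactions that contain the whole itemset."""
    total = 0
    for t in db:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total


def twu_by_array(db: List[Dict[str, int]], profit: Dict[str, int],
                 items: List[str]) -> Dict[str, int]:
    """One pass over the database, one array slot ("bin") per item.

    Each bin accumulates the utility of every transaction containing that
    item, which gives the TWU upper bound (an assumption of this sketch,
    used in place of EFIM's own bounds).
    """
    index = {item: k for k, item in enumerate(items)}
    bins = [0] * len(items)
    for t in db:
        tu = transaction_utility(t, profit)
        for item in t:
            bins[index[item]] += tu
    return dict(zip(items, bins))


twu = twu_by_array(db, profit, ["a", "b", "c"])
```

Because the TWU of an item is never smaller than the utility of any itemset containing it, an item whose bin falls below the minimum utility threshold can be discarded without exploring its supersets. The time and space cost of the pass is linear in the database size and the number of items.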
Document information retrieval consists of finding the documents in a collection that are the most relevant to a user query. Information retrieval techniques are widely used by organizations to facilitate the search for information. However, applying traditional information retrieval techniques is time-consuming for large document collections. Recently, cluster-based information retrieval approaches have been developed. Although these approaches are often much faster than traditional approaches for processing large document collections, the quality of the documents they retrieve is often lower than that of traditional approaches. To address this drawback of cluster-based approaches, and to improve the performance of information retrieval in terms of both runtime and quality of retrieved documents, this paper proposes a new cluster-based information retrieval approach named ICIR (Intelligent Cluster-based Information Retrieval). The proposed approach combines k-means clustering with frequent closed itemset mining to extract clusters of documents and find frequent terms in each cluster. Patterns discovered in each cluster are then used to select the most relevant document clusters to answer each user query. Four alternative heuristics are proposed to select the most relevant clusters, and two alternative heuristics are proposed for choosing documents in the selected clusters. Thus, eight versions of the proposed approach are obtained. To validate the proposed approach, extensive experiments have been carried out on well-known document collections. Results show that the designed approach outperforms traditional and cluster-based information retrieval approaches in terms of both execution time and quality of the returned documents.
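The pipeline described above (cluster the documents, mine patterns per cluster, pick a cluster for the query, then pick documents inside it) can be sketched as follows. This is an illustrative simplification and not the ICIR method. It assumes the clusters have already been produced (for example by k-means), it replaces frequent closed itemset mining with frequent single terms, and it uses a plain term-overlap score in place of the paper's heuristics, which the abstract does not specify. All names and data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set

# Assumed input: documents as term sets, already grouped into clusters.
clusters: Dict[int, List[Set[str]]] = {
    0: [{"utility", "itemset", "mining"}, {"itemset", "mining", "pattern"}],
    1: [{"query", "document", "retrieval"}, {"document", "cluster", "retrieval"}],
}


def frequent_terms(docs: List[Set[str]], min_support: float) -> Set[str]:
    """Terms that appear in at least min_support (a fraction) of the cluster's documents."""
    counts = Counter(term for doc in docs for term in doc)
    threshold = min_support * len(docs)
    return {term for term, count in counts.items() if count >= threshold}


def select_cluster(query: Set[str], patterns: Dict[int, Set[str]]) -> int:
    """Hypothetical heuristic: pick the cluster whose frequent terms overlap the query most."""
    return max(patterns, key=lambda cid: len(query & patterns[cid]))


def rank_documents(query: Set[str], docs: List[Set[str]]) -> List[int]:
    """Hypothetical heuristic: order documents in a cluster by term overlap with the query."""
    return sorted(range(len(docs)), key=lambda i: len(query & docs[i]), reverse=True)


# Mine the per-cluster patterns once, then answer a query against them.
patterns = {cid: frequent_terms(docs, min_support=1.0) for cid, docs in clusters.items()}
query = {"cluster", "retrieval"}
best = select_cluster(query, patterns)
ranking = rank_documents(query, clusters[best])
```

The speed-up of cluster-based retrieval comes from the last two steps: the query is compared against a small set of patterns per cluster, and only the documents of the selected cluster are scored.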