This paper presents a new scalable algorithm for discovering closed frequent itemsets, a lossless and condensed representation of all the frequent itemsets that can be mined from a transactional database. Our algorithm exploits a divide-and-conquer approach and a bitwise vertical representation of the database, and adopts a particular visit and partitioning strategy of the search space based on an original theoretical framework, which formalizes the problem of closed itemset mining in detail. The algorithm adopts several optimizations aimed at saving both space and time in computing itemset closures and their supports. In particular, since one of the main problems arising in this type of algorithm is the multiple generation of the same closed itemset, we propose a new effective and memory-efficient pruning technique, which, unlike previous proposals, does not require the whole set of closed patterns mined so far to be kept in main memory. This technique also permits each visited partition of the search space to be mined independently, in any order, and thus also in parallel. The tests conducted on many publicly available datasets show that our algorithm is scalable and outperforms other state-of-the-art algorithms such as CLOSET+ and FP-CLOSE, in some cases by more than one order of magnitude. More importantly, the performance improvements become more and more significant as the support threshold is decreased.
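To make the bitwise vertical representation and the notion of closure concrete, the following is a minimal Python sketch (an illustration, not the algorithm presented in the paper): each item's tidset is stored as an integer bitmask, the support of an itemset is the popcount of the bitwise AND of its items' tidsets, and the closure is the set of items whose tidsets cover that AND. All names and the toy dataset are illustrative.

```python
# Illustrative sketch of a bitwise vertical database and itemset closure,
# not the paper's algorithm: tidsets are Python integers used as bitmasks.

def vertical_bitmaps(transactions):
    """Map each item to a bitmask: bit t is set iff transaction t contains the item."""
    tidsets = {}
    for t, transaction in enumerate(transactions):
        for item in transaction:
            tidsets[item] = tidsets.get(item, 0) | (1 << t)
    return tidsets

def itemset_tidset(itemset, tidsets, all_mask):
    """Bitwise AND of the items' tidsets: transactions containing the whole itemset."""
    mask = all_mask
    for item in itemset:
        mask &= tidsets[item]
    return mask

def closure(itemset, tidsets, all_mask):
    """Closure = every item occurring in all transactions that contain the itemset."""
    mask = itemset_tidset(itemset, tidsets, all_mask)
    return {item for item, tids in tidsets.items() if tids & mask == mask}

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "b", "c", "d"}]
tidsets = vertical_bitmaps(transactions)
all_mask = (1 << len(transactions)) - 1
print(closure({"b"}, tidsets, all_mask))                              # {'a', 'b', 'c'}
print(bin(itemset_tidset({"a", "c"}, tidsets, all_mask)).count("1"))  # support = 3
```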
The research challenge addressed in this paper is to devise effective techniques for identifying task-based sessions, i.e., sets of possibly non-contiguous queries issued by the user of a Web search engine for carrying out a given task. In order to evaluate and compare different approaches, we built, by means of a manual labeling process, a ground truth in which the queries of a given query log have been grouped into tasks. Our analysis of this ground truth shows that users tend to perform more than one task at the same time, since about 75% of the submitted queries involve a multi-tasking activity. We formally define the Task-based Session Discovery Problem (TSDP) as the problem of best approximating the manually annotated tasks, and we propose several variants of well-known clustering algorithms, as well as a novel efficient heuristic algorithm, specifically tuned for solving the TSDP. These algorithms also exploit the collaborative knowledge collected by Wiktionary and Wikipedia for detecting query pairs that are not similar from a lexical content point of view, but are actually semantically related. The proposed algorithms have been evaluated on the above ground truth and are shown to perform better than state-of-the-art approaches, because they effectively take into account the multi-tasking behavior of users.
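As an illustration of the kind of query-pair similarity such clustering algorithms rely on, the sketch below combines a lexical score (Jaccard over character trigrams) with a semantic score, e.g. one derived from Wiktionary/Wikipedia-based expansions, through a convex combination and a threshold. The weights, the threshold, and the `semantic_similarity` placeholder are assumptions for illustration, not the tuned function proposed in the paper.

```python
# Hedged sketch of a task-oriented query-pair similarity: a convex combination of a
# lexical and a semantic score. Weights, threshold, and the semantic component are
# illustrative assumptions, not the function proposed in the paper.

def char_trigrams(query):
    q = query.lower()
    return {q[i:i + 3] for i in range(len(q) - 2)} or {q}

def lexical_similarity(q1, q2):
    """Jaccard coefficient over character trigrams."""
    a, b = char_trigrams(q1), char_trigrams(q2)
    return len(a & b) / len(a | b)

def semantic_similarity(q1, q2):
    """Placeholder for a wiki-based semantic score in [0, 1] (assumed available)."""
    return 0.0

def same_task(q1, q2, alpha=0.5, threshold=0.3):
    """Decide whether two queries belong to the same task."""
    score = alpha * lexical_similarity(q1, q2) + (1 - alpha) * semantic_similarity(q1, q2)
    return score >= threshold

print(same_task("cheap rome flights", "flights to rome"))  # depends on weights/threshold
```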
Although Web search engines still answer user queries with lists of ten blue links to webpages, people are increasingly issuing queries to accomplish their daily tasks (e.g., finding a recipe, booking a flight, reading online news, etc.). In this work, we propose a two-step methodology for discovering tasks that users try to perform through search engines. First, we identify user tasks from individual user sessions stored in search engine query logs. In our vision, a user task is a set of possibly noncontiguous queries (within a user search session), which refer to the same need. Second, we discover collective tasks by aggregating similar user tasks, possibly performed by distinct users. To discover user tasks, we propose query similarity functions based on unsupervised and supervised learning approaches. We present a set of query clustering methods that exploit these functions in order to detect user tasks. All the proposed solutions were evaluated on a manually-built ground truth, and two of them performed better than state-of-the-art approaches. To detect collective tasks, we propose four methods that cluster previously discovered user tasks, which in turn are represented by the bag-of-words extracted from their composing queries. These solutions were also evaluated on another manually-built ground truth.
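To illustrate the second step of the methodology above, the sketch below (a simplification under stated assumptions, not one of the paper's four methods) represents each previously discovered user task as a bag of words drawn from its queries and greedily merges tasks whose cosine similarity exceeds a threshold into collective tasks.

```python
# Simplified sketch of aggregating user tasks into collective tasks via bag-of-words
# cosine similarity and greedy merging; threshold and representation are illustrative.
from collections import Counter
from math import sqrt

def bag_of_words(queries):
    return Counter(word for q in queries for word in q.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def collective_tasks(user_tasks, threshold=0.4):
    """Greedily assign each user task (a list of queries) to the first similar cluster."""
    clusters = []  # each cluster: (representative bag, list of member tasks)
    for task in user_tasks:
        bag = bag_of_words(task)
        for rep, members in clusters:
            if cosine(rep, bag) >= threshold:
                rep.update(bag)
                members.append(task)
                break
        else:
            clusters.append((bag, [task]))
    return [members for _, members in clusters]

tasks = [["rome flights"], ["cheap flights to rome"], ["pasta carbonara recipe"]]
print(collective_tasks(tasks))  # the two flight tasks merge; the recipe task stays apart
```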
This paper presents a parallel programming methodology that ensures easy programming, efficiency, and portability of programs to different machines belonging to the class of general-purpose, distributed-memory, MIMD architectures. The methodology is based on the definition of a new, high-level, explicitly parallel language, called P3L, and of a set of static tools that automatically adapt the program features for each target architecture. P3L does not require programmers to specify process activations, the actual parallelism degree, scheduling, or interprocess communications, i.e., all those features that need to be adjusted to harness each specific target machine. Parallelism is, on the other hand, expressed in a structured and qualitative way, by hierarchical composition of a restricted set of language constructs, corresponding to those forms of parallelism that are frequently encountered in parallel applications and that can efficiently be implemented. The efficient portability of P3L applications is guaranteed by the compiler along with the novel structure of the support. The compiler automatically adapts the program features for each specific architecture, accessing the costs (in terms of performance) of the low-level mechanisms exported by the architecture itself. In our methodology, these costs, along with other features of the architecture, are viewed through an abstract machine, whose mechanism interface is used by the compiler to produce the final object code.
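The restricted constructs mentioned above correspond to what are now commonly called algorithmic skeletons (e.g., task farms and pipelines). The Python sketch below only illustrates the idea of composing such skeletons hierarchically; it is not P3L syntax, and the skeleton names and the multiprocessing-based farm are assumptions made for illustration.

```python
# Conceptual sketch of skeleton-style structured parallelism (not P3L syntax):
# a "farm" applies a worker function to a stream of items in parallel, and
# skeletons compose like stages of a pipeline.
from multiprocessing import Pool

def farm(worker, items, degree=4):
    """Task-farm skeleton: apply `worker` to each item using `degree` parallel workers."""
    with Pool(processes=degree) as pool:
        return pool.map(worker, items)

def pipeline(stages, items):
    """Pipeline skeleton: feed the whole stream through each stage in order."""
    for stage in stages:
        items = stage(items)
    return items

def square(x):
    return x * x

if __name__ == "__main__":
    # Compose skeletons: a farm of `square` workers followed by a sequential filter stage.
    result = pipeline([lambda xs: farm(square, xs),
                       lambda xs: [x for x in xs if x % 2 == 0]],
                      range(10))
    print(result)  # [0, 4, 16, 36, 64]
```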
Learning-to-Rank models based on additive ensembles of regression trees have been proven to be very effective for scoring query results returned by large-scale Web search engines. Unfortunately, the computational cost of scoring thousands of candidate documents by traversing large ensembles of trees is high. Thus, several works have investigated solutions aimed at improving the efficiency of document scoring by exploiting advanced features of modern CPUs and memory hierarchies. In this article, we present QuickScorer, a new algorithm that adopts a novel cache-efficient representation of a given tree ensemble, performs an interleaved traversal by means of fast bitwise operations, and supports ensembles of oblivious trees. An extensive and detailed test assessment is conducted on two standard Learning-to-Rank datasets and on a novel very large dataset we made publicly available for conducting significant efficiency tests. The experiments show unprecedented speedups over the best state-of-the-art baselines, ranging from 1.9× to 6.6×. The analysis of low-level profiling traces shows that QuickScorer's efficiency is due to its cache-aware approach, in terms of both data layout and access patterns, and to a control flow that entails very low branch misprediction rates.
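The bitvector mechanism behind this interleaved traversal can be illustrated with a single-tree Python sketch: every node whose test evaluates to false ANDs a precomputed mask that clears the leaves of its left subtree, and the exit leaf is the leftmost surviving bit. The toy tree, masks, and leaf values below are illustrative; the actual QuickScorer processes the conditions of the whole ensemble grouped by feature and uses a cache-aware array layout not shown here.

```python
# Minimal single-tree illustration of the QuickScorer bitvector idea (toy data,
# not the cache-aware ensemble layout): evaluate all node tests without traversal,
# AND the masks of the false nodes, and take the leftmost surviving leaf.

# Toy tree over features x[0], x[1]; leaves numbered 0..3 left to right:
#         n0: x[0] <= 5
#        /             \
#   n1: x[1] <= 2    n2: x[1] <= 7
#    /    \            /    \
#   l0    l1          l2    l3
LEAF_VALUES = [0.1, 0.3, 0.7, 0.9]
# (feature, threshold, mask): the mask clears the leaves of the node's left subtree,
# i.e. the leaves that become unreachable when the node's test is false.
NODES = [
    (0, 5.0, 0b1100),  # n0 false -> leaves 0,1 impossible
    (1, 2.0, 0b1110),  # n1 false -> leaf 0 impossible
    (1, 7.0, 0b1011),  # n2 false -> leaf 2 impossible
]

def score(x):
    leaf_mask = 0b1111                      # all leaves initially possible
    for feature, threshold, mask in NODES:  # evaluate every test, no pointer chasing
        if x[feature] > threshold:          # test is false
            leaf_mask &= mask
    exit_leaf = (leaf_mask & -leaf_mask).bit_length() - 1  # leftmost (lowest-index) set bit
    return LEAF_VALUES[exit_leaf]

print(score([7.0, 1.0]))  # 0.7: right at the root, left at n2 -> leaf 2
print(score([3.0, 9.0]))  # 0.3: left at the root, right at n1 -> leaf 1
```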
In this paper we present a formal framework for modelling a trajectory data warehouse (TDW), namely a data warehouse aimed at storing aggregate information on trajectories of moving objects, which also offers visual OLAP operations for data analysis. The data warehouse model includes both temporal and spatial dimensions, and it is flexible and general enough to deal with objects that are either completely free or constrained in their movements (e.g., they move along a road network). In particular, the spatial dimension and the associated concept hierarchy reflect the structure of the environment in which the objects travel. Moreover, we cope with some issues related to the efficient computation of aggregate measures, as needed for implementing roll-up operations. The TDW and its visual interface allow one to investigate the behaviour of objects inside a given area as well as the movements of objects between areas in the same neighbourhood. A user can easily navigate the aggregate measures obtained from OLAP queries at different granularities, and get overall views in time and in space of the measures, as well as a focused view on specific measures, spatial areas, or temporal intervals. We discuss two application scenarios of our TDW, namely road traffic and vessel movement analysis, for which we built prototype systems. They mainly differ in the kind of information available for the moving objects under observation and their movement constraints.
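A roll-up over such a spatio-temporal base can be illustrated with a small sketch (an assumption about the aggregate layout, not the prototypes' implementation): base-level values of an additive measure keyed by (spatial cell, hour) are summed into coarser (zone, day) cells by following the concept hierarchies. Note that non-additive measures, such as the number of distinct trajectories crossing a cell, cannot be rolled up by simple summation; that is the kind of aggregation issue the abstract alludes to.

```python
# Illustrative roll-up of an additive spatio-temporal measure (e.g., number of
# observed positions), not the prototypes' implementation: base cells (cell_id, hour)
# are aggregated into coarser cells (zone_id, day) via the concept hierarchies.
from collections import defaultdict

# Hypothetical spatial concept hierarchy: fine-grained cells roll up to zones.
CELL_TO_ZONE = {"c1": "downtown", "c2": "downtown", "c3": "harbour"}

def roll_up(base_counts):
    """base_counts: {(cell_id, hour): value} -> {(zone_id, day): value}."""
    coarse = defaultdict(int)
    for (cell, hour), value in base_counts.items():
        day = hour // 24  # temporal hierarchy: hours roll up to days
        coarse[(CELL_TO_ZONE[cell], day)] += value
    return dict(coarse)

base = {("c1", 8): 12, ("c2", 9): 7, ("c3", 30): 4}
print(roll_up(base))  # {('downtown', 0): 19, ('harbour', 1): 4}
```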
Entity Linking is the task of detecting, in text documents, relevant mentions of entities of a given knowledge base. To this end, entity-linking algorithms use several signals and features extracted from the input text or from the knowledge base. The most important of such features is entity relatedness. Indeed, we argue that these algorithms benefit from maximizing the relatedness among the relevant entities selected for annotation, since this minimizes errors in disambiguating entity mentions. The definition of an effective relatedness function is thus a crucial point in any entity-linking algorithm. In this paper we address the problem of learning high-quality entity relatedness functions. First, we formalize the problem of learning entity relatedness as a learning-to-rank problem. Then, we propose a methodology to create reference datasets on the basis of manually annotated data. Finally, we show that our machine-learned entity relatedness function performs better than other relatedness functions previously proposed and, more importantly, improves the overall performance of different state-of-the-art entity-linking algorithms.
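For concreteness, one classical relatedness function that learned models of this kind are typically compared against, and that can serve as one feature among many, is the Milne-Witten link-based measure; the sketch below computes it from two entities' in-link sets. Treating it as an input feature of a learning-to-rank model is an illustrative framing, not the paper's exact feature set, and the toy in-link sets are assumptions.

```python
# Sketch of the Milne-Witten link-based relatedness, a classical function that a
# machine-learned relatedness model can consume as one feature; the entity in-link
# sets and the total number of entities are illustrative toy values.
from math import log

def milne_witten(inlinks_a, inlinks_b, total_entities):
    """Relatedness in [0, 1] derived from the overlap of the two entities' in-link sets."""
    common = len(inlinks_a & inlinks_b)
    if common == 0:
        return 0.0
    num = log(max(len(inlinks_a), len(inlinks_b))) - log(common)
    den = log(total_entities) - log(min(len(inlinks_a), len(inlinks_b)))
    return max(0.0, 1.0 - num / den)

a = {1, 2, 3, 4}  # ids of pages linking to entity A
b = {3, 4, 5}     # ids of pages linking to entity B
print(milne_witten(a, b, total_entities=1_000_000))  # close to 1.0: large overlap
```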