Abstract. Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Previous work in text mining focused at the word or the tag level. This paper presents an approach to performing text mining at the term level. The mining process starts by preprocessing the document collection and extracting terms from the documents. Each document is then represented by a set of terms and annotations characterizing the document. Terms and additional higher-level entities are then organized in a hierarchical taxonomy. In this paper we will describe the Term Extraction module of the Document Explorer system, and provide experimental evaluation performed on a set of 52,000 documents published by Reuters in the years 1995-1996.
In the general framework of knowledge discovery, Data Mining techniques are usually dedicated to information extraction from structured databases. Text Mining techniques, on the other hand, are dedicated to information extraction from unstructured textual data and Natural Language Processing (NLP) can then be seen as an interesting tool for the enhancement of information extraction procedures. In this paper, we present two examples of Text Mining tasks, association extraction and prototypical document extraction, along with several related NLP techniques.
The objective of this document is to present a rapid dialogue prototyping methodology developed at the Artificial Intelligent Laboratory -Ecole Polytechnique Federale de Lausanne. Concretely, the rapid dialogue prototyping methodology is decomposed into 5 consecutive main steps: (1) producing the task model; (2) deriving the initial dialogue model; (3) using a Wizard-of-Oz experiment to instantiate the initial dialogue model; (4) using an internal field test to refine the dialogue model; and (5) using an external field test to evaluate the final dialogue model.
In this paper, we present a query-driven indexing/retrieval strategy for efficient full text retrieval from large document collections distributed within a structured P2P network. Our indexing strategy is based on two important properties: (1) the generated distributed index stores posting lists for carefully chosen indexing term combinations, and (2) the posting lists containing too many document references are truncated to a bounded number of their top-ranked elements. These two properties guarantee acceptable storage and bandwidth requirements, essentially because the number of indexing term combinations remains scalable and the transmitted posting lists never exceed a constant size. However, as the number of generated term combinations can still become quite large, we also use term statistics extracted from available query logs to index only such combinations that are frequently present in user queries. Thus, by avoiding the generation of superfluous indexing term combinations, we achieve an additional substantial reduction in bandwidth and storage consumption. As a result, the generated distributed index corresponds to a constantly evolving query-driven indexing structure that efficiently follows current information needs of the users. More precisely, our theoretical analysis and experimental results indicate that, at the price of a marginal loss in retrieval quality for rare queries, the generated index size and network traffic remain manageable even for web-size document collections. Furthermore, our experiments show that at the same time the achieved retrieval quality is fully comparable to the one obtained with a state-of-the-art centralized query engine.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.