Itemsets provide local descriptions of the data. This work proposes to use itemsets as a basic means for classification as well. To enable this, the concept of the class support sup_i of an itemset is introduced, i.e., how many times an itemset occurs when a specific class c_i is present. Class supports of frequent itemsets are computed in the training phase. Upon arrival of a new case to be classified, some of the generated itemsets are selected and their class supports sup_i are used to compute the probability that the case belongs to each class c_i. The result is the class c_i with the highest such probability. We show that selecting and combining many long itemsets that provide new evidence (interesting itemsets) is an effective strategy for computing the class probabilities. The proposed classification technique is called Large Bayes, as it reduces to the Naive Bayes classifier when all selected itemsets are of size one. Experimental results on a large number of benchmark data sets show that Large Bayes consistently outperforms the widely used Naive Bayes classifier. In many cases, Large Bayes is also superior to other state-of-the-art classification methods such as C4.5, CBA (a recently proposed rule-based classifier built from association rules), and TAN (a Bayesian network extension of Naive Bayes).
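The training and classification steps described above can be sketched as follows. This is a minimal illustration, not the paper's full Large Bayes algorithm: it counts class supports for itemsets up to a fixed length, but combines only the singleton itemsets at classification time, which is exactly the degenerate case where Large Bayes reduces to Naive Bayes. All function names and the Laplace smoothing choice are the sketch's own.

```python
from collections import Counter, defaultdict
from itertools import combinations

def train(cases, classes, max_len=2):
    """Count class supports sup_i(I): occurrences of itemset I in cases of class c_i."""
    sup = defaultdict(Counter)            # itemset -> {class: count}
    class_count = Counter(classes)
    for items, c in zip(cases, classes):
        for k in range(1, max_len + 1):
            for itemset in combinations(sorted(items), k):
                sup[itemset][c] += 1
    return sup, class_count

def classify(case, sup, class_count):
    """Naive-Bayes-style combination using only singleton itemsets."""
    best, best_p = None, -1.0
    n = sum(class_count.values())
    for c, nc in class_count.items():
        p = nc / n                        # prior P(c_i)
        for item in case:
            # Laplace-smoothed estimate of P(item | c_i) from class supports
            p *= (sup[(item,)][c] + 1) / (nc + 2)
        if p > best_p:
            best, best_p = c, p
    return best
```

The full method differs in how it selects which (longer) itemsets to combine; the sketch only shows where the class supports come from and how they turn into class probabilities.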
Data mining can be described as "making better use of data". As every human being is increasingly faced with unmanageable amounts of data, data mining or knowledge discovery affects all of us, and it is therefore recognized as one of the key research areas. Ideally, we would like to develop techniques for "making better use of any kind of data for any purpose". However, we argue that this goal is still too demanding. It can be more promising to develop techniques for specific data and with a specific goal in mind. In this paper, we describe such an application-driven data mining system. Our aim is to predict stock markets using information contained in articles published on the Web. Mostly textual articles appearing in leading and influential financial newspapers are taken as input. From those articles, the daily closing values of major stock market indices in Asia, Europe and America are predicted. Textual statements contain not only the effect (e.g., the stocks plummet) but also why it happened (e.g., because of weakness in the dollar and consequently a weakening of the treasury bonds). Exploiting textual information in addition to numeric time series data increases the quality of the input, so improved predictions are expected. The forecasts are available in real time via www.cs.ust.hk/~beat/Predict daily at 7:45 am Hong Kong time. Hence all predictions are ready before Tokyo, Hong Kong and Singapore, the major Asian markets, start trading. The system's accuracy for this tremendously difficult but also extremely challenging application is highly promising.
We predict stock markets using information contained in articles published on the Web. Mostly textual articles appearing in the leading and most influential financial newspapers are taken as input. From those articles, the daily closing values of major stock market indices in Asia, Europe and America are predicted. Textual statements contain not only the effect (e.g., stocks down) but also the possible causes of the event (e.g., stocks down because of weakness in the dollar and consequently a weakening of the treasury bonds). Exploiting textual information therefore increases the quality of the input. The forecasts are available in real time via www.cs.ust.hk/~beat/Predict daily at 7:45 am Hong Kong time. Hence all predictions are available before the major Asian markets, Tokyo, Hong Kong and Singapore, start trading. Several techniques, such as rule-based methods, the k-NN algorithm and neural nets, have been employed to produce the forecasts, and these techniques are compared with one another. A trading strategy based on the system's forecasts is suggested. This strategy is shown to potentially outperform stock fund managers. This suggests that it will be extremely difficult to further improve the system's accuracy; the performance is very close to what can be expected in the best case from a system or even from human beings.
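Of the forecasting techniques the abstract lists, k-NN is the simplest to illustrate. The sketch below is a hypothetical, heavily simplified version of nearest-neighbour forecasting (all names and the feature representation are assumptions, not the system's actual design): it predicts the next move of an index by averaging what followed the k most similar past feature vectors.

```python
def knn_forecast(history, query, k=3):
    """Predict the next-day return as the average outcome of the k past
    feature vectors closest (in squared Euclidean distance) to today's.
    history: list of (features, next_day_return); query: today's features."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(history, key=lambda fv: dist(fv[0], query))[:k]
    return sum(r for _, r in nearest) / len(nearest)
```

In the real system the features would be derived from article text and time series, not hand-made tuples.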
Large Bayes (LB) is a recently introduced classifier built from frequent and interesting itemsets. LB uses itemsets to create context-specific probabilistic models of the data and estimate the conditional probability P(c_i|A) of each class c_i given a case A. In this paper we use chi-square tests to address several drawbacks of the originally proposed interestingness metric, namely: (i) the inability to capture certain genuinely interesting patterns, (ii) the need for a user-defined and data-dependent interestingness threshold, and (iii) the need to set a minimum support threshold. We also introduce some pruning criteria which allow for a trade-off between complexity and speed on one side and classification accuracy on the other. Our experimental results show that the modified LB outperforms the original LB, Naïve Bayes, C4.5 and TAN.
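A chi-square interestingness test of the kind the abstract refers to can be sketched as a standard independence test on a 2x2 contingency table relating an itemset and a class. This is a generic illustration of the statistic, not the paper's exact criterion; the count names are assumptions.

```python
def chi_square(n_ic, n_i, n_c, n):
    """Chi-square statistic for independence between itemset I and class c.
    n_ic: cases containing I with class c; n_i: cases containing I;
    n_c: cases of class c; n: total number of cases."""
    observed = [
        [n_ic, n_i - n_ic],                    # I present:  class c / other classes
        [n_c - n_ic, n - n_i - n_c + n_ic],    # I absent:   class c / other classes
    ]
    row = [sum(r) for r in observed]
    col = [sum(c) for c in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n     # count expected under independence
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2
```

A large statistic (e.g., above 3.84 for one degree of freedom at the 5% level) signals that the itemset carries information about the class, which removes the need for a user-defined interestingness threshold.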
Two main topics are addressed. First, an algebraic approach is presented to define a general notion of expressive power. Heterogeneous algebras represent information systems, and morphisms represent the correspondences between the instances of databases, between answers, and between queries. An important feature of this new notion of expressive power is that query languages of different types can be compared with respect to their expressive power. In the case of relational query languages, the new notion is shown to be equivalent to the notion used by Chandra and Harel. In the case of nonrelational query languages, its versatility is demonstrated by comparing the fixpoint query languages with an object-oriented query language called FQL. The expressive power of the Functional Query Language FQL is the second main topic of this paper. The specifications of FQL functions can be recursive or even mutually recursive. FQL has a fixpoint semantics based on a complete lattice consisting of bag functions. The query language FQL is shown to be more expressive than the fixpoint query languages. This result implies that FQL is also more expressive than Datalog with stratified negation. Examples of recursive FQL functions are given that determine the ancestors of persons and the bill of materials.
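The "ancestors" example mentioned at the end is the classic recursive query that plain relational algebra cannot express. A minimal sketch of its least-fixpoint computation (in Python rather than FQL, with hypothetical names) makes the idea concrete: starting from a person's parents, new ancestors are added until nothing changes.

```python
def ancestors(parent_of, person):
    """Least fixpoint of the ancestor relation for one person.
    parent_of: dict mapping each child to the set of its parents."""
    result = set()
    frontier = set(parent_of.get(person, set()))
    while frontier:                       # iterate until no new ancestors appear
        result |= frontier
        frontier = {p for a in frontier
                      for p in parent_of.get(a, set())} - result
    return result
```

In FQL the same query would be written as a recursive function specification; the fixpoint semantics over bag functions guarantees that such a recursion has a well-defined meaning.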
We define a new fixpoint semantics for rule-based reasoning in the presence of weighted information. The semantics is illustrated on a real-world application requiring such reasoning. Optimizations and approximations of the semantics are presented so as to make it amenable to very large scale real-world applications. We finally prove that the semantics is probabilistic and reduces to the usual fixpoint semantics of stratified Datalog if all information is certain. We implemented various knowledge discovery systems which automatically generate such probabilistic decision rules. In collaboration with a bank in Hong Kong, we use one such system to forecast currency exchange rates.
Index Terms: axiomatic probability theory, data mining, incomplete information, knowledge discovery in databases, query optimization and approximation, stratified Datalog.
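The flavour of a weighted fixpoint semantics can be conveyed with a small sketch. This is not the paper's semantics: it is a simplified, hypothetical variant in which a rule's conclusion gets its confidence times the product of its body weights, and alternative derivations are combined with max (a simple monotone choice). All names are illustrative.

```python
def weighted_fixpoint(facts, rules, eps=1e-9):
    """Iterate weighted rules over weighted facts until a fixpoint is reached.
    facts: {atom: probability}
    rules: list of (head, [body atoms], confidence)"""
    db = dict(facts)
    changed = True
    while changed:
        changed = False
        for head, body, conf in rules:
            if all(a in db for a in body):
                p = conf
                for a in body:
                    p *= db[a]            # derivation weight for this rule firing
                if p > db.get(head, 0.0) + eps:
                    db[head] = p          # keep the strongest derivation seen
                    changed = True
    return db
```

When every fact and rule has weight 1, the loop degenerates into the usual bottom-up evaluation of Datalog, mirroring the reduction to stratified Datalog stated in the abstract (negation is omitted here for brevity).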