Abstract

The paper presents the history and present state of the GUHA method, its theoretical foundations, and its relation to and meaning for data mining.

A survey of development of the GUHA method

"GUHA" is the acronym for General Unary Hypotheses Automaton. The idea of the method is: given data, let the computer generate all (or as many as possible) interesting hypotheses of a given logical form that are supported by the data. This idea was elaborated by M. Chytil and P. Hájek in the mid-sixties of the last century, the first paper in English being [16]. The approach was as follows: Data to be processed form a rectangular matrix of zeros and ones, rows corresponding to objects and columns to attributes (properties). Let P_1, ..., P_n be names of the attributes. For each attribute P_i, ¬P_i is the name of its negation. An elementary conjunction of length k (1 ≤ k ≤ n) is a conjunction of k literals in which each predicate occurs at most once, e.g. ¬P_3, P_1 & ¬P_3 & P_7; similarly an elementary disjunction (e.g. P_1 ∨ ¬P_3 ∨ P_7). An object satisfies an elementary conjunction if it satisfies all its members; it satisfies an elementary disjunction if it satisfies at least one of its members.

Let 0 < p ≤ 1. A formula A ⇒_p S, where A is an elementary conjunction (antecedent) and S is an elementary disjunction (succedent), is true in the data if at least 100p percent of the objects satisfying A satisfy S, i.e. a/r ≥ p, where r is the number of objects satisfying A and a is the number of objects satisfying both A and S. The antecedent A is t-good (where t is a natural number) if at least t objects satisfy it. The version of GUHA described in [16] systematically generates "strongest" true formulas A ⇒_p S with a t-good antecedent, notation: A ⇒_{p,t} S. (Details omitted; "strongest" refers to a notion of a logical rule of immediate consequence among formulas of our form.)
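The truth condition for a single formula A ⇒_{p,t} S can be illustrated with a minimal sketch. It is not the procedure of [16]: it only checks one given formula and does not generate or prune the "strongest" ones. The names `Literal`, `holds`, `evaluate_rule` and the toy matrix `data` are hypothetical, and columns are indexed from 0 (so column 0 plays the role of P_1). Treating an antecedent satisfied by no object as unsupported is also an assumption of this sketch.

```python
from typing import Dict, List, Sequence, Tuple

# A literal is (column_index, positive); (2, False) stands for "not P_3"
# under the 0-based column indexing assumed in this sketch.
Literal = Tuple[int, bool]
Row = Sequence[int]  # one object: a row of zeros and ones


def holds(row: Row, literal: Literal) -> bool:
    """True if the object satisfies the literal."""
    index, positive = literal
    return bool(row[index]) == positive


def satisfies_conjunction(row: Row, literals: List[Literal]) -> bool:
    """An object satisfies an elementary conjunction if it satisfies all members."""
    return all(holds(row, lit) for lit in literals)


def satisfies_disjunction(row: Row, literals: List[Literal]) -> bool:
    """An object satisfies an elementary disjunction if it satisfies at least one member."""
    return any(holds(row, lit) for lit in literals)


def evaluate_rule(
    data: Sequence[Row],
    antecedent: List[Literal],
    succedent: List[Literal],
    p: float,
    t: int,
) -> Dict[str, object]:
    """Check A =>_{p,t} S: the antecedent is t-good (r >= t) and a/r >= p."""
    # r: number of objects satisfying A
    r = sum(1 for row in data if satisfies_conjunction(row, antecedent))
    # a: number of objects satisfying both A and S
    a = sum(
        1
        for row in data
        if satisfies_conjunction(row, antecedent)
        and satisfies_disjunction(row, succedent)
    )
    t_good = r >= t
    # Assumption: an antecedent satisfied by no object counts as unsupported.
    frequency_ok = r > 0 and a / r >= p
    return {"r": r, "a": a, "t_good": t_good, "true": t_good and frequency_ok}


# Toy data: 5 objects, 3 attributes (columns 0, 1, 2 stand for P_1, P_2, P_3).
data = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
]
antecedent = [(0, True), (1, False)]  # P_1 & not P_2
succedent = [(2, True)]               # P_3 (a disjunction of length 1)

result = evaluate_rule(data, antecedent, succedent, p=0.6, t=2)
print(result)
```

In this toy matrix three objects satisfy the antecedent and two of them satisfy the succedent, so the formula holds for p = 0.6 and t = 2 but fails if p is raised to 0.7 or t to 4.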
See also [2].

The reader easily recognizes the similarity with the notion of an "association rule with support and confidence" introduced by Agrawal [1] about 25 years later: his A and S are elementary conjunctions containing no negation, p is the confidence and support is t/m, where m is the number of all objects in the data. The formulas found by GUHA (i.e. by a computer program implementing it) have the form "almost all objects satisfying the antecedent satisfy the succedent (and the number of objects satisfying the antecedent is not too small)." It is stressed that the results found are formulas true in the data, and that they are hypotheses from the point of view of a universe from which the data are a sample. The slogan has been "GUHA offers everything interesting" (all hypotheses of the given form true in the data). The first implementation (by I. Havel) ran on a MINSK22 computer.

In 1968 Hájek (in a paper in Czech) suggested a different version based on the statistical Fisher test. Given A and S (now two elementary conjunctions with no predicates in common), let a, b, c, d be the numbers of objects satisfying A & S, A & ¬S, ¬A & S and ¬...
Abstract. Observational calculi were defined in relation to the GUHA method of mechanising hypothesis formation. Formulae of observational calculi correspond to statistical hypothesis tests and to various further assertions verified in the process of data analysis. An example of an application of the GUHA procedure PC-ASSOC is described in the paper. Logical relations among formulae of observational calculi are discussed and some important results concerning deduction rules are shown. Possible applications of the logical properties of formulae corresponding to hypothesis tests in the field of KDD are suggested.

Introduction

The goal of this paper is to introduce special logical calculi as a useful tool for Knowledge Discovery in Databases (KDD). We start with the following two facts:

- Each database can be understood as a formally described data structure. We refer to the fact that particular relations and fields have their own names. Results of data mining methods are assertions dealing with these names. Assertions take various forms, e.g. association rules [1], results of statistical hypothesis tests, or presentation graphs. In any case, each such assertion can be understood as a formal expression concerning a formal data structure.

- Mathematical logic studies formal languages and formal data structures as their models. It defines what it means for a sentence of a formal language to be true or false in a model. A well-known example is the first-order predicate calculus. There are many interesting results concerning universally valid formulas, deduction rules, axiomatization, decidability, etc.; see e.g. [6].

We are going to argue that some of these logical concepts are or could be useful from the point of view of KDD.

a) Observational calculi were defined and studied in relation to the GUHA method of mechanising hypothesis formation [2]. GUHA is a method of exploratory data analysis and it is also successfully used as a method of KDD [10].
The goal of the GUHA method is to offer all interesting facts that follow from the analysed data and are relevant to the given problem. GUHA is realised by GUHA procedures. A GUHA procedure is a computer program whose input consists of the analysed data and a few parameters defining a very large set of potentially interesting hypotheses (usually 10^4 to 10^6 ...

Informally speaking, such a deduction rule says that if a hypothesis is supported by the analysed data, then another hypothesis is also supported by these data, provided that a relatively simple condition relating the two hypotheses is satisfied. It is possible to show that this condition is the same both for simple association rules and for complicated statistical tests. More information is in Sect. 3.

e) The above-mentioned deduction rules are interesting not only from the point of view of GUHA procedures. They could also be useful in the process of interpreting the results of data mining. One of the trends in this area is to arrange results into an analytic-synthetical report structured both according to the analysed problem and to the reader's n...
First experiences with using formalized items of domain knowledge in the process of association rule mining are described. We use association rules that are atomic consequences of items of domain knowledge, together with suitable deduction rules, to filter out uninteresting association rules. The approach is experimentally implemented in the LISp-Miner system.