The ability to extract knowledge from data has been the driving force of Data Mining since its inception, and of statistical modelling long before that. Actionable knowledge often takes the form of patterns, where a set of antecedents can be used to infer a consequent. In this paper we offer a solution to the problem of comparing different sets of patterns. Our solution allows comparisons between sets of patterns derived from different techniques (such as different classification algorithms) or from different samples of data (such as temporal data or data perturbed for privacy reasons). We propose using the Jaccard index to measure the similarity between sets of patterns by converting each pattern into a single element within the set. The measure prioritizes conceptual simplicity, computational simplicity, interpretability, and wide applicability. The results of this measure are compared to prediction accuracy in the context of a real-world data mining scenario.
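The core idea above can be sketched in a few lines: encode each mined pattern (its antecedents plus its consequent) as one hashable element, then compute the Jaccard index of the two resulting sets. This is a minimal sketch, not the paper's implementation; the rule sets and the tuple encoding below are illustrative assumptions.

```python
# Minimal sketch: Jaccard similarity between two sets of mined patterns.
# Each pattern is encoded as a single hashable element, here a
# (frozenset_of_antecedents, consequent) tuple.

def jaccard_index(patterns_a, patterns_b):
    """Return |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(patterns_a), set(patterns_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical rule sets from two classifiers (illustrative values only).
rules_1 = {(frozenset({"age<30", "student=yes"}), "buys=yes"),
           (frozenset({"income=high"}), "buys=yes")}
rules_2 = {(frozenset({"age<30", "student=yes"}), "buys=yes"),
           (frozenset({"income=low"}), "buys=no")}

print(jaccard_index(rules_1, rules_2))  # → 0.3333333333333333
```

Because each pattern collapses to one set element, the comparison is independent of which algorithm or data sample produced the rules, which is what makes the measure widely applicable.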
Recent work on the hole argument in general relativity by Weatherall (2016b) has drawn attention to the neglected concept of (mathematical) models' representational capacities. I argue for several theses about the structure of these capacities, including that they should be understood not as many-to-one relations from models to the world, but in general as many-to-many relations constrained by the models' isomorphisms. I then compare these ideas with a recent argument by Belot (2017) for the claim that some isometries "generate new possibilities" in general relativity. Philosophical orthodoxy, by contrast, denies this. Properly understanding the role of representational capacities, I argue, reveals how Belot's rejection of orthodoxy does not go far enough, and makes better sense of our practices in theorizing about spacetime.

* I would like to thank Gordon Belot, Neil Dewar, Ben Feintzeig, Jim Weatherall, and an anonymous referee for encouraging comments on a previous draft of this essay, which was written in part with support from a Marie Curie Fellowship (PIIF-GA-2013-628533).

1. See also Earman (1989). For reviews of the vast literature on the subject, from a range of philosophical and physical perspectives, including its bearing on broader debates about the metaphysics of spacetime, see Pooley (2013), Stachel (2014), and Norton (2015), and references therein.
One implication of Bell's theorem is that there cannot in general be hidden variable models for quantum mechanics that both are noncontextual and retain the structure of a classical probability space. Thus, some hidden variable programs aim to retain noncontextuality at the cost of using a generalization of the Kolmogorov probability axioms. We generalize a theorem of Feintzeig (2015) to show that such programs are committed to the existence of a finite null cover for some quantum mechanical experiments, i.e., a finite collection of probability zero events whose disjunction exhausts the space of experimental possibilities.
Stephen Hawking, among others, has proposed that the topological stability of a property of spacetime is a necessary condition for it to be physically significant. What counts as stable, however, depends crucially on the choice of topology. Some physicists have thus suggested that one should find a canonical topology, a single "right" topology for every inquiry. While certain such choices might be initially motivated, some little-discussed examples due to Geroch, and some propositions of my own, show that the main candidates (and, to some extent, every possible choice) face the horns of a no-go result. I suggest that instead of trying to decide what the "right" topology is for all problems, one should let the details of particular types of problems guide the choice of an appropriate topology.
If the force on a particle fails to satisfy a Lipschitz condition at a point, it relaxes one of the conditions necessary for a locally unique solution to the particle's equation of motion. I examine the most discussed example of this failure of determinism in classical mechanics, Norton's dome, along with the range of current objections against it. Finding that there are many different conceptions of classical mechanics, appropriate and useful for different purposes, I argue that no single conception is preferred. Instead of arguing for or against determinism, I stress the wide variety of pragmatic considerations that, in a specific context, may lead one usefully and legitimately to adopt one conception over another in which determinism may or may not hold.
Abstract-In the drive for knowledge discovery in a world of ever-growing data collection, it is important that even when a dataset is altered to preserve people's privacy, the information in the dataset retains as much quality as possible. In this context, "quality" refers to the accuracy or usefulness of the information retrievable from a dataset. Defining and measuring the loss of information after meeting privacy requirements proves difficult, however.

Index Terms-Anonymization, data mining, data quality, privacy preserving data mining.

I. INTRODUCTION

Within the Privacy Preserving Data Publishing (PPDP) community, preventing sensitive information about individuals from being inferred is a top priority. This is known as "anonymization". One of the key concepts in PPDP is the trade-off inherent in "anonymizing" data: balancing the increase in security with the decrease in information quality. The majority of previous work has focused on the difficult problem of defining and measuring privacy [1], [2]. This paper explores the other side of the trade-off: information quality. Often, simplistic measures are developed to provide an estimate of the information quality, or statistical techniques are borrowed from the Statistical Disclosure Control (SDC) community. While robust, these evaluation techniques often fail to capture the nuances that can be present when evaluating specific anonymization tasks, such as generalization [1]. In PPDP, the information quality of an anonymized dataset [2] is most often evaluated by measuring the similarity between the anonymized dataset and the original dataset.

[1] "Generalization" refers to making a value vaguer, such as changing all occurrences of "apple" and "banana" to "fruit".
[2] A "dataset" is a two-dimensional table where rows represent independent records (tuples) and columns represent various attributes that describe the records and distinguish them from each other.
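To make the generalization footnote concrete, the following is a toy sketch (not any specific PPDP algorithm) of replacing specific values with a vaguer parent category; the taxonomy and records are invented for illustration.

```python
# Toy illustration of generalization: map specific values in one column
# to a vaguer parent category, e.g. "apple"/"banana" -> "fruit".

TAXONOMY = {"apple": "fruit", "banana": "fruit",
            "carrot": "vegetable", "potato": "vegetable"}

def generalize(records, column, taxonomy):
    """Return a copy of `records` with `column` mapped to its parent
    category; values missing from the taxonomy are left unchanged."""
    return [{**r, column: taxonomy.get(r[column], r[column])}
            for r in records]

data = [{"id": 1, "purchase": "apple"},
        {"id": 2, "purchase": "carrot"}]
print(generalize(data, "purchase", TAXONOMY))
# → [{'id': 1, 'purchase': 'fruit'}, {'id': 2, 'purchase': 'vegetable'}]
```

The generalized column is vaguer (better privacy) but less informative, which is exactly the quality loss the measures discussed below try to quantify.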
If the dataset could be used for a variety of reasons and there is no single purpose in mind, the dataset is evaluated in a way that applies to any scenario; we refer to this as measuring the "dataset quality" or "dataset information loss". These types of techniques are discussed in Section II. Alternatively, if the purpose of the dataset is specific and known, the information quality can be measured with respect to that purpose. Privacy Preserving Data Mining (PPDM; a branch of PPDP) focuses on this type of data, where the quality of the dataset itself is less important than the quality of the data mining results produced from the dataset. Common purposes are classification and clustering [2]. Many patterns in the dataset can be lost after anonymization, even if the dataset itself appears to retain most of its statistical information [6]-[8]. For this reason, information measures have been designed that specifically examine the effect of anonymization on data mining results, and we discuss these in Section III. We call this type of information quality "data mining quality" or ...