We consider datasets consisting of arbitrarily structured entities (e.g., molecules, sequences, graphs, etc.) whose similarity can be assessed with a reproducing kernel (or a family thereof). These entities are assumed to additionally have a set of named attributes (e.g., number_of_atoms, stock_price, etc.). These attributes can be used to classify the structured entities into discrete sets (e.g., ‘number_of_atoms < 3’, ‘stock_price ≤ 100’, etc.) and can effectively serve as Boolean predicates. Our goal is to use this side-information to provide explainable kernel-based clustering. To this end, we propose a method which, among all entity subsets that can be described as a conjunction of the available predicates, finds either a) the optimal cluster within the Reproducing Kernel Hilbert Space, or b) the most anomalous subset within the same space. Our method employs combinatorial optimisation via an adaptation of the Maximum-Mean-Discrepancy measure that captures the above intuition. Finally, we propose a criterion to select the optimal kernel out of a family of kernels in a way that preserves the available side-information. Experiments on several real-world datasets demonstrate the usefulness of our proposed method.
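As a rough illustration of the quantity at the core of this abstract, the sketch below computes a biased estimate of the squared Maximum Mean Discrepancy between a predicate-defined subset and the full dataset under an RBF kernel. The function names and the toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2).
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    # Biased estimate of MMD^2 between the empirical kernel mean
    # embeddings of samples X and Y in the RKHS.
    return (rbf_kernel(X, X, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())

data = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
# A subset selected by some Boolean predicate over the attributes.
mask = np.array([True, True, True, False, False])
score = mmd2(data[mask], data, gamma=1.0)  # large score = anomalous subset
```

A subset whose embedding mean sits far from that of the whole dataset gets a large score; scanning predicate conjunctions for the maximiser is the combinatorial part the abstract refers to.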
Redescription mining aims at finding pairs of queries over data variables that describe roughly the same set of observations. These redescriptions can be used to obtain different views on the same set of entities. So far, redescription mining methods have aimed at listing all redescriptions supported by the data. Such an approach can result in many redundant redescriptions and hinder the user's ability to understand the overall characteristics of the data. In this work, we present an approach to identify and remove the redundant redescriptions, that is, an approach to move from a set of good redescriptions to a good set of redescriptions. We measure the redundancy of a redescription using a framework inspired by the concept of subjective interestingness based on maximum-entropy distributions as proposed by De Bie in 2011. Redescriptions, however, impose specific requirements on the framework, and our solution differs significantly from the existing ones. Notably, our approach can handle disjunctions and conjunctions in the queries, whereas the existing approaches are limited to conjunctive queries. Our framework can also handle Boolean, nominal, or real-valued variables, possibly containing missing values, making it applicable to a wide variety of data sets. Our experiments show that our framework can efficiently reduce the redundancy even on large data sets.
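To make the notion of "two queries describing roughly the same observations" concrete, here is a minimal sketch, assuming toy data and an accuracy measure (Jaccard similarity of the two support sets) commonly used in redescription mining; it is not the redundancy framework of this paper.

```python
def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over two sets of entity ids.
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Toy entities with attributes from two different views.
entities = [
    {"id": 0, "deciduous": True,  "t_july": 18.0},
    {"id": 1, "deciduous": True,  "t_july": 17.5},
    {"id": 2, "deciduous": False, "t_july": 25.0},
    {"id": 3, "deciduous": True,  "t_july": 24.0},
]

# One query over species traits, one over climate variables.
support_q1 = {e["id"] for e in entities if e["deciduous"]}
support_q2 = {e["id"] for e in entities if e["t_july"] < 20.0}

accuracy = jaccard(support_q1, support_q2)  # → 2/3
```

A redescription is the pair (q1, q2); the closer the Jaccard value is to 1, the more nearly the two queries redescribe the same entity set.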
Suppose we are given a discrete-valued time series $X$ of observed events and an equally long binary sequence $Y$ that indicates whether something of interest happened at that particular point in time. We consider the problem of mining serial episodes, sequential patterns allowing for gaps, from $X$ that reliably predict those interesting events. By reliable we mean patterns that not only predict that an interesting event is likely to follow, but in particular that we can also accurately tell how long until that event will happen. In other words, we are specifically interested in patterns with a highly skewed distribution of delays between pattern occurrences and predicted events. As it is unlikely that a single pattern can explain a complex real-world process, we are after the smallest, least redundant set of such patterns that together explain the interesting events well. We formally define this problem in terms of the Minimum Description Length principle, by which we identify the best patterns as those that describe the occurrences of interesting events $Y$ most succinctly given the data over $X$. As neither discovering the optimal explanation of $Y$ given a set of patterns, nor the discovery of the optimal pattern set are problems that allow for straightforward optimization, we break the problem into two and propose effective heuristics for both. Through extensive empirical evaluation, we show that both our main method, Omen, and its fast approximation fOmen, work well in practice and both quantitatively and qualitatively beat the state of the art.
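The two ingredients of this setup, occurrences of a gapped serial episode in $X$ and the delays from those occurrences to the next interesting event in $Y$, can be sketched as follows. This is a simple greedy matcher for illustration only, not Omen itself.

```python
def find_occurrences(pattern, X):
    # Greedy left-to-right matching of a serial episode (ordered symbols,
    # gaps allowed); returns the end index of each non-overlapping match.
    ends, i = [], 0
    while i < len(X):
        j, pos = 0, i
        while pos < len(X) and j < len(pattern):
            if X[pos] == pattern[j]:
                j += 1
            pos += 1
        if j == len(pattern):
            ends.append(pos - 1)
            i = pos          # continue after this occurrence
        else:
            break            # pattern no longer fits in the remainder
    return ends

X = list("abcabxc")
Y = [0, 0, 0, 1, 0, 0, 1]    # 1 marks an interesting event

ends = find_occurrences(list("ac"), X)   # episode 'a' ... 'c' with gaps
delays = []
for e in ends:
    nxt = next((t for t in range(e, len(Y)) if Y[t]), None)
    if nxt is not None:
        delays.append(nxt - e)
```

A pattern whose `delays` list is tightly concentrated (highly skewed toward one value) is exactly the kind of reliable predictor the abstract asks for; MDL then trades the cost of encoding the pattern set against how succinctly these delays describe $Y$.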
Subgroup discovery is a local pattern mining technique to find interpretable descriptions of sub-populations that stand out on a given target variable. That is, these sub-populations are exceptional with regard to the global distribution. In this paper we argue that in many applications, such as scientific discovery, subgroups are only useful if they are additionally representative of the global distribution with regard to a control variable. That is, when the distribution of this control variable is the same, or almost the same, as over the whole data. We formalise this objective function and give an efficient algorithm to compute its tight optimistic estimator for the case of a numeric target and a binary control variable. This enables us to use the branch-and-bound framework to efficiently discover the top-k subgroups that are both exceptional as well as representative. Experimental evaluation on a wide range of datasets shows that with this algorithm we discover meaningful representative patterns and are up to orders of magnitude faster in terms of node evaluations as well as runtime.
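The objective combines two parts: exceptionality of the subgroup on the numeric target, and closeness of the subgroup's control distribution to the global one. The sketch below shows one simple way to combine them; the function name, the size-weighted deviation quality, and the penalty form are assumptions for illustration, not the paper's exact objective.

```python
def subgroup_score(rows, predicate, alpha=1.0):
    # Score = exceptionality on the target, minus a penalty for how far the
    # subgroup's binary-control distribution drifts from the global one.
    sub = [r for r in rows if predicate(r)]
    if not sub or len(sub) == len(rows):
        return 0.0
    mean = lambda xs: sum(xs) / len(xs)
    # Exceptionality: size-weighted shift of the target mean.
    quality = len(sub) * abs(mean([r["target"] for r in sub])
                             - mean([r["target"] for r in rows]))
    # Representativeness: gap between control proportions (binary control).
    gap = abs(mean([r["control"] for r in sub])
              - mean([r["control"] for r in rows]))
    return quality - alpha * len(sub) * gap

rows = [
    {"age": 20, "target": 1.0, "control": 0},
    {"age": 25, "target": 1.2, "control": 1},
    {"age": 60, "target": 3.0, "control": 0},
    {"age": 65, "target": 3.2, "control": 1},
]
score = subgroup_score(rows, lambda r: r["age"] > 50)
```

Here the subgroup ‘age > 50’ shifts the target mean strongly while keeping the control proportion identical to the global one, so it scores as both exceptional and representative. A tight optimistic estimator upper-bounds this score over all refinements of a description, which is what makes branch-and-bound pruning possible.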