On subgroup discovery in numerical domains

Großkreutz, Henrik; Rüping, Stefan

doi:10.1007/s10618-009-0136-3

Cited by 64 publications

(35 citation statements)

References 12 publications

“…Atzmüller and Lemmerich (2009) and Grosskreutz and Rüping (2009) that proposes a range of quality measures for numeric domains (Pieters et al 2010). Our experiments with multiple target attributes are of course based on our previous work on EMM (Leman et al 2008; Duivesteijn et al 2010; van Leeuwen 2010).…”

Section: Related Work (mentioning)

confidence: 99%

Diverse subgroup set discovery

Leeuwen

Knobbe

2012

Data Min Knowl Disc

Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
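The abstract above describes selecting a diverse set of subgroups for the beam rather than the plain top-k. A minimal sketch of one cover-based way to do this follows. It is not the DSSD implementation: the function name `select_diverse`, the discount factor `alpha`, and the multiplicative weighting scheme are assumptions chosen for illustration.

```python
from collections import Counter


def select_diverse(candidates, beam_width, alpha=0.5):
    """Greedily pick up to beam_width candidates, discounting covered rows.

    candidates: list of (name, quality, cover), where cover is a set of row ids.
    alpha: hypothetical discount applied once per previous selection of a row.
    """
    counts = Counter()  # how many selected subgroups already cover each row
    remaining = list(candidates)
    selected = []

    def score(cand):
        _, quality, cover = cand
        if not cover:
            return 0.0
        # Average discount over the rows this candidate covers.
        weight = sum(alpha ** counts[row] for row in cover) / len(cover)
        return quality * weight

    while remaining and len(selected) < beam_width:
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best[0])
        counts.update(best[2])  # each covered row is now discounted further
    return selected


candidates = [
    ("a", 1.0, {1, 2, 3}),
    ("a_variant", 0.95, {1, 2, 3}),  # near-duplicate of "a"
    ("b", 0.6, {7, 8, 9}),
]
print(select_diverse(candidates, beam_width=2))
```

With a plain top-2 selection the near-duplicate `a_variant` would enter the beam. Here its score is halved once `a` is chosen, so the lower-quality but non-overlapping `b` is preferred.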

“…In [7], the discretization happens within the algorithm and relies on a property of the function measuring subgroup quality to merge basic intervals in a bottom-up fashion. Yet, the cut points for the basic intervals are determined as a pre-processing step in a way that is not necessarily optimal with respect to their later use.…”

Section: Data Discretization (mentioning)

confidence: 99%
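The statement above concerns combining precomputed basic intervals into a larger interval guided by the quality function. As a rough illustration only, the sketch below searches all unions of adjacent basic intervals by brute force using prefix sums. It is not the bottom-up merging procedure of the cited paper. The function name `best_interval`, the input layout, and the quality function `n**a * (mean - global_mean)` are assumptions made for this example.

```python
from itertools import accumulate


def best_interval(bins, global_mean, a=0.5):
    """Find the best run of adjacent basic intervals.

    bins: list of (count, target_sum) per basic interval, in attribute order.
    Returns (quality, first_bin, last_bin), or None if no run has any rows.
    """
    # Prefix sums let us get the count and target sum of any run in O(1).
    counts = [0] + list(accumulate(c for c, _ in bins))
    sums = [0.0] + list(accumulate(s for _, s in bins))
    best = None
    for i in range(len(bins)):
        for j in range(i, len(bins)):
            n = counts[j + 1] - counts[i]
            if n == 0:
                continue
            mean = (sums[j + 1] - sums[i]) / n
            quality = n ** a * (mean - global_mean)  # assumed quality function
            if best is None or quality > best[0]:
                best = (quality, i, j)
    return best


# Four basic intervals of 10 rows each; the overall target mean is 80 / 40 = 2.0.
bins = [(10, 10.0), (10, 30.0), (10, 40.0), (10, 0.0)]
quality, first, last = best_interval(bins, global_mean=2.0)
print(first, last, round(quality, 3))
```

Because the cut points of the basic intervals are fixed beforehand, the result can only be as good as those cut points allow, which is the limitation the citation statement points out.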

“…Each of the initial pairs is extended in turn (lines 3-14), selecting at each step the k_i most promising candidates (line 12). A value of 4 for k_i, for example, enables to keep the first step candidates for both operators and both sides.…”

Section: Putting It All Together (mentioning)

confidence: 99%

From Black and White to Full Colour: Extending Redescription Mining Outside the Boolean World

Galbrun

Miettinen

2011

Proceedings of the 2011 SIAM International Conference on Data Mining

Redescription mining is a powerful data analysis tool that is used to find multiple descriptions of the same entities. Consider geographical regions as an example. They can be characterized by the fauna that inhabits them on one hand and by their meteorological conditions on the other hand. Finding such redescriptors, a task known as niche-finding, is of much importance in biology. But current redescription mining methods cannot handle other than Boolean data. This restricts the range of possible applications or makes discretization a prerequisite, entailing a possibly harmful loss of information. In niche-finding, while the fauna can be naturally represented using Boolean presence/absence data, the weather cannot. In this paper, we extend redescription mining to real-valued data using a surprisingly simple and efficient approach. We provide extensive experimental evaluation to study the behaviour of the proposed algorithm. Furthermore, we show the statistical significance of our results using recent innovations on randomization methods.

“…In "Subgroup discovery in numerical domains" Grosskreutz and Rüping (2009) propose an improved way of identifying subgroups for continuous-attribute data with an extensive comparison to existing approaches.…”

Section: Papers Appearing In the Journal Of Data Mining And Knowledge (mentioning)

confidence: 99%