2000
DOI: 10.1145/380995.381030

The UCI KDD archive of large data sets for data mining research and experimentation

Abstract: Advances in data collection and storage have allowed organizations to create massive, complex, and heterogeneous databases, which have stymied traditional methods of data analysis. This has led to the development of new analytical tools that often combine techniques from a variety of fields such as statistics, computer science, and mathematics to extract meaningful knowledge from the data. To support research in this area, UC Irvine has created the UCI Knowledge Discovery in Databases (KDD) Archive…

Cited by 203 publications (116 citation statements) | References 5 publications

“…We run our experiments on 29 benchmark data sets from UCI machine learning repository (Blake and Merz 1998) and KDD archive (Bay 1999). This experimental suite comprises 3 parts.…”
Section: Data (mentioning)
confidence: 99%
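
For context on how UCI benchmarks like those in the excerpt above are typically obtained, here is a minimal, hedged sketch that pulls one classic UCI dataset through scikit-learn's OpenML mirror; the choice of the "adult" dataset and the use of fetch_openml are illustrative assumptions, not the tooling used by the cited study.

```python
# Illustrative sketch only: fetch a classic UCI benchmark ("adult", a.k.a.
# Census Income) via scikit-learn's OpenML mirror. The cited experiments drew
# their 29 data sets directly from the UCI ML repository and the UCI KDD
# archive; this is merely a convenient stand-in for obtaining comparable data.
from sklearn.datasets import fetch_openml

X, y = fetch_openml("adult", version=2, return_X_y=True, as_frame=True)

print(X.shape)           # (number of examples, number of attributes)
print(y.value_counts())  # class distribution of the income label
```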
“…In order to explain how the techniques introduced in this paper can practically improve the efficiency of rule discovery, we do our experiments by applying the new algorithm to 10 databases chosen from the UCI Machine Learning repository [6] and the UCI KDD archives [3]. The databases are described in table 3.…”
Section: Experimental Evaluations (mentioning)
confidence: 99%
“…where D_1(i, y) is given by (1), and a(θ, i) represents the number of examples which are squashed into the leaf i. (c) Data Squashing: construct a novel SF tree from the training examples.…”
Section: …Of the Final Round T and a Classification Model (b) Update… (mentioning)
confidence: 99%
“…We employed the KDD Cup 1999 data set [1], from which we produced several data sets. Since it is difficult to introduce a distance measure of data squashing for a nominal attribute and binary attributes can be misleading in calculating a distance, we deleted such attributes before the experiments.…”
Section: Experimental Condition (mentioning)
confidence: 99%
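
The excerpt above describes deleting nominal and binary attributes from the KDD Cup 1999 data before distance-based data squashing; the sketch below shows one way that filtering could look in pandas. The file name, the pandas-based approach, and the 0/1 test for binary columns are assumptions for illustration, not details from the cited paper.

```python
# Hedged sketch: remove nominal and binary attributes from a KDD Cup 1999
# sample so that distances used in data squashing are computed only over
# continuous features. File name and column handling are assumptions.
import pandas as pd

# A locally downloaded 10%-sample of KDD Cup 1999 (path is hypothetical).
df = pd.read_csv("kddcup.data_10_percent", header=None)

# Nominal attributes show up as object-typed columns (e.g. protocol, service).
nominal_cols = list(df.select_dtypes(include="object").columns)

# Binary attributes: numeric columns whose observed values are only 0 and 1.
binary_cols = [c for c in df.select_dtypes(include="number").columns
               if set(df[c].unique()) <= {0, 1}]

# Keep only continuous attributes before computing distances for squashing.
continuous = df.drop(columns=nominal_cols + binary_cols)
print(continuous.shape)
```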