Data stream clustering

Silva, Jonathan Andrade; Faria, Elaine R.; Barros, Rodrigo C.; Hruschka, Eduardo R.; Carvalho, André C. P. L. F. de; Gama, João

doi:10.1145/2522968.2522981

Cited by 444 publications

(287 citation statements)

References 87 publications

Supporting

Mentioning

241

Contrasting

Unclassified


“…In order to train the Maximum Entropy model with a very limited training dataset, we need to convert attributes that have continuous numeric values into discrete ones. There has been a lot of research done on continuous feature discretization field [27][28][29][30][31][32]. Methods for discretization are broadly classified into Supervised vs. Unsupervised, Global vs. Local, and Static vs.…”

Section: K-means Clustering

mentioning

confidence: 99%

Predicting the Outcome of NBA Playoffs Based on the Maximum Entropy Principle

Cheng

Zhang

Kyebambe

et al. 2016

Entropy
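The excerpt above refers to converting continuous attributes into discrete ones. As a minimal sketch of one unsupervised, global, static scheme (equal-width binning), the following may help; it is not the method used in the cited paper, and the function name `equal_width_bins` and its parameters are assumptions made for illustration.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in [0, n_bins - 1].

    Hypothetical helper: splits the observed range [min, max] into
    n_bins intervals of equal width.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval,
    # so it is clamped into the final bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    print(equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], 4))
```

Equal-width binning ignores class labels, which is what makes it unsupervised; a supervised scheme would instead choose cut points using the target attribute.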


Predicting the outcome of National Basketball Association (NBA) matches poses a challenging problem of interest to the research community as well as the general public. In this article, we formalize the problem of predicting NBA game results as a classification problem and apply the principle of Maximum Entropy to construct an NBA Maximum Entropy (NBAME) model that fits to discrete statistics for NBA games, and then predict the outcomes of NBA playoffs using the model. Our results reveal that the model is able to predict the winning team with 74.4% accuracy, outperforming other classical machine learning algorithms that could only afford a maximum prediction accuracy of 70.6% in the experiments that we performed.


“…Most of the conventional learning techniques assume that there is a static dataset generated by an unknown yet stationary probability distribution, which can be stored and analyzed in multiple steps. Nevertheless, none of the latter assumptions are verifiable in several streaming scenarios and the development of new learners must account for several constraints [1,2,10,21,22,30,33]:…”

Section: Learning From Data Streams

mentioning

confidence: 99%

“…Nonetheless, none of the latter assumptions can be verified in the streaming scenario and the development of algorithms must account for several constraints [2,21,33]. Firstly, instances arrive continuously over time and there is no control over the order that they arrive nor how they should be processed.…”

Section: Concept Drift

mentioning

confidence: 99%

“…Another important trait of kNN refers to the dimensionality of the problem, either in static or streaming scenarios. As discussed in a variety of works [15,33], Euclidian distances fail on representing in effective fashion the distance between points (instances) in a high-dimensional space, phenomenon named "curse of dimensionality".…”

Section: H(Y|X); this can be derived from H(X, Y) = H(X) + H(Y|X)

mentioning

confidence: 99%


On Dynamic Feature Weighting for Feature Drifting Data Streams

Barddal

Gomes

Enembreck

et al. 2016

Machine Learning and Knowledge Discovery in Databases
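The kNN excerpt above mentions the "curse of dimensionality" for Euclidean distances. A rough way to see the effect is to measure how the gap between the nearest and farthest neighbour shrinks, relative to the nearest distance, as the dimension grows. The sketch below assumes uniformly random points in the unit hypercube; the function name `relative_contrast` and its defaults are hypothetical and do not come from the cited works.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions the printed contrast tends to fall as `dim` rises, which suggests why nearest-neighbour rankings become less informative in high-dimensional spaces.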


“…Required information for forming clusters is provided by core micro-clusters and outlier micro-clusters. The major drawback is the computational cost is more [8]. D-Stream is grid based clustering method. In the online phase each data record is mapped to a grid.…”

mentioning

confidence: 99%

Improved Macro-clusters generation using Top-k shared Micro-clusters in Data Streams

Praneetha

2017

IJARCSSE


Nowadays data streams are very large and change quickly. Their uses range from basic scientific applications to critical business and financial ones. In the online phase, the useful information is abstracted from the stream and represented in the form of micro-clusters. In the offline phase, micro-clusters are merged to form macro-clusters. The DBSTREAM technique captures the density between micro-clusters by means of a shared density graph in the online phase. The density data in this graph is then used in reclustering to improve the formation of clusters, but DBSTREAM takes more time in handling corrupted data points. In this paper, an early pruning algorithm is used before pre-processing of the data, and a Bloom filter is used for recognizing the corrupted data. Our experiments on real-time datasets show that this approach improves the efficiency of macro-clusters by 90% and generates more micro-clusters within a short time.

Index terms: Data Stream Clustering, Density based Clustering.

I. INTRODUCTION

Clustering is a standard technique of exploratory data mining, which separates a set of data into several groups (also called clusters) such that items in the same group are more similar to each other, in some sense, than to items in other groups. Data streams are a continuous flow of data whose size has no bounds [2][10]. Many applications produce this type of streaming data, such as GPS data from vehicles, web click stream data, computer network monitoring, and readings from sensors. Data stream clustering is done for better understanding of the data. Clustering algorithms and their parameter settings depend on the individual data sets.

Data stream clustering algorithms process the data quickly to provide timely results, detect whether new clusters should appear or disappear, and identify outliers. Clustering of data streams can be done using grid-based algorithms like D-Stream [1], density-based algorithms like DBSTREAM [2], or partitioning-based algorithms like k-means. The primary goal of this paper is to improve the quality of the final clusters and to reduce the time taken to generate the micro-clusters.

II. RELATED WORK

From the application point of view, one-pass clustering algorithms are not useful, as outdated data makes the cluster quality poor. CluStream is an effective and efficient method that characterizes data streams over different time horizons. The micro-clusters are stored as snapshots in a pyramidal time window [5]. However, it cannot find arbitrarily shaped clusters and cannot handle outliers [6]. The density-based clustering algorithm DBSCAN is used to find clusters of arbitrary shape in large spatial databases with noise, and it requires only one input parameter. It counts the number of data points, estimates density using the eps and MinPts parameters, and identifies the core, border, and noise points [3]. The disadvant...
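The excerpt for this entry notes that grid-based methods such as D-Stream map each incoming record to a grid in the online phase. The sketch below illustrates only that mapping step with plain per-cell counts; it is an assumption-laden simplification, not D-Stream itself, and the names `grid_cell` and `count_cells` and the `cell_size` parameter are hypothetical.

```python
import math
from collections import Counter


def grid_cell(record, cell_size):
    """Return the integer grid coordinates of the cell containing record.

    Hypothetical helper: each dimension is cut into intervals of
    length cell_size, and floor division picks the interval index.
    """
    return tuple(math.floor(x / cell_size) for x in record)


def count_cells(stream, cell_size):
    """Consume records one at a time and count how many fall in each cell."""
    counts = Counter()
    for record in stream:
        counts[grid_cell(record, cell_size)] += 1
    return counts


if __name__ == "__main__":
    stream = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9)]
    print(count_cells(stream, cell_size=0.5))
```

Because each record touches only one counter, the online step needs a single pass and memory proportional to the number of occupied cells; an offline phase could then group adjacent dense cells into clusters.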

