This paper presents an unsupervised approach for feature selection and extraction in mixtures of generalized Dirichlet (GD) distributions. Our method defines a new mixture model that is able to extract independent and non-Gaussian features without loss of accuracy. The proposed model is learned using the Expectation-Maximization algorithm by minimizing the message length of the data set. Experimental results show the merits of the proposed methodology in the categorization of object images.
This paper presents an unsupervised algorithm for learning a finite mixture model from multivariate data. This mixture model is based on the Dirichlet distribution, which offers high flexibility for modeling data. The proposed approach for estimating the parameters of a Dirichlet mixture is based on the maximum likelihood (ML) and Fisher scoring methods. Experimental results are presented for the following applications: estimation of artificial histograms, summarization of image databases for efficient retrieval, and human skin color modeling and its application to skin detection in multimedia databases.
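As a rough illustration of the maximum-likelihood side of this approach, the sketch below fits a single Dirichlet component by ML using a generic optimizer in place of the paper's Fisher scoring method; the synthetic data, sample size, and log-parameterization are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import dirichlet

rng = np.random.default_rng(0)
true_alpha = np.array([2.0, 5.0, 3.0])
X = rng.dirichlet(true_alpha, size=2000)   # points on the probability simplex

def neg_log_lik(log_alpha):
    # log-parameterization keeps the Dirichlet parameters positive
    alpha = np.exp(log_alpha)
    return -dirichlet.logpdf(X.T, alpha).sum()

res = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS")
alpha_hat = np.exp(res.x)                  # ML estimate of the parameters
```

Fisher scoring, as used in the paper, replaces the generic quasi-Newton update with steps based on the Fisher information matrix; the likelihood being maximized is the same.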
We consider the problem of determining the structure of high-dimensional data, without prior knowledge of the number of clusters. Data are represented by a finite mixture model based on the generalized Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. This makes the generalized Dirichlet distribution more practical and useful. An important problem in mixture modeling is the determination of the number of clusters. Indeed, a mixture with too many or too few components may not be appropriate to approximate the true model. Here, we consider the application of the minimum message length (MML) principle to determine the number of clusters. The MML is derived so as to choose the number of clusters in the mixture model which best describes the data. A comparison with other selection criteria is performed. The validation involves synthetic data, real data clustering, and two interesting real applications: classification of web pages, and texture database summarization for efficient retrieval.
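The model-selection loop itself is simple to picture. The sketch below uses scikit-learn's GaussianMixture with BIC as a stand-in criterion, since the paper's MML expression is specific to generalized Dirichlet mixtures; the data, criterion, and component family are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# three well-separated synthetic clusters
X = np.vstack([rng.normal(c, 0.3, (150, 2)) for c in (0.0, 3.0, 6.0)])

scores = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    scores[k] = gm.bic(X)   # MML plays this criterion's role in the paper

best_k = min(scores, key=scores.get)   # number of clusters minimizing the criterion
```

MML, like BIC, trades goodness of fit against model complexity, so the selection loop has the same shape: fit a mixture for each candidate number of components and keep the minimizer.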
This paper deals with a Bayesian analysis of a finite Beta mixture model. We present an approximation method that evaluates the posterior distribution and Bayes estimators by Gibbs sampling, relying on the missing-data structure of the mixture model. Experimental results concern contextual and non-contextual evaluations. The non-contextual evaluation is based on synthetic histograms, while the contextual one models the class-conditional densities of pattern-recognition data sets. The Beta mixture is also applied to estimate the parameters of SAR image histograms.
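A minimal sketch of the two conditional draws that such a Gibbs sampler alternates, with the Beta shape parameters held fixed so that only the missing labels and the mixing weights are sampled (the full sampler would also update the shapes); the data and shapes below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)
shapes = [(2.0, 8.0), (8.0, 2.0)]   # fixed (a, b) per component, for illustration
x = np.concatenate([rng.beta(2, 8, 300), rng.beta(8, 2, 700)])

pi = np.array([0.5, 0.5])
for _ in range(200):                # Gibbs sweeps
    # z-step: sample each missing label from its posterior responsibility
    dens = np.stack([pi[j] * beta.pdf(x, *shapes[j]) for j in range(2)])
    prob = dens / dens.sum(axis=0)
    z = (rng.random(x.size) < prob[1]).astype(int)
    # pi-step: conjugate Dirichlet posterior given the label counts
    counts = np.array([(z == 0).sum(), (z == 1).sum()])
    pi = rng.dirichlet(1.0 + counts)
```

The missing-data structure is exactly what makes both conditionals tractable: given the labels, the weight update is conjugate; given the weights and parameters, the labels are independent categorical draws.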
Clustering is an important technique for dealing with the large-scale data created explosively on the Internet. Most such data are high-dimensional and noisy, which poses great challenges for retrieval, classification, and understanding. No existing approach is "optimal" for large-scale data. For example, DBSCAN requires O(n²) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time, which requires the dimension D to be a relatively small constant for the linear running time to hold. However, we prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n²) algorithm in very high dimensions, where 2^D ≫ n. In this paper, we propose a novel local neighborhood searching technique and apply it to improve DBSCAN, named NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively eliminated. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n log n) on average with the help of an indexing technique, and the best case is O(n) if
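For contrast with the paper's NQ-DBSCAN, a minimal baseline DBSCAN makes the cost concrete: `region_query` below is the full linear scan whose repeated invocation produces the O(n²) behavior that neighborhood-query reduction schemes target. The data and parameters are illustrative assumptions, not the paper's benchmarks.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal baseline DBSCAN; each region_query costs O(n)."""
    n = len(X)
    labels = np.full(n, -1)             # -1 = noise / not yet clustered
    def region_query(i):
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        neigh = region_query(i)
        if neigh.size < min_pts:
            continue                    # noise for now; may later become a border point
        labels[i] = cluster
        seeds = list(neigh)
        while seeds:                    # grow the cluster from its core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
                nj = region_query(j)
                if nj.size >= min_pts:  # j is itself core: expand from it
                    seeds.extend(nj)
        cluster += 1
    return labels

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)),
               rng.normal(5.0, 0.2, (50, 2)),
               [[20.0, 20.0]]])         # two tight blobs plus one outlier
labels = dbscan(X, eps=0.8, min_pts=5)
```

Every point triggers at least one O(n) scan, hence the quadratic worst case; NQ-DBSCAN's contribution is precisely to prune most of these distance computations.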
Anomaly-based intrusion detection systems (IDSs) have been deployed to monitor network activity and to protect systems and Internet of Things (IoT) devices from attacks (or intrusions). The problem with these systems is that they generate a huge number of inappropriate false alarms whenever abnormal activities are detected, and they are not flexible enough for complex environments. The high rate of generated false alarms reduces the performance of an IDS against cyber-attacks, makes the security analyst's tasks particularly difficult, and makes the management of the intrusion detection process computationally expensive. We study here one of the challenging aspects of computer and network security, and we propose to build a detection model for both known and unknown intrusions (anomaly detection) via a novel nonparametric Bayesian model. The design of our framework can easily be extended to suit IoT technology, and notably intelligent smart-city web-based applications. In our method, we learn the patterns of activities (both normal and anomalous) through Bayesian MCMC inference for infinite bounded generalized Gaussian mixture models. Contrary to classic clustering methods, our approach does not require the number of clusters to be specified, takes uncertainty into consideration via the introduction of prior knowledge for the model parameters, and solves problems related to over- and under-fitting. To obtain better clustering performance, the feature weights, the model's parameters, and the number of clusters are estimated simultaneously and automatically. The developed approach was evaluated on popular data sets. The obtained results demonstrate the efficiency of our approach in detecting various attacks.

INDEX TERMS Intrusion detection systems (IDS), anomaly intrusion detection, infinite mixture models, bounded generalized Gaussian models, Bayesian inference, Markov chain Monte Carlo (MCMC).
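The "no fixed number of clusters" property comes from the nonparametric prior. As a toy illustration only (not the paper's bounded generalized Gaussian sampler), the sketch below draws cluster labels from a Chinese-restaurant-process prior, the standard construction behind infinite mixtures, in which new clusters open with probability proportional to a concentration parameter.

```python
import numpy as np

def crp_assignments(n, alpha, rng):
    """Draw labels from a Chinese restaurant process prior:
    point i joins an existing cluster with probability proportional to
    its size, or opens a new cluster with probability proportional to alpha."""
    labels = [0]
    counts = [1]
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)        # a brand-new cluster appears
        else:
            counts[k] += 1
        labels.append(k)
    return np.array(labels)

rng = np.random.default_rng(4)
z = crp_assignments(500, alpha=2.0, rng=rng)
```

In the full model, such prior draws are combined with the likelihood inside the MCMC sweep, so the number of occupied clusters adapts to the data rather than being fixed in advance.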
This paper applies a robust statistical scheme to the problem of unsupervised learning of high-dimensional data. We develop, analyze, and apply a new finite mixture model based on a generalization of the Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. We show that the mathematical properties of this distribution allow high-dimensional modeling without requiring dimensionality reduction and, thus, without a loss of information. This makes the generalized Dirichlet distribution more practical and useful. We propose a hybrid stochastic expectation maximization algorithm (HSEM) to estimate the parameters of the generalized Dirichlet mixture. The algorithm is called stochastic because it contains a step in which the data elements are assigned randomly to components in order to avoid convergence to a saddle point. The adjective "hybrid" is justified by the introduction of a Newton-Raphson step. Moreover, the HSEM algorithm autonomously selects the number of components by the introduction of an agglomerative term. The performance of our method is tested by the classification of several pattern-recognition data sets. The generalized Dirichlet mixture is also applied to the problems of image restoration, image object recognition and texture image database summarization for efficient retrieval. For the texture image summarization problem, results are reported for the Vistex texture image database from the MIT Media Lab.
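The stochastic step is easy to isolate. The sketch below runs a stochastic EM loop on a one-dimensional two-component Gaussian stand-in (not the generalized Dirichlet mixture itself): the E-step's responsibilities are used to draw hard random assignments, and the M-step refits on the sampled partition. All data and starting values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2, 1, 400), rng.normal(2, 1, 600)])

# crude starting guesses for the two components
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibilities under the current parameters
    dens = pi[:, None] * norm.pdf(x, mu[:, None], sigma[:, None])
    resp = dens / dens.sum(axis=0)
    # stochastic step: draw a hard random assignment for every point
    z = (rng.random(x.size) < resp[1]).astype(int)
    # M-step on the sampled partition (skip a component if it empties)
    for j in (0, 1):
        sel = x[z == j]
        if sel.size:
            pi[j] = sel.size / x.size
            mu[j] = sel.mean()
            sigma[j] = sel.std() + 1e-9
```

Randomizing the assignments perturbs the trajectory at every sweep, which is what lets the stochastic step escape saddle points that deterministic EM can get stuck on; the paper's "hybrid" variant additionally applies a Newton-Raphson step to the component parameters.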
This paper proposes an unsupervised algorithm for learning a finite Dirichlet mixture model. An important part of the unsupervised learning problem is determining the number of clusters that best describes the data. We extend the minimum message length (MML) principle to determine the number of clusters in the case of Dirichlet mixtures. Parameter estimation is done by the expectation-maximization algorithm. The resulting method is validated for one-dimensional and multidimensional data. For the one-dimensional data, the experiments concern artificial and real SAR image histograms. The validation for multidimensional data involves synthetic data and two real applications: shadow detection in images and summarization of texture image databases for efficient retrieval. A comparison with results obtained for other selection criteria is provided.