This paper presents an unsupervised approach for feature selection and extraction in mixtures of generalized Dirichlet (GD) distributions. Our method defines a new mixture model that is able to extract independent and non-Gaussian features without loss of accuracy. The proposed model is learned using the Expectation-Maximization algorithm by minimizing the message length of the data set. Experimental results show the merits of the proposed methodology in the categorization of object images.
This paper presents an unsupervised algorithm for learning a finite mixture model from multivariate data. This mixture model is based on the Dirichlet distribution, which offers high flexibility for modeling data. The proposed approach for estimating the parameters of a Dirichlet mixture is based on the maximum likelihood (ML) and Fisher scoring methods. Experimental results are presented for the following applications: estimation of artificial histograms, summarization of image databases for efficient retrieval, and human skin color modeling and its application to skin detection in multimedia databases.
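As a rough illustration of the maximum-likelihood side of this approach, the sketch below fits a single Dirichlet component by ML using a generic optimizer in place of the paper's Fisher scoring method; the synthetic data, sample size, and log-parameterization are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import dirichlet

rng = np.random.default_rng(0)
true_alpha = np.array([2.0, 5.0, 3.0])
X = rng.dirichlet(true_alpha, size=2000)   # points on the probability simplex

def neg_log_lik(log_alpha):
    # log-parameterization keeps the Dirichlet parameters positive
    alpha = np.exp(log_alpha)
    return -dirichlet.logpdf(X.T, alpha).sum()

res = minimize(neg_log_lik, x0=np.zeros(3), method="BFGS")
alpha_hat = np.exp(res.x)                  # ML estimate of the parameters
```

Fisher scoring, as used in the paper, replaces the generic quasi-Newton update with steps based on the Fisher information matrix; the likelihood being maximized is the same.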
We consider the problem of determining the structure of high-dimensional data, without prior knowledge of the number of clusters. Data are represented by a finite mixture model based on the generalized Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. This makes the generalized Dirichlet distribution more practical and useful. An important problem in mixture modeling is the determination of the number of clusters. Indeed, a mixture with too many or too few components may not be appropriate to approximate the true model. Here, we consider the application of the minimum message length (MML) principle to determine the number of clusters. The MML is derived so as to choose the number of clusters in the mixture model which best describes the data. A comparison with other selection criteria is performed. The validation involves synthetic data, real data clustering, and two interesting real applications: classification of web pages, and texture database summarization for efficient retrieval.
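The model-selection loop itself is simple to picture. The sketch below uses scikit-learn's GaussianMixture with BIC as a stand-in criterion, since the paper's MML expression is specific to generalized Dirichlet mixtures; the data, criterion, and component family are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# three well-separated synthetic clusters
X = np.vstack([rng.normal(c, 0.3, (150, 2)) for c in (0.0, 3.0, 6.0)])

scores = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    scores[k] = gm.bic(X)   # MML plays this criterion's role in the paper

best_k = min(scores, key=scores.get)   # number of clusters minimizing the criterion
```

MML, like BIC, trades goodness of fit against model complexity, so the selection loop has the same shape: fit a mixture for each candidate number of components and keep the minimizer.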
This paper deals with a Bayesian analysis of a finite Beta mixture model. We present an approximation method that evaluates the posterior distribution and Bayes estimators by Gibbs sampling, relying on the missing-data structure of the mixture model. Experimental results concern contextual and non-contextual evaluations. The non-contextual evaluation is based on synthetic histograms, while the contextual one models the class-conditional densities of pattern-recognition data sets. The Beta mixture is also applied to estimate the parameters of SAR image histograms.
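A minimal sketch of the two conditional draws that such a Gibbs sampler alternates, with the Beta shape parameters held fixed so that only the missing labels and the mixing weights are sampled (the full sampler would also update the shapes); the data and shapes below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(2)
shapes = [(2.0, 8.0), (8.0, 2.0)]   # fixed (a, b) per component, for illustration
x = np.concatenate([rng.beta(2, 8, 300), rng.beta(8, 2, 700)])

pi = np.array([0.5, 0.5])
for _ in range(200):                # Gibbs sweeps
    # z-step: sample each missing label from its posterior responsibility
    dens = np.stack([pi[j] * beta.pdf(x, *shapes[j]) for j in range(2)])
    prob = dens / dens.sum(axis=0)
    z = (rng.random(x.size) < prob[1]).astype(int)
    # pi-step: conjugate Dirichlet posterior given the label counts
    counts = np.array([(z == 0).sum(), (z == 1).sum()])
    pi = rng.dirichlet(1.0 + counts)
```

The missing-data structure is exactly what makes both conditionals tractable: given the labels, the weight update is conjugate; given the weights and parameters, the labels are independent categorical draws.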
Clustering is an important technique for dealing with the large-scale data created explosively on the Internet. Most such data are high-dimensional and noisy, which poses great challenges for retrieval, classification, and understanding. No existing approach is "optimal" for large-scale data. For example, DBSCAN requires O(n²) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time, which requires the dimension D to be a relatively small constant for the linear running time to hold. However, we prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n²) algorithm in very high dimensions, where 2^D ≫ n. In this paper, we propose a novel local neighborhood searching technique and apply it to improve DBSCAN, named NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively eliminated. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n log n) on average with the help of an indexing technique, and the best case is O(n) if
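For contrast with the paper's NQ-DBSCAN, a minimal baseline DBSCAN makes the cost concrete: `region_query` below is the full linear scan whose repeated invocation produces the O(n²) behavior that neighborhood-query reduction schemes target. The data and parameters are illustrative assumptions, not the paper's benchmarks.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal baseline DBSCAN; each region_query costs O(n)."""
    n = len(X)
    labels = np.full(n, -1)             # -1 = noise / not yet clustered
    def region_query(i):
        return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        neigh = region_query(i)
        if neigh.size < min_pts:
            continue                    # noise for now; may later become a border point
        labels[i] = cluster
        seeds = list(neigh)
        while seeds:                    # grow the cluster from its core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
                nj = region_query(j)
                if nj.size >= min_pts:  # j is itself core: expand from it
                    seeds.extend(nj)
        cluster += 1
    return labels

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)),
               rng.normal(5.0, 0.2, (50, 2)),
               [[20.0, 20.0]]])         # two tight blobs plus one outlier
labels = dbscan(X, eps=0.8, min_pts=5)
```

Every point triggers at least one O(n) scan, hence the quadratic worst case; NQ-DBSCAN's contribution is precisely to prune most of these distance computations.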
Anomaly-based intrusion detection systems (IDSs) have been deployed to monitor network activity and to protect systems and Internet of Things (IoT) devices from attacks (or intrusions). The problem with these systems is that they generate a huge number of inappropriate false alarms whenever abnormal activities are detected, and they are not flexible enough for complex environments. The high rate of generated false alarms reduces the performance of an IDS against cyber-attacks, makes the security analyst's tasks particularly difficult, and makes the management of the intrusion detection process computationally expensive. We study here one of the challenging aspects of computer and network security, and we propose to build a detection model for both known and unknown intrusions (anomaly detection) via a novel nonparametric Bayesian model. The design of our framework can easily be extended to suit IoT technology, and notably intelligent smart-city web-based applications. In our method, we learn the patterns of activities (both normal and anomalous) through Bayesian MCMC inference for infinite bounded generalized Gaussian mixture models. Contrary to classic clustering methods, our approach does not require the number of clusters to be specified, takes uncertainty into consideration via the introduction of prior knowledge for the model parameters, and solves problems related to over- and under-fitting. To obtain better clustering performance, the feature weights, the model's parameters, and the number of clusters are estimated simultaneously and automatically. The developed approach was evaluated on popular data sets. The obtained results demonstrate the efficiency of our approach in detecting various attacks.

INDEX TERMS Intrusion detection systems (IDS), anomaly intrusion detection, infinite mixture models, bounded generalized Gaussian models, Bayesian inference, Markov chain Monte Carlo (MCMC).
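The "no fixed number of clusters" property comes from the nonparametric prior. As a toy illustration only (not the paper's bounded generalized Gaussian sampler), the sketch below draws cluster labels from a Chinese-restaurant-process prior, the standard construction behind infinite mixtures, in which new clusters open with probability proportional to a concentration parameter.

```python
import numpy as np

def crp_assignments(n, alpha, rng):
    """Draw labels from a Chinese restaurant process prior:
    point i joins an existing cluster with probability proportional to
    its size, or opens a new cluster with probability proportional to alpha."""
    labels = [0]
    counts = [1]
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)        # a brand-new cluster appears
        else:
            counts[k] += 1
        labels.append(k)
    return np.array(labels)

rng = np.random.default_rng(4)
z = crp_assignments(500, alpha=2.0, rng=rng)
```

In the full model, such prior draws are combined with the likelihood inside the MCMC sweep, so the number of occupied clusters adapts to the data rather than being fixed in advance.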
This paper applies a robust statistical scheme to the problem of unsupervised learning of high-dimensional data. We develop, analyze, and apply a new finite mixture model based on a generalization of the Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. We show that the mathematical properties of this distribution allow high-dimensional modeling without requiring dimensionality reduction and, thus, without a loss of information. This makes the generalized Dirichlet distribution more practical and useful. We propose a hybrid stochastic expectation maximization algorithm (HSEM) to estimate the parameters of the generalized Dirichlet mixture. The algorithm is called stochastic because it contains a step in which the data elements are assigned randomly to components in order to avoid convergence to a saddle point. The adjective "hybrid" is justified by the introduction of a Newton-Raphson step. Moreover, the HSEM algorithm autonomously selects the number of components by the introduction of an agglomerative term. The performance of our method is tested by the classification of several pattern-recognition data sets. The generalized Dirichlet mixture is also applied to the problems of image restoration, image object recognition and texture image database summarization for efficient retrieval. For the texture image summarization problem, results are reported for the Vistex texture image database from the MIT Media Lab.
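The stochastic step is easy to isolate. The sketch below runs a stochastic EM loop on a one-dimensional two-component Gaussian stand-in (not the generalized Dirichlet mixture itself): the E-step's responsibilities are used to draw hard random assignments, and the M-step refits on the sampled partition. All data and starting values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(-2, 1, 400), rng.normal(2, 1, 600)])

# crude starting guesses for the two components
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibilities under the current parameters
    dens = pi[:, None] * norm.pdf(x, mu[:, None], sigma[:, None])
    resp = dens / dens.sum(axis=0)
    # stochastic step: draw a hard random assignment for every point
    z = (rng.random(x.size) < resp[1]).astype(int)
    # M-step on the sampled partition (skip a component if it empties)
    for j in (0, 1):
        sel = x[z == j]
        if sel.size:
            pi[j] = sel.size / x.size
            mu[j] = sel.mean()
            sigma[j] = sel.std() + 1e-9
```

Randomizing the assignments perturbs the trajectory at every sweep, which is what lets the stochastic step escape saddle points that deterministic EM can get stuck on; the paper's "hybrid" variant additionally applies a Newton-Raphson step to the component parameters.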
This paper proposes an unsupervised algorithm for learning a finite Dirichlet mixture model. An important part of the unsupervised learning problem is determining the number of clusters that best describes the data. We extend the minimum message length (MML) principle to determine the number of clusters in the case of Dirichlet mixtures. Parameter estimation is done by the expectation-maximization algorithm. The resulting method is validated for one-dimensional and multidimensional data. For the one-dimensional data, the experiments concern artificial and real SAR image histograms. The validation for multidimensional data involves synthetic data and two real applications: shadow detection in images and summarization of texture image databases for efficient retrieval. A comparison with results obtained for other selection criteria is provided.