We consider the problem of both supervised and unsupervised classification for multidimensional data that are nongaussian and of mixed types (continuous and/or discrete). An important subclass of graphical model techniques called Generalized Linear Statistics (GLS) is used to capture the underlying statistical structure of these complex data. GLS exploits the properties of exponential family distributions, which are assumed to describe the data components, and constrains latent variables to a lower dimensional parameter subspace. Based on the latent variable information, classification is performed in the natural parameter subspace with classical statistical techniques. The benefits of decision making in parameter space is illustrated with examples of categorical data text categorization and mixed-type data classification. As a text document preprocessing tool, an extension from binary to categorical data of the conditional mutual information maximization based feature selection algorithm is presented.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.