Generalized Statistical Methods for Mixed Exponential Families, Part II: Applications

Cécile Levasseur, Uwe F. Mayer and Kenneth Kreutz-Delgado

Abstract: This work considers the problem of supervised and unsupervised classification for vector data of mixed types. An important subclass of graphical modeling techniques called Generalized Linear Statistics (GLS) is used to capture the underlying statistical structure of these complex data. The GLS methodology exploits the split between data space and natural parameter space for exponential family distributions, which are assumed to describe the data components, and constrains latent variables to a lower dimensional parameter subspace. Its critical advantage is that it allows one to transfer high-dimensional mixed-type data components to low-dimensional common-type latent variables, which are then used to perform effective classification in a much simpler manner with well-known classical linear techniques for continuous parameters. We first demonstrate our ability to learn a GLS generative model in a controlled environment using synthetic data of mixed types. We then illustrate the benefits of making decisions in parameter space with two groups of examples: supervised and unsupervised text categorization on categorical data, and classification and clustering of mixed-type data. These examples involve synthetic data and real data sets from the University of California, Irvine (UCI) machine learning repository.

Key words: Generalized Linear Statistics (GLS), exponential family distributions, latent variables, dimensionality reduction, text categorization.
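
As a rough illustration of the idea summarized in the abstract, the sketch below draws synthetic mixed-type vectors whose natural parameters are confined to a low-dimensional subspace. It is not the authors' code. The affine form theta = a V + b, the choice of one Gaussian, one Bernoulli and one Poisson component, and the names sample_mixed_data, V, b and a are all assumptions made here for illustration.

```python
import numpy as np


def sample_mixed_data(n_samples, V, b, rng):
    """Draw mixed-type vectors with natural parameters in an affine subspace.

    Hypothetical setup (not taken from the paper):
      V : (q, 3) matrix spanning the low-dimensional parameter subspace
      b : (3,) offset in natural parameter space
    Component 0 is Gaussian with unit variance (natural parameter = mean),
    component 1 is Bernoulli (natural parameter = log-odds),
    component 2 is Poisson (natural parameter = log-rate).
    """
    q = V.shape[0]
    a = rng.normal(size=(n_samples, q))      # low-dimensional latent coordinates
    theta = a @ V + b                        # natural parameters, one row per sample

    gauss = rng.normal(loc=theta[:, 0], scale=1.0)
    bern = rng.binomial(1, 1.0 / (1.0 + np.exp(-theta[:, 1])))  # logistic link
    pois = rng.poisson(np.exp(theta[:, 2]))                     # log link

    X = np.column_stack([gauss, bern, pois])  # observed mixed-type data
    return X, a, theta


rng = np.random.default_rng(0)
V = np.array([[1.0, 2.0, 0.5]])   # q = 1 latent dimension, 3 data components
b = np.array([0.0, -0.5, 0.2])
X, a, theta = sample_mixed_data(500, V, b, rng)
```

Here the three observed columns have different types (real, binary, count), while the latent coordinates a are continuous and one-dimensional, so a standard linear classifier or clustering method could in principle be run on estimates of a rather than on X directly.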


You can download a copy of this paper (about 14 pages).

Mayer24.pdf  This file is in Portable Document Format (784 Kbytes).




mayer@math.utah.edu
Last updated: Wed Sep 9 22:51:25 PDT 2009