Visualisation of heterogeneous data with the Generalised Generative Topographic Mapping

Michel F. Randrianandrasana, Shahzad Mumtaz and Ian T. Nabney
Nonlinearity and Complexity Research Group, Aston University, Birmingham B4 7ET, UK
{randrimf, mumtazs, i.t.nabney}@aston.ac.uk

Keywords: Data visualisation, GTM, LTM, heterogeneous and missing data

Abstract: Heterogeneous and incomplete datasets are common in many real-world visualisation applications. The probabilistic nature of the Generative Topographic Mapping (GTM), which was originally developed for complete continuous data, can be extended to model heterogeneous (i.e. containing both continuous and discrete values) and missing data. This paper describes and assesses the resulting model on both synthetic and real-world heterogeneous data with missing values.

1 INTRODUCTION

Type-specific data analysis has been well studied in machine learning¹. In the last couple of decades, the need to analyse mixed-type data has received some attention from the machine learning community because real-world processes often generate data of mixed type. An example of such mixed-type data is a hospital's patient database, where typical fields include age (continuous), gender (binary), test results (binary or continuous), height (continuous), etc. In practice a number of ad-hoc methods are used to analyse mixed-type data. For instance, if there is a mixture of continuous and discrete variables, then either all the discrete variables are converted to some numerical scoring equivalent or, conversely, all the continuous variables are discretised. Alternatively, both types of variables are analysed separately and the results are then combined using some criteria. According to Krzanowski (1983), "All these options involve some element of subjectivity, with possible loss of information, and do not appear very satisfactory in general".

¹ …type-data-analysis-i-overview.html

The ideal general solution for analysing such heterogeneous data is to specify a model that builds a joint distribution with an appropriate noise model for each type of feature (for example, a Bernoulli distribution for binary features, a multinomial distribution for multi-category features and a Gaussian distribution for continuous features) and then fit the model to data (de Leon and Chough, 2013).

A multivariate distribution that can model random variables of different types is not available. However, one possible way of jointly modelling discrete and continuous features is to use a latent variable approach to model the correlation between features of different types. For example, a dataset consisting of continuous, binary and multi-category features can be modelled using a conditional distribution that is a product of Gaussian, Bernoulli and multinomial distributions. This approach has previously been discussed as a possible extension for GTM (Bishop and Svensen, 1998; Bishop et al., 1998) and PCA (Tipping, 1999) models. The idea was implemented in (Yu and Tresp, 2004) to visualise a mixture of continuous and binary data on a single continuous latent space by extending probabilistic principal component analysis (PPCA), and was called generalised PPCA (GPPCA). GPPCA is a linear probabilistic model and uses a variational Expectation-Maximisation (EM) algorithm for parameter estimation. There are other latent variable models for mixed-type datasets but, to the best of our knowledge, most of these are linear models (Moustaki, 1996; Sammel et al., 1997; Dunson, 2000; Teixeira-Pinto and Normand, 2009) and they use either numerical integration or a sampling approach to handle the intractable integration required to fit a latent variable model of this type. It is important to mention that little work has been reported in the literature on analysing mixed-type data using a latent variable formalism (de Leon and Chough, 2013). As a generalisation of GTM, a latent trait model (LTM) to handle discrete data was proposed in (Kabán and Girolami, 2001): the model used the exponential family of distributions. In this paper we describe and assess a probabilistic non-linear latent variable model to visualise a mixed-type dataset on a single continuous latent space. We shall refer to this model as a generalised GTM (GGTM).

The treatment of incomplete data for the standard GTM has been explored in (Sun et al., 2002) using an EM approach which estimates the parameters of the mixing components of the GTM and the missing values at the same time. The same approach is used in this paper to visualise mixed-type data containing missing values with GGTM.

2 Visualisation of heterogeneous data with GGTM

The main goal of a latent variable model is to find a low-dimensional manifold, $\mathcal{H}$, with $M$ dimensions (usually $M = 2$) for the distribution $p(x)$ of a high-dimensional data space, $\mathcal{D}$, with $D$ dimensions. Latent variable models have been developed to handle a dataset where all the features are of the same type.

Suppose that the $D$-dimensional data space is defined by $|\mathcal{R}|$ continuous, $|\mathcal{B}|$ binary and $|\mathcal{C}|$ multi-categorical features respectively. The link functions for continuous, binary and multi-category features are defined in equations (1), (2) and (3) respectively:

$$\mu^{\mathcal{R}} = \Phi(z) W^{\mathcal{R}}. \tag{1}$$

$$\mu^{\mathcal{B}} = g^{\mathcal{B}}\big(\Phi(z) W^{\mathcal{B}}\big) = \frac{\exp\big(\Phi(z) W^{\mathcal{B}}\big)}{1 + \exp\big(\Phi(z) W^{\mathcal{B}}\big)}. \tag{2}$$

$$\mu^{\mathcal{C}}_{s_d} = g^{\mathcal{C}}\big(\Phi(z) w^{\mathcal{C}}_{s_d}\big) = \frac{\exp\big(\Phi(z) w^{\mathcal{C}}_{s_d}\big)}{\sum_{s'_d = 1}^{S_d} \exp\big(\Phi(z) w^{\mathcal{C}}_{s'_d}\big)}. \tag{3}$$

We write each observation vector, $x_n$, in terms of sub-vectors $x_n^{\mathcal{R}}$, $x_n^{\mathcal{B}}$ and $x_n^{\mathcal{C}}$ for continuous, binary and multi-category features respectively. The likelihood of each type of feature is given by

$$p(x_n^{\mathcal{R}} \mid z, W^{\mathcal{R}}, \beta) = p(x_n^{\mathcal{R}} \mid \mu^{\mathcal{R}}, \beta) = \left(\frac{\beta}{2\pi}\right)^{|\mathcal{R}|/2} \exp\left\{ -\frac{\beta}{2} \left\| \mu^{\mathcal{R}} - x_n^{\mathcal{R}} \right\|^2 \right\}. \tag{4}$$

$$p(x_n^{\mathcal{B}} \mid z, W^{\mathcal{B}}) = p(x_n^{\mathcal{B}} \mid \mu^{\mathcal{B}}) = \prod_{d=1}^{|\mathcal{B}|} \big(\mu_d^{\mathcal{B}}\big)^{x_{nd}^{\mathcal{B}}} \big(1 - \mu_d^{\mathcal{B}}\big)^{(1 - x_{nd}^{\mathcal{B}})}. \tag{5}$$

$$p(x_n^{\mathcal{C}} \mid z, W^{\mathcal{C}}) = p(x_n^{\mathcal{C}} \mid \mu^{\mathcal{C}}) = \prod_{d=1}^{|\mathcal{C}|} \prod_{s_d = 1}^{S_d} \big(\mu_{s_d}^{\mathcal{C}}\big)^{x_{n s_d}^{\mathcal{C}}}. \tag{6}$$

Then we compute the product of the likelihoods for the Gaussian (equation (4)), Bernoulli (equation (5)) and multinomial (equation (6)) distributions, and find the distribution of $x$ by integrating over the latent variables, $z$,

$$p(x \mid \Omega) = \int p(x_n^{\mathcal{R}} \mid z, W^{\mathcal{R}}, \beta)\, p(x_n^{\mathcal{B}} \mid z, W^{\mathcal{B}})\, p(x_n^{\mathcal{C}} \mid z, W^{\mathcal{C}})\, p(z)\, dz, \tag{7}$$

where $\Omega = \{W^{\mathcal{R}}, \beta, W^{\mathcal{B}}, W^{\mathcal{C}}\}$ contains all the model parameters. We use as prior distribution, $p(z)$, a sum of delta functions, as for the standard GTM and LTM:

$$p(z) = \frac{1}{K} \sum_{k=1}^{K} \delta(z - z_k). \tag{8}$$

The data distribution can now be derived from equations (7) and (8), where we use the same mixing coefficient for all components (i.e. $\pi_k = 1/K$),

$$p(x \mid \Omega) = \sum_{k=1}^{K} \pi_k\, p(x \mid z_k, \Omega). \tag{9}$$

The log-likelihood of the complete data takes the form

$$L(\Omega) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, p(x_n \mid z_k, \Omega). \tag{10}$$

The choice of noise model is related to the corresponding type of data and also to the link function mapping from latent to data space (Kabán and Girolami, 2001). The exponential family of distributions is used here to model mixed-type data under the latent variable framework. From here onward, to simplify the notation, we use $x^{\mathcal{M}}$, where $\mathcal{M}$ can represent either $\mathcal{R}$, $\mathcal{B}$ or $\mathcal{C}$, to indicate the type of feature for a data point $x$.
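As an illustration of equations (1) to (6) and (10), the following is a minimal NumPy sketch rather than the authors' implementation. It assumes a single multi-category feature in 1-of-S encoding, a precomputed basis matrix `Phi` with one row per latent grid point, and equal mixing coefficients $\pi_k = 1/K$. The function names (`link_means`, `log_component_likelihoods`, `log_likelihood`) and array layouts are hypothetical choices made for this example.

```python
import numpy as np

def link_means(Phi, W_R, W_B, W_C):
    """Equations (1)-(3): map basis activations to the mean of each feature type.

    Phi: (K, L) basis activations at the K latent grid points (assumed given).
    W_R, W_B, W_C: (L, R), (L, B), (L, S) weight sub-matrices.
    """
    mu_R = Phi @ W_R                                 # identity link, eq. (1)
    mu_B = 1.0 / (1.0 + np.exp(-(Phi @ W_B)))        # logistic link, eq. (2)
    a = Phi @ W_C
    a = a - a.max(axis=1, keepdims=True)             # stabilise the softmax
    mu_C = np.exp(a) / np.exp(a).sum(axis=1, keepdims=True)  # softmax link, eq. (3)
    return mu_R, mu_B, mu_C

def log_component_likelihoods(X_R, X_B, X_C, mu_R, mu_B, mu_C, beta):
    """Log of the product of eqs. (4)-(6) for every (latent point, data point) pair.

    Returns an array of shape (K, N) holding log p(x_n | z_k).
    """
    n_cont = X_R.shape[1]
    sq_dist = ((X_R[None, :, :] - mu_R[:, None, :]) ** 2).sum(axis=2)
    log_gauss = 0.5 * n_cont * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * sq_dist
    # Clip probabilities so that log() never sees exactly 0 or 1.
    p_B = np.clip(mu_B, 1e-12, 1.0 - 1e-12)
    p_C = np.clip(mu_C, 1e-12, 1.0)
    log_bern = np.log(p_B) @ X_B.T + np.log(1.0 - p_B) @ (1.0 - X_B).T
    log_mult = np.log(p_C) @ X_C.T
    return log_gauss + log_bern + log_mult

def log_likelihood(log_p):
    """Equation (10) with pi_k = 1/K, computed with a log-sum-exp over k."""
    K = log_p.shape[0]
    m = log_p.max(axis=0)
    return float(np.sum(m + np.log(np.exp(log_p - m).sum(axis=0)) - np.log(K)))
```

Working in the log domain keeps the product of the three likelihoods from underflowing when there are many features; the per-type terms are simply added.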

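2.1 An expectation maximization (EM) algorithm for GGTM

Our proposed model is based on a mixture of distributions where each component is a product of Gaussian, Bernoulli and/or multinomial distributions. The parameters of the mixture model can be determined using an EM algorithm: in the E-step, we use the current parameter set, $\Omega$, to compute the posterior probabilities (responsibilities) using Bayes' theorem,

$$r_{kn} = p(z_k \mid x_n, W) = \frac{\pi_k\, p(x_n \mid z_k, W)}{\sum_{k'=1}^{K} \pi_{k'}\, p(x_n \mid z_{k'}, W)}, \tag{11}$$

where

$$p(x_n \mid z_k, W) = p(x_n^{\mathcal{R}} \mid z_k, W^{\mathcal{R}}, \beta)\, p(x_n^{\mathcal{B}} \mid z_k, W^{\mathcal{B}})\, p(x_n^{\mathcal{C}} \mid z_k, W^{\mathcal{C}}). \tag{12}$$

We use the maximization of the relative likelihood (Bishop, 1995), which does not require the computation of the log of a sum. The relative likelihood between the old and new set of parameters can be calculated as

$$
\begin{aligned}
Q &= \sum_{n=1}^{N} \sum_{k=1}^{K} r_{kn} \log\big\{ p(x_n \mid z_k, W)\, p(z_k) \big\} \\
  &= \sum_{n=1}^{N} \sum_{k=1}^{K} r_{kn} \Big\{ x_n^{\mathcal{R}} \theta_k^{\mathcal{R}} - G(\theta_k^{\mathcal{R}}) + \log\big(p_0(x_n^{\mathcal{R}})\big) \\
  &\qquad\qquad + x_n^{\mathcal{B}} \theta_k^{\mathcal{B}} - G(\theta_k^{\mathcal{B}}) + \log\big(p_0(x_n^{\mathcal{B}})\big) \\
  &\qquad\qquad + x_n^{\mathcal{C}} \theta_k^{\mathcal{C}} - G(\theta_k^{\mathcal{C}}) + \log\big(p_0(x_n^{\mathcal{C}})\big) + \log\big(p(z_k)\big) \Big\},
\end{aligned}
\tag{13}
$$

where $\theta_k^{\mathcal{M}} = \Phi(z_k) W^{\mathcal{M}}$. In the M-step we maximize the function $Q$ with respect to each type of weight sub-matrix $W^{\mathcal{M}}$ as

$$\frac{\partial Q}{\partial W^{\mathcal{M}}} = \Phi^{T} \big[ R X^{\mathcal{M}} - E\, g(\Phi W^{\mathcal{M}}) \big], \tag{14}$$

where $\Phi$ is a $K \times L$ matrix, $R$ is a $K \times N$ matrix calculated using equation (11), $X^{\mathcal{M}}$ is an $N \times |\mathcal{M}|$ data sub-matrix and the diagonal matrix $E$ contains the values

$$e_{kk} = \sum_{n=1}^{N} r_{kn}. \tag{15}$$

In the case of an isotropic Gaussian with unit variance, the link function $g(\cdot)$ is the identity and by setting the derivative to zero we obtain, as in the standard GTM (Bishop and Svensen, 1998),

$$\widehat{W}^{\mathcal{R}} = (\Phi^{T} E \Phi)^{-1} \Phi^{T} R X^{\mathcal{R}}. \tag{16}$$

For other link functions, a Generalised EM (GEM) (McLachlan and Krishnan, 1997) algorithm is used because convergence to the local maximum is guaranteed without maximizing the relative likelihood (Kabán and Girolami, 2001). A simple gradient-based update can be obtained for $W^{\mathcal{M}}$ from equation (14),

$$\Delta W^{\mathcal{M}} \propto \Phi^{T} \big[ R X^{\mathcal{M}} - E\, g(\Phi W^{\mathcal{M}}) \big], \tag{17}$$

where this can be used as an inner loop in the M-step. The correlations between the dimensions of $\phi_l$ responsible for preserving the neighbourhood are required for a topographic organisation given that the natural parameter $\theta^{\mathcal{M}}$ is being updated under the gradient update of the weight matrix $W^{\mathcal{M}}$ (Kabán and Girolami, 2001):

$$\Delta \theta_{kd}^{\mathcal{M}} = \phi_k\, \Delta W^{\mathcal{M}} = \eta \sum_{n=1}^{N} \sum_{k'=1}^{K} r_{k'n}\, \phi_k \phi_{k'}^{T} \big( x_{nd}^{\mathcal{M}} - \mu_{k'd}^{\mathcal{M}} \big),$$

where $\phi_k$ denotes the $k$-th row of $\Phi$ and $\eta$ is the step size implied by the proportionality in equation (17).

3 Visualisation of missing data with GGTM

The EM framework supports the treatment of missing values in the GGTM model.

3.1 Continuous data

The data points $x_n$ are written as $(x_n^{o}, x_n^{m})$, where $m$ and $o$ represent subvectors and submatrices of the parameters matching the missing and observed components of the data (Ghahramani and Jordan, 1994). Binary indicator variables $\zeta_{nk}$ are introduced to specify which component of the mixture model generated the data point. Both the indicator variables $\zeta_{nk}$ and the missing inputs $x_n^{m}$ are treated as hidden variables in the EM algorithm. The changes made to the EM algorithm for GTM are detailed in (Sun et al., 2002).

3.2 Discrete data

The missing values are inferred in the E-step using the usual posterior means with responsibility $r_{kn}$ computed on the observed data,

$$E[x_n^{m} \mid x_n^{o}, \mu^{\mathcal{D}}] = \sum_{k=1}^{K} r_{kn}\, \mu_k^{\mathcal{D}}, \tag{18}$$

where $\mathcal{D} = \{\mathcal{B} \text{ or } \mathcal{C}\}$. In the M-step, the weight matrix $\widehat{W}^{\mathcal{D}}$ is updated first using the complete training data and we then update $\mu_k^{\mathcal{D}}$ with

$$\widehat{\mu}_k^{\mathcal{D}} = g^{\mathcal{D}}\big(\Phi(z_k) \widehat{W}^{\mathcal{D}}\big). \tag{19}$$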
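The E-step and M-step updates above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the input `log_p` is assumed to hold $\log p(x_n \mid z_k, W)$ in a `(K, N)` array, the mixing coefficients are equal, $\Phi^{T} E \Phi$ is assumed to be non-singular, and the step size `eta` and all function names are hypothetical.

```python
import numpy as np

def responsibilities(log_p):
    """Equation (11) with equal mixing coefficients; log_p has shape (K, N)."""
    a = log_p - log_p.max(axis=0, keepdims=True)   # subtract column max for stability
    r = np.exp(a)
    return r / r.sum(axis=0, keepdims=True)        # each column sums to one

def update_W_gaussian(Phi, R, X_R):
    """Equation (16): closed-form M-step for the identity link."""
    E = np.diag(R.sum(axis=1))                     # equation (15)
    return np.linalg.solve(Phi.T @ E @ Phi, Phi.T @ R @ X_R)

def gradient_step_bernoulli(Phi, R, X_B, W_B, eta=0.01):
    """Equation (17): one gradient (GEM) step for the logistic link.

    eta is an assumed, hypothetical step size.
    """
    E = np.diag(R.sum(axis=1))
    mu = 1.0 / (1.0 + np.exp(-(Phi @ W_B)))        # g(Phi W), shape (K, B)
    grad = Phi.T @ (R @ X_B - E @ mu)              # right-hand side of eq. (14)
    return W_B + eta * grad

def impute_discrete(R, mu_D, X_D, missing):
    """Equation (18): replace missing discrete entries by their posterior mean.

    R: (K, N) responsibilities computed on the observed features only.
    mu_D: (K, D) component means; missing: (N, D) boolean mask.
    """
    expected = R.T @ mu_D                          # (N, D) posterior means
    return np.where(missing, expected, X_D)
```

In a full training loop the gradient step would typically be repeated a few times inside each M-step for the binary and multi-category weights, while the continuous weights use the closed-form solve.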

4 Visualisation quality evaluation measures

Algorithms based on GTM are examples of unsupervised learning, which always give a result when applied to a particular dataset. Thus we cannot tell a priori what the expected or desired outcome is. This makes it difficult to judge which method is the best (i.e. tells us the most about a certain dataset). Here we use metrics that measure the degree of local neighbourhood similarity between data space and latent space, which can be calculated even if 'ground truth' is not known.

4.1 Trustworthiness, continuity and mean relative rank errors (MRREs)

Two well-known visualisation quality measures based on comparing neighbourhoods in the data space x and projection space z are trustworthiness and continuity (Venna and Kaski, 2001). A mapping is said to be trustworthy if the k-neighbourhood in the visualised space matches that in the data space, whereas it maintains continuity if the k-neighbourhood in the data space matches that in the visualised space. The higher the measure the better the visualisation, as this implies that local neighbourhoods are better preserved by the projection. We also use mean relative rank errors with respect to data and latent spaces (MRREx and MRREz), which measure the preservation of the rank of the k-nearest neighbours, in contrast to trustworthiness and continuity, which only consider matches in the k-neighbourhood (Lee and Verleysen, 2008). Note that the lower the MRRE the better the projection quality.

5 Experimental results

The GGTM was evaluated on both complete and missing synthetic and real-world datasets and compared with standard GTM for complete data. The weight matrix W was initialised using principal component analysis (PCA). For the metrics in Section 4, we computed pair-wise distances using Hamming distances for the binary features and Euclidean distances for the continuous features. For each distance matrix, we divided each column by its standard deviation. All experiments used 10-fold cross-validation. The visualisation quality measures were computed with a range of neighbourhood sizes (5, 10, 15, 20) and the mean of these measures over the different sizes and cross-validation runs was computed.

5.1 Synthetic dataset

The synthetic dataset was generated from an equiprobable mixture of two Gaussians, $N(m_k, I)$ (with $k = 1, 2$), with means $m_1 = (2.0, 3.5, 3.5)^{T}$ and $m_2 = (3.5, 4.5, 4.5)^{T}$. A dataset with 9-dimensional binary features from four classes was also generated (these classes were not used as inputs to the visualisation). Both continuous and binary data were combined to make a dataset of 12 features with 2,800 data points. The visualisation results of the complete and missing datasets (10% randomly removed) are shown in Figure 1 and the quality metrics are given in Table 1.

[Figure 1 panels: (a) GTM (training set); (b) GTM (test set); (c) GGTM (training set); (d) GGTM (test set); (e) GGTM missing (training set); (f) GGTM missing (test set)]
Figure 1: GTM and GGTM visualisations of the synthetic 12-dimensional datasets with 3 continuous and 9 binary features.

                   GTM complete     GGTM complete    GGTM missing
Trustworthiness    0.969 ± 0.003    0.949 ± 0.024    0.947 ± 0.027
Continuity         0.964 ± 0.003    0.970 ± 0.013    0.969 ± 0.014
MRREx              0.040 ± 0.000    0.043 ± 0.003    0.042 ± 0.003
MRREz              0.004 ± 0.000    0.038 ± 0.002    0.037 ± 0.002

Table 1: GTM and GGTM visualisation quality metrics of the 12-dimensional synthetic datasets. Each figure represents the average over a 10-fold cross-validation, reported with one standard deviation.

We also generated a dataset with two multi-category features with 8 and 16 categories in the first and second features respectively. We appended the multi-category features to the previous 12-dimensional dataset and used a 1-of-S encoding scheme for the multi-category features. Labels were based on the four classes in the binary data. The visualisation results of the 14-dimensional complete and missing datasets are shown in Figure 2 and the corresponding quality metrics are given in Table 2. The proportion of missing values has also been increased to 30%, 50%, 70% and 90% without substantially degrading the visualisation quality measures.

[Figure 2 panels: (a) GTM (training set); (b) GTM (test set); (c) GGTM (training set); (d) GGTM (test set); (e) GGTM missing (training set); (f) GGTM missing (test set)]
Figure 2: GTM and GGTM visualisations of the synthetic 14-dimensional datasets with 3 continuous, 9 binary and 2 multi-category features.

                   GTM complete     GGTM complete    GGTM missing
Trustworthiness    0.962 ± 0.004    0.977 ± 0.009    0.973 ± 0.014
Continuity         0.946 ± 0.008    0.980 ± 0.007    0.976 ± 0.013
MRREx              0.045 ± 0.001    0.044 ± 0.001    0.116 ± 0.005
MRREz              0.045 ± 0.001    0.041 ± 0.002    0.132 ± 0.005

Table 2: GTM and GGTM visualisation quality metrics of the 14-dimensional synthetic datasets.

5.2 Hypothyroid dataset

This real-world dataset is publicly available from the UCI data repository (Bache and Lichman, 2013). The dataset consists of two variable types: 15 binary and 6 continuous features. It contains three classes: primary hypothyroid, compensated hypothyroid and normal. The dataset was originally divided into a training set of 3,772 data points (93 with primary hypothyroid, 191 with compensated hypothyroid and 3,488 normal) and a test set of 3,428 data points (73 with primary hypothyroid, 177 with compensated hypothyroid and 3,178 normal). These training and test sets were merged prior to running a 10-fold cross-validation. The visualisation results of the complete and missing datasets are shown in Figure 3 and the quality metrics are given in Table 3.

[Figure 3 panels: (a) GTM (training set); (b) GTM (test set); (c) GGTM (training set); (d) GGTM (test set); (e) GGTM missing (training set); (f) GGTM missing (test set)]
Figure 3: GTM and GGTM visualisations of the thyroid disease datasets. The cyan circles, red plus signs and blue squares represent primary hypothyroid, compensated hypothyroid and normal respectively.
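For reference, the trustworthiness and continuity measures of Section 4.1 can be sketched as below. This is a minimal NumPy illustration based on the standard rank-based definition of Venna and Kaski, valid for neighbourhood sizes k < N/2; it is not the evaluation code used for the tables. It takes precomputed distance matrices, and the helper `pairwise_euclidean` is an assumption for the example (the experiments in this paper combine Hamming and Euclidean distances instead). All function names are hypothetical.

```python
import numpy as np

def pairwise_euclidean(X):
    """Dense (N, N) matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def _ranks(D):
    """ranks[i, j] = position of j in the distance ordering from i (self = 0)."""
    D = np.array(D, dtype=float)       # work on a float copy
    np.fill_diagonal(D, -np.inf)       # force each point to rank itself first
    order = np.argsort(D, axis=1, kind="stable")
    n = D.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks

def trustworthiness(D_data, D_latent, k):
    """Penalise points that enter a latent k-neighbourhood without being
    among the k nearest neighbours in the data space."""
    n = D_data.shape[0]
    r_data = _ranks(D_data)
    r_latent = _ranks(D_latent)
    intruders = (r_latent >= 1) & (r_latent <= k) & (r_data > k)
    penalty = (r_data[intruders] - k).sum()
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty

def continuity(D_data, D_latent, k):
    """Same measure with the roles of data space and latent space swapped."""
    return trustworthiness(D_latent, D_data, k)
```

A projection that preserves every k-neighbourhood exactly scores 1.0 on both measures; averaging the scores over k in (5, 10, 15, 20) mirrors the protocol described in Section 5.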

                   GTM complete     GGTM complete    GGTM missing
Trustworthiness    0.718 ± 0.022    0.718 ± 0.015    0.716 ± 0.014
Continuity         0.804 ± 0.017    0.843 ± 0.014    0.835 ± 0.007
MRREx              0.018 ± 0.000    0.019 ± 0.000    0.019 ± 0.000
MRREz              0.016 ± 0.000    0.016 ± 0.000    0.016 ± 0.000

Table 3: GTM and GGTM visualisation quality metrics of the hypothyroid disease datasets.

6 CONCLUSIONS

A generalisation of the GTM to heterogeneous and missing data has been described and assessed in this paper. This involves modelling the continuous and discrete data with Gaussian and Bernoulli/multinomial distributions respectively. These extensions were suggested in (Bishop et al., 1998) but this is the first time the mathematical details have been worked out and an implementation written and evaluated.

Visualisation results for synthetic data using the GGTM have shown more compact clusters for each class compared to the standard GTM, whereas for the real dataset no significant difference was observed. For synthetic datasets with missing values, GGTM visualisations have greater compactness for each class. In terms of visualisation quality evaluation metrics, we observed that for a mix of continuous and binary data, the trustworthiness and MRREx are slightly better for standard GTM compared to GGTM, whereas the continuity and MRREz were better for GGTM compared to standard GTM. However, for a mix of continuous, binary and multi-category features, all the quality evaluation measures were better for GGTM compared to the standard GTM. Missing values have caused limited deterioration in results compared to the complete data case.

REFERENCES

Bache, K. and Lichman, M. (2013). UCI machine learning repository.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford University Press.

Bishop, C. M. and Svensen, M. (1998). GTM: The generative topographic mapping. Neural Computation, 10(1):215–234.

Bishop, C. M., Svensen, M., and Williams, C. K. I. (1998). Developments of the generative topographic mapping. Neurocomputing, 21(1):203–224.

de Leon, A. R. and Chough, K. C. (2013). Analysis of Mixed Data: Methods & Applications. Taylor & Francis Group. Chapman and Hall/CRC.

Dunson, D. B. (2000). Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 62(2):355–366.

Ghahramani, Z. and Jordan, M. I. (1994). Learning from incomplete data. Technical Report AIM-1509.

Kabán, A. and Girolami, M. (2001). A combined latent class and trait model for the analysis and visualization of discrete data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(8):859–872.

Krzanowski, W. J. (1983). Distance between populations using mixed continuous and categorical variables. Biometrika, 70(1):235–243.

Lee, J. A. and Verleysen, M. (2008). Rank-based quality assessment of nonlinear dimensionality reduction. In ESANN, pages 49–54.

McLachlan, G. and Krishnan, T. (1997). The EM algorithm and extensions. Wiley, New York.

Moustaki, I. (1996). A latent trait and a latent class model for mixed observed variables. British Journal of Mathematical and Statistical Psychology, 49(2):313–334.

Sammel, M. D., Ryan, L. M., and Legler, J. M. (1997). Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society. Series B (Methodological), 59(3):667–678.

Sun, Y., Tino, P., and Nabney, I. (2002). Visualisation of incomplete data using class information constraints. In Winkler, J. and Niranjan, M., editors, Uncertainty in Geometric Computations, volume 704 of The Springer International Series in Engineering and Computer Science, pages 165–173. Springer US.

Teixeira-Pinto, A. and Normand, S. T. (2009). Correlated bivariate continuous and binary outcomes: issues and applications. Statistics in Medicine, 28(13):1753–1773.

Tipping, M. E. (1999). Probabilistic visualisation of high-dimensional binary data. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 592–598, Cambridge, MA, USA. MIT Press.

Venna, J. and Kaski, S. (2001). Neighborhood preservation in nonlinear projection methods: an experimental study. In Proceedings of the International Conference on Artificial Neural Networks, ICANN '01, pages 485–491, London, UK. Springer-Verlag.

Yu, K. and Tresp, V. (2004). Heterogenous data fusion via a probabilistic latent-variable model. In Müller-Schloer, C., Ungerer, T., and Bauer, B., editors, ARCS, volume 2981 of Lecture Notes in Computer Science, pages 20–30. Springer.
