Clustering And Data Mining In R - Workshop Supplement

Transcription

Clustering and Data Mining in RClustering and Data Mining in RWorkshop SupplementThomas GirkeDecember 10, 2011

Clustering and Data Mining in RIntroductionData PreprocessingData TransformationsDistance MethodsCluster LinkageHierarchical ClusteringApproachesTree CuttingNon-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques

Clustering and Data Mining in RIntroductionOutlineIntroductionData PreprocessingData TransformationsDistance MethodsCluster LinkageHierarchical ClusteringApproachesTree CuttingNon-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques

Clustering and Data Mining in RIntroductionWhat is Clustering?IClustering is the classification of data objects into similaritygroups (clusters) according to a defined distance measure.IIt is used in many fields, such as machine learning, datamining, pattern recognition, image analysis, genomics,systems biology, etc.

Clustering and Data Mining in RIntroductionWhy Clustering and Data Mining in R?IEfficient data structures and functions for clustering.IEfficient environment for algorithm prototyping andbenchmarking.IComprehensive set of clustering and machine learning libraries.IStandard for data analysis in many areas.

Clustering and Data Mining in RData PreprocessingOutlineIntroductionData PreprocessingData TransformationsDistance MethodsCluster LinkageHierarchical ClusteringApproachesTree CuttingNon-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques

Clustering and Data Mining in RData PreprocessingData TransformationsData TransformationsIChoice depends on data set!Center & standardize1. Center: subtract from each vector its mean2. Standardize: devide by standard deviationI Mean 0 and STDEV 1Center & scale with the scale() fuction1. Center: subtract from each vector its mean2. Scale: divide centered vector by their root mean square (rms)vunu 1 Xxi 2xrms tn 1i 1III Mean 0 and STDEV 1Log transformationRank transformation: replace measured values by ranksNo transformation

Clustering and Data Mining in RData PreprocessingDistance MethodsDistance MethodsIList of most common ones!Euclidean distance for two profiles X and Yvu nuXd(X , Y ) t(xi yi )2i 1Disadvantages: not scale invariant, not for negative correlationsIMaximum, Manhattan, Canberra, binary, Minowski, .ICorrelation-based distance: 1 rI Pearson correlation coefficient (PCC)PnPnPnn i 1 xi yi i 1 xi i 1 yir q PPnPnPnn( i 1 xi2 ( i 1 xi )2 )( i 1 yi2 ( i 1 yi )2 )IDisadvantage: outlier sensitiveSpearman correlation coefficient (SCC)Same calculation as PCC but with ranked values!

Clustering and Data Mining in RData PreprocessingCluster LinkageCluster LinkageSingle LinkageComplete LinkageAverage Linkage

Clustering and Data Mining in RHierarchical ClusteringOutlineIntroductionData PreprocessingData TransformationsDistance MethodsCluster LinkageHierarchical ClusteringApproachesTree CuttingNon-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques

Clustering and Data Mining in RHierarchical ClusteringHierarchical Clustering Steps1. Identify clusters (items) with closest distance2. Join them to new clusters3. Compute distance between clusters (items)4. Return to step 1

Clustering and Data Mining in RHierarchical ClusteringHierarchical Clustering(a)Agglomerative .40.1g1g2g3g4g5

Clustering and Data Mining in RHierarchical ClusteringApproachesHierarchical Clustering Approaches1. Agglomerative approach (bottom-up)hclust() and agnes()2. Divisive approach (top-down)diana()

Clustering and Data Mining in RHierarchical ClusteringTree CuttingTree Cutting to Obtain Discrete Clusters1. Node height in tree2. Number of clusters3. Search tree nodes by distance cutoff

Clustering and Data Mining in RNon-Hierarchical ClusteringOutlineIntroductionData PreprocessingData TransformationsDistance MethodsCluster LinkageHierarchical ClusteringApproachesTree CuttingNon-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques

Clustering and Data Mining in RNon-Hierarchical ClusteringNon-Hierarchical ClusteringSelected Examples

Clustering and Data Mining in RNon-Hierarchical ClusteringK-MeansK-Means Clustering1. Choose the number of k clusters2. Randomly assign items to the k clusters3. Calculate new centroid for each of the k clusters4. Calculate the distance of all items to the k centroids5. Assign items to closest centroid6. Repeat until clusters assignments are stable

Clustering and Data Mining in RNon-Hierarchical ClusteringK-MeansK-Means(a)XXX(b)XXX(c)XXX

Clustering and Data Mining in RNon-Hierarchical ClusteringPrincipal Component AnalysisPrincipal Component Analysis (PCA)Principal components analysis (PCA) is a data reduction techniquethat allows to simplify multidimensional data sets to 2 or 3dimensions for plotting purposes and visual variance analysis.

Clustering and Data Mining in RNon-Hierarchical ClusteringPrincipal Component AnalysisBasic PCA StepsIICenter (and standardize) dataFirst principal component axisIIIAccross centroid of data cloudDistance of each point to that line is minimized, so that itcrosses the maximum variation of the data cloudSecond principal component axisIIOrthogonal to first principal componentAlong maximum variation in the dataI1st PCA axis becomes x-axis and 2nd PCA axis y-axisIContinue process until the necessary number of principalcomponents is obtained

Clustering and Data Mining in RNon-Hierarchical ClusteringPrincipal Component AnalysisPCA on Two-Dimensional Data Set2nd2nd1st1st

Clustering and Data Mining in RNon-Hierarchical ClusteringPrincipal Component AnalysisIdentifies the Amount of Variability between ComponentsExamplePrincipal ComponentProportion of Variance1st62%2nd34%3rd3%Otherrest1st and 2nd principal components explain 96% of variance.

Clustering and Data Mining in RNon-Hierarchical ClusteringMultidimensional ScalingMultidimensional Scaling (MDS)IAlternative dimensionality reduction approachIRepresents distances in 2D or 3D spaceIStarts from distance matrix (PCA uses data points)

Clustering and Data Mining in RNon-Hierarchical ClusteringBiclusteringBiclusteringFinds in matrix subgroups of rows and columns which are as similar aspossible to each other and as different as possible to the remaining datapoints.Unclustered Clustered

Clustering and Data Mining in RNon-Hierarchical ClusteringMany Additional TechniquesRemember: There Are Many Additional Techniques!Continue with R manual section:”Clustering and Data Mining”

Clustering and Data Mining in R Non-Hierarchical Clustering Principal Component Analysis Basic PCA Steps I Center (and standardize) data I First principal component axis I Accross centroid of data cloud I Distance of each point to that line is minimized, so that it crosses the maximum variation of the data cloud