Active Semi-Supervision For Pairwise Constrained Clustering

Transcription

Proceedings of the SIAM International Conference on Data Mining (SDM-2004), pp. 333-344, Lake Buena Vista, FL, April 2004

Active Semi-Supervision for Pairwise Constrained Clustering

Sugato Basu, Computer Sciences, Univ. of Texas at Austin, Austin, TX 78712, sugato@cs.utexas.edu
Arindam Banerjee, Electrical and Computer Eng., Univ. of Texas at Austin, Austin, TX 78712, abanerje@ece.utexas.edu
Raymond J. Mooney, Computer Sciences, Univ. of Texas at Austin, Austin, TX 78712, mooney@cs.utexas.edu

Abstract

Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.

1 Introduction

In many data mining and machine learning tasks, there is a large supply of unlabeled data but limited labeled data, since labeled data can be expensive to generate. Consequently, semi-supervised learning, learning from a combination of both labeled and unlabeled data, has become a topic of significant recent interest [6, 20, 30]. More specifically, semi-supervised clustering, the use of class labels or pairwise constraints on some examples to aid unsupervised clustering, has been the focus of several recent projects [4, 22, 33, 34].

In a semi-supervised clustering setting, the focus is on clustering large amounts of unlabeled data in the presence of a small amount of supervised data. In this setting, we consider a framework that has pairwise must-link and cannot-link constraints between points in a dataset (with an associated cost of violating each constraint), in addition to having distances between the points. These constraints specify that two examples must be in the same cluster (must-link) or different clusters (cannot-link) [33]. In real-world unsupervised learning tasks, e.g., clustering for speaker identification in a conversation [17], visual correspondence in multi-view image processing [7], clustering multi-spectral information from Mars images [32], etc., considering supervision in the form of constraints is generally more practical than providing class labels, since true labels may be unknown a priori, while it can be easier to specify whether pairs of points belong to the same cluster or to different clusters.

We propose a cost function for pairwise constrained clustering (PCC) that can be shown to be the configuration energy of a Hidden Markov Random Field (HMRF) over the data with a well-defined potential function and noise model. Then, the pairwise constrained clustering problem becomes equivalent to finding the HMRF configuration with the highest posterior probability, i.e., minimizing its energy. We present an algorithm for solving this problem.

Further, in order to maximize the utility of the limited supervised data available to a semi-supervised clustering algorithm, supervised training examples should be actively selected as maximally informative ones rather than chosen at random, if possible [27]. In that case, fewer constraints will be required to significantly improve the clustering accuracy. To this end, we present a new method for actively selecting good pairwise constraints for semi-supervised clustering.

Both our active learning and pairwise constrained clustering algorithms are linear in the size of the data, and hence easily scalable to large datasets. Our formulation can also handle very high dimensional data, as our experiments on text datasets demonstrate.

Section 2 outlines the pairwise constrained clustering framework, and Section 3 presents a refinement of KMeans clustering [13, 25], called PCKMeans, that utilizes pairwise constraints. In Section 4, we present a method for actively picking good constraints by asking queries of the form "Are these two examples in the same or different classes?". Experimental results on clustering high-dimensional text data and UCI data demonstrate that (1) PCKMeans clustering with constraints performs considerably better than unconstrained KMeans clustering, and (2) active PCKMeans achieves significantly steeper learning curves compared to PCKMeans with random pairwise queries.

2 Pairwise Constrained Clustering

Centroid-based partitional clustering algorithms (e.g., KMeans) find a disjoint $K$-partitioning $\{X_h\}_{h=1}^{K}$ of a dataset $X$ (with each partition having a centroid $\mu_h$) such that the total distance between the points and the cluster centroids is (locally) minimized. We introduce a framework for pairwise constrained clustering (PCC) that has pairwise must-link and cannot-link constraints [33] between a subset of points in the dataset (with a cost of violating each constraint), in addition to distances between points. Since centroid-based clustering cannot handle pairwise constraints explicitly, we formulate the goal of clustering in the PCC framework as minimizing a combined objective function: the sum of the total distance between the points and their cluster centroids and the cost of violating the pairwise constraints.

For the PCC framework with both must-link and cannot-link constraints, let $\mathcal{M}$ be the set of must-link pairs such that $(x_i, x_j) \in \mathcal{M}$ implies $x_i$ and $x_j$ should be assigned to the same cluster, and $\mathcal{C}$ be the set of cannot-link pairs such that $(x_i, x_j) \in \mathcal{C}$ implies $x_i$ and $x_j$ should be assigned to different clusters. Note that we consider the tuples in $\mathcal{M}$ and $\mathcal{C}$ to be order-independent, i.e., $(x_i, x_j) \in \mathcal{M}$ implies $(x_j, x_i) \in \mathcal{M}$, and so also for $\mathcal{C}$. Let $W = \{w_{ij}\}$ and $\bar{W} = \{\bar{w}_{ij}\}$ be two sets that give the weights corresponding to the must-link constraints in $\mathcal{M}$ and the cannot-link constraints in $\mathcal{C}$ respectively. Let $l_i$ be the cluster assignment of a point $x_i$, where $l_i \in \{1, \dots, K\}$. The cost of violating must-link and cannot-link constraints is typically quantified by metrics [23]. We restrict our attention to the uniform metric (also known as the generalized Potts metric), for which the cost of violating a must-link $(x_i, x_j) \in \mathcal{M}$ is given by $w_{ij}\,\mathbb{1}[l_i \neq l_j]$, i.e., if the must-linked points are assigned to two different clusters, the incurred cost is $w_{ij}$. Similarly, the cost of violating a cannot-link $(x_i, x_j) \in \mathcal{C}$ is given by $\bar{w}_{ij}\,\mathbb{1}[l_i = l_j]$, i.e., if the cannot-linked points are assigned to the same cluster, the incurred cost is $\bar{w}_{ij}$. Note that here $\mathbb{1}$ is the indicator function, with $\mathbb{1}[\text{true}] = 1$ and $\mathbb{1}[\text{false}] = 0$. Using this model, the problem of PCC under must-link and cannot-link constraints is formulated as minimizing the following objective function, where point $x_i$ is assigned to the partition $X_{l_i}$ with centroid $\mu_{l_i}$:

$$
\mathcal{J}_{\mathrm{pckm}} = \sum_{x_i \in X} \|x_i - \mu_{l_i}\|^2 + \sum_{(x_i, x_j) \in \mathcal{M}} w_{ij}\,\mathbb{1}[l_i \neq l_j] + \sum_{(x_i, x_j) \in \mathcal{C}} \bar{w}_{ij}\,\mathbb{1}[l_i = l_j] \qquad (2.1)
$$

Minimizing this objective function can be shown to be equivalent to maximizing the posterior configuration probability of a Hidden Markov Random Field, details of which are given in Appendix A.1. The mathematical formulation of this framework was motivated by the metric labeling problem and the generalized Potts model [7, 23], for which Kleinberg et al. [23] proposed an approximation algorithm. Their formulation only considers the set $\mathcal{M}$ of must-link constraints, which we extended to the PCC framework by adding the set $\mathcal{C}$ of cannot-link constraints. Our proposed pairwise constrained KMeans (PCKMeans) algorithm greedily optimizes $\mathcal{J}_{\mathrm{pckm}}$ using a KMeans-type iteration with a modified cluster-assignment step. For experiments with text documents, we used a variant of KMeans called spherical KMeans (SPKMeans) [11] that has computational advantages for sparse high-dimensional text data vectors. We will present our algorithm and its motivation based on KMeans in Section 3, but all of it can be easily extended for SPKMeans. In the domains that we will be considering, e.g., text clustering, different costs for different pairwise constraints are not available in general, so for simplicity we will be assuming all elements of $W$ and $\bar{W}$ to have the same constant value $w$ in (2.1). We will make a detailed study of the effect of the choice of $w$ in Section 5.

Note that KMeans has a running time linear in the number of data points, the number of dimensions, and the number of clusters, while the running time of SPKMeans is linear in the number of non-zero entries in the input data matrix. So they are both linear in the size of the input, making our PCKMeans algorithm for the PCC framework quite efficient. PCKMeans can also handle sparse high-dimensional data (e.g. text, gene micro-array), since it has the computational advantage of SPKMeans in these domains.
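To make the objective in (2.1) concrete, the following minimal Python sketch (our illustration, not the authors' code) evaluates $\mathcal{J}_{\mathrm{pckm}}$ for a given assignment; the toy data, the index-pair constraint lists, and the single uniform weight `w` are hypothetical.

```python
import numpy as np

def pckm_objective(X, labels, centroids, must_link, cannot_link, w=1.0):
    """Evaluate the PCC objective of Eq. (2.1): distance term plus penalties
    for violated must-link and cannot-link constraints.
    Assumes a single uniform constraint weight w (generalized Potts metric)."""
    # Squared Euclidean distance of each point to its assigned centroid.
    dist = sum(np.sum((X[i] - centroids[labels[i]]) ** 2) for i in range(len(X)))
    # A must-link (i, j) is violated when the points land in different clusters.
    ml_penalty = sum(w for (i, j) in must_link if labels[i] != labels[j])
    # A cannot-link (i, j) is violated when the points land in the same cluster.
    cl_penalty = sum(w for (i, j) in cannot_link if labels[i] == labels[j])
    return dist + ml_penalty + cl_penalty

# Toy usage with hypothetical data: the must-linked pair (0, 1) is split
# across clusters, so the objective picks up an extra penalty of w.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centroids = np.array([[0.05, 0.0], [5.0, 5.0]])
print(pckm_objective(X, [0, 1, 1], centroids, must_link=[(0, 1)], cannot_link=[(0, 2)]))
```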

3 Clustering Algorithm

Given a set of data points $X$, a set of must-link constraints $\mathcal{M}$, a set of cannot-link constraints $\mathcal{C}$, the weight of the constraints $w$ and the number of clusters to form $K$, we propose an algorithm PCKMeans that finds a disjoint $K$-partitioning $\{X_h\}_{h=1}^{K}$ of $X$ (with each partition having a centroid $\mu_h$) such that $\mathcal{J}_{\mathrm{pckm}}$ is (locally) minimized.

In our previous work [4], we had observed that proper initialization of centroid-based algorithms like KMeans using the provided semi-supervision in the form of labeled points greatly improves clustering performance. Here, instead of labeled points, we are given supervision in the form of constraints on pairs of points; in this case also, our goal in the initialization step will be to get good estimates of the cluster centroids from the pairwise constraints.

In the initialization step of PCKMeans, we take the transitive closure of the must-link constraints [33] and augment the set $\mathcal{M}$ by adding these entailed constraints.¹ Note that our current model considers consistent (non-noisy) constraints, but it can also be extended to handle inconsistent (noisy) constraints, as discussed in Section 7. Let the number of connected components in the augmented set $\mathcal{M}$ be $\lambda$, which are used to create $\lambda$ neighborhood sets $\{N_p\}_{p=1}^{\lambda}$.

¹ A note on complexity: the transitive closure and constraint augmentation take … operations.

For every pair of neighborhoods $N_p$ and $N_{p'}$ that have at least one cannot-link between them, we add cannot-link constraints between every pair of points in $N_p$ and $N_{p'}$ and augment the cannot-link set $\mathcal{C}$ by these entailed constraints, again assuming consistency of constraints. From now on, we will refer to the augmented must-link and cannot-link sets as $\mathcal{M}$ and $\mathcal{C}$ respectively.

Note that the neighborhood sets $N_p$, which contain the neighborhood information inferred from the must-link constraints and are unchanged during the iterations of the algorithm, are different from the partition sets $X_h$, which contain the cluster partitioning information and get updated at each iteration of the algorithm.

After this preprocessing step we get $\lambda$ neighborhood sets $\{N_p\}_{p=1}^{\lambda}$, which are used to initialize the cluster centroids. If $\lambda \geq K$, where $K$ is the required number of clusters, we select the $K$ neighborhood sets of largest size and initialize the $K$ cluster centers with the centroids of these sets. If $\lambda < K$, we initialize $\lambda$ cluster centers with the centroids of the $\lambda$ neighborhood sets. We then look for a point that is connected by cannot-links to every neighborhood set; if such a point exists, it is used to initialize the $(\lambda + 1)$th cluster. If there are any more cluster centroids left uninitialized, we initialize them by random perturbations of the global centroid of $X$ [14].

Figure 1: PCKMeans algorithm

The algorithm PCKMeans alternates between the cluster assignment and centroid estimation steps (see Figure 1). In the cluster assignment step of PCKMeans, every point $x$ is assigned to a cluster such that it minimizes the sum of the distance of $x$ to the cluster centroid and the cost of constraint violations incurred by that cluster assignment (equivalently, satisfying as many must-links and cannot-links as possible by the assignment). Note that the cluster assignment step is order-dependent, since the subsets of $\mathcal{M}$ and $\mathcal{C}$ associated with each cluster may change with the assignment of a point. For our experiments, we consider a random ordering of the points in this assignment step. The centroid re-estimation step is the same as in KMeans, i.e., each cluster centroid is calculated by taking the mean of the points assigned to that cluster.
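The following is a minimal Python sketch of one such iteration, assuming points are rows of a NumPy array, constraints are index pairs, and a single uniform weight `w`; it is our illustration of the steps described above, not the paper's implementation.

```python
import numpy as np

def pckmeans_iteration(X, centroids, must_link, cannot_link, w=1.0, rng=None):
    """One PCKMeans iteration: greedy, order-dependent cluster assignment
    followed by centroid re-estimation (a sketch of the steps in Section 3)."""
    rng = rng or np.random.default_rng(0)
    n, K = len(X), len(centroids)
    labels = np.full(n, -1)                      # -1 marks "not yet assigned"
    for i in rng.permutation(n):                 # random point ordering
        costs = np.sum((centroids - X[i]) ** 2, axis=1)  # distance term per cluster
        for h in range(K):
            # Penalize must-links whose partner is already in a different cluster.
            costs[h] += sum(w for (a, b) in must_link
                            if (a == i and labels[b] not in (-1, h))
                            or (b == i and labels[a] not in (-1, h)))
            # Penalize cannot-links whose partner is already in this cluster.
            costs[h] += sum(w for (a, b) in cannot_link
                            if (a == i and labels[b] == h)
                            or (b == i and labels[a] == h))
        labels[i] = int(np.argmin(costs))
    # Centroid re-estimation: mean of the points assigned to each cluster.
    new_centroids = np.array([X[labels == h].mean(axis=0) if np.any(labels == h)
                              else centroids[h] for h in range(K)])
    return labels, new_centroids
```

A full implementation would repeat this step until the assignments stop changing, which is the convergent behavior established in Lemma 1 below.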
LEMMA 1. The algorithm PCKMeans converges to a local minimum of $\mathcal{J}_{\mathrm{pckm}}$.

Proof. For analyzing the cluster assignment step, let us consider (2.1). Each point $x_i$ moves to a new cluster only if the component of $\mathcal{J}_{\mathrm{pckm}}$ contributed by the point $x_i$ decreases. So when all points are given their new assignment, $\mathcal{J}_{\mathrm{pckm}}$ will decrease or remain the same.

For analyzing the centroid re-estimation step, let us consider an equivalent form of (2.1):

$$
\mathcal{J}_{\mathrm{pckm}} = \sum_{h=1}^{K} \Bigl( \sum_{x_i \in X_h} \|x_i - \mu_h\|^2 + \sum_{\substack{(x_i, x_j) \in \mathcal{M} \\ x_i \in X_h}} w_{ij}\,\mathbb{1}[l_i \neq l_j] + \sum_{\substack{(x_i, x_j) \in \mathcal{C} \\ x_i \in X_h}} \bar{w}_{ij}\,\mathbb{1}[l_i = l_j] \Bigr) \qquad (3.2)
$$

Each cluster centroid $\mu_h$ is re-estimated by taking the mean of the points in the partition $X_h$, which minimizes the component $\sum_{x_i \in X_h} \|x_i - \mu_h\|^2$ of $\mathcal{J}_{\mathrm{pckm}}$ in (3.2) contributed by the partition $X_h$. The pairwise constraints do not take part in this step because the constraints are not an explicit function of the centroids. As a result, only the first term of $\mathcal{J}_{\mathrm{pckm}}$ in (3.2) is minimized in this step.

Hence the objective function decreases after every cluster assignment and centroid re-estimation step until convergence, implying that the PCKMeans algorithm converges to a local minimum of $\mathcal{J}_{\mathrm{pckm}}$.

4 Active Learning Algorithm

In the semi-supervised setting, getting labels on data point pairs may be expensive. In this section, we discuss an active learning scheme in the PCC setting in order to improve clustering performance with as few queries as possible. Formally, the scheme has access to a noiseless oracle that can assign a must-link or cannot-link label to a given pair $(x_i, x_j)$, and it can pose a constant number of queries to the oracle.²

² The oracle can also give a don't-know response to a query, in which case that response is ignored (the pair is not considered as a constraint) and that query is not posed again later.

In order to get pairwise constraints that are more informative than random in the PCC model, we have developed an active learning scheme for selecting pairwise constraints using the farthest-first traversal scheme. The basic idea of farthest-first traversal of a set of points is to find points that are far from each other. In farthest-first traversal, we first select a starting point at random, choose the next point to be farthest from it and add it to the traversed set, then pick the following point farthest from the traversed set (using the standard notion of distance from a set, $d(x, S) = \min_{y \in S} d(x, y)$), and so on. Farthest-first traversal gives an efficient approximation of the $K$-center problem [18], and has also been used to construct hierarchical clusterings with performance guarantees at each level of the hierarchy [10]. For our data model (see Appendix A.2), we prove another interesting property of farthest-first traversal (see Appendix A.4) that justifies its use for active learning.

In [4], it was observed that initializing KMeans with centroids estimated from a set of labeled examples for each cluster gives significant performance improvements. Under certain generative model-based assumptions, one can connect the mixture of Gaussians model to the KMeans model [21]. A direct calculation using Chernoff bounds shows that if a particular cluster (with an underlying Gaussian model) with true centroid $\mu$ is seeded with $m$ points (drawn independently at random from the corresponding Gaussian distribution) and the estimated centroid is $\hat{\mu}$, then the probability of a large deviation of $\hat{\mu}$ from $\mu$ decays exponentially in $m$ (4.3). Thus, the deviation of the centroid estimates falls exponentially with the number of seeds, and hence seeding results in good initial centroids. Since good initial centroids are very critical for the success of greedy algorithms such as KMeans, we follow the same principle for the pairwise case: we will try to get as many points (proportional to the actual cluster size) as possible per cluster by asking pairwise queries, so that PCKMeans is initialized from a very good set of centroids.

The proposed active learning scheme has two phases. The first phase explores the given data to get $K$ pairwise disjoint non-null neighborhoods, each belonging to a different cluster in the underlying clustering of the data, as fast as possible. Note that even if there is only one point per neighborhood, this neighborhood structure defines a correct skeleton of the underlying clustering. For this phase, we propose an algorithm Explore that essentially uses the farthest-first scheme to form appropriate queries for getting the required pairwise disjoint neighborhoods.

At the end of Explore, at least one point has been obtained per cluster. The remaining queries are used to consolidate this structure. The cluster skeleton obtained from Explore is used to initialize $K$ pairwise disjoint non-null neighborhoods $\{N_p\}_{p=1}^{K}$. Then, given any point $x$ not in any of the existing neighborhoods, we will have to ask at most $K-1$ queries, by pairing $x$ up with a member from each of the disjoint neighborhoods, to find out the neighborhood to which $x$ belongs. This principle forms the second phase of our active learning algorithm, and we call the algorithm for this phase Consolidate. In this phase, we are able to get the correct cluster label of $x$ by asking at most $K-1$ queries.
So, $K-1$ pairwise labels are equivalent to a single pointwise label in the worst case. Now, we present the details of the algorithms for performing the exploration and the consolidation.

Figure 2: Algorithm Explore

4.1 Explore In the exploration phase, we use a very interesting property of the farthest-first traversal. Given a set of disjoint balls of unequal size in a metric space, we show that the farthest-first scheme is sure to get one point from each of the balls in a reasonably small number of attempts (see Appendix A.4). Hence, our algorithm Explore (see Figure 2) uses farthest-first traversal for getting a skeleton structure of the neighborhoods. In Explore, while queries are still allowed and $K$ pairwise disjoint neighborhoods have not yet been found, the point farthest from all the existing neighborhoods is chosen as a candidate for starting a new neighborhood. Queries are posed by pairing it with a random point from each of the existing neighborhoods. If the point is cannot-linked to all the existing neighborhoods, a new neighborhood is started with it. If a must-link is obtained for a particular neighborhood, the point is added to that neighborhood. This continues till the algorithm runs out of queries or $K$ pairwise disjoint neighborhoods have been found. In the latter case, the active learning scheme enters the consolidation phase.
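A minimal Python sketch of the Explore phase described above is given below; the `oracle(i, j)` callable (returning "must-link" or "cannot-link") and all function names are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def explore(X, K, max_queries, oracle, rng=None):
    """Sketch of the Explore phase: farthest-first traversal to find K
    pairwise-disjoint neighborhoods, one per underlying cluster.
    oracle(i, j) is a hypothetical callable returning "must-link" or
    "cannot-link" for a pair of point indices."""
    rng = rng or np.random.default_rng(0)
    queries = 0
    neighborhoods = [[int(rng.integers(len(X)))]]   # start from a random point
    while queries < max_queries and len(neighborhoods) < K:
        traversed = [i for nb in neighborhoods for i in nb]
        # Farthest-first: pick the point farthest from the traversed set,
        # using d(x, S) = min over y in S of d(x, y).
        dists = [min(np.linalg.norm(X[i] - X[j]) for j in traversed)
                 for i in range(len(X))]
        x = int(np.argmax(dists))
        new_neighborhood = True
        for nb in neighborhoods:            # query x against each neighborhood
            if queries >= max_queries:
                return neighborhoods
            queries += 1
            if oracle(x, int(rng.choice(nb))) == "must-link":
                nb.append(x)
                new_neighborhood = False
                break
        if new_neighborhood:                # cannot-linked to every neighborhood
            neighborhoods.append([x])
    return neighborhoods
```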

4.2 Consolidation The consolidation phase starts when at least one point has been obtained from each of the $K$ clusters. The basic idea in the consolidation phase is that since we now have points from all the clusters, the proper neighborhood of any random point $x$ can be determined within a maximum of $K-1$ queries. The queries will be formed by taking a point from each of the neighborhoods in turn and asking for the label on the pair till a must-link has been obtained. We will either get a must-link reply within $K-1$ queries, or else, if we get cannot-link replies for the $K-1$ queries to the first $K-1$ neighborhoods, we can infer that the point is must-linked to the remaining neighborhood. Note that it is practical to sort the neighborhoods in increasing order of the distance of their centroids from $x$, so that the correct must-link neighborhood for $x$ is encountered sooner in the querying process. The outline of the algorithm Consolidate is given in Figure 3.

Figure 3: Algorithm Consolidate
Both Explore and Consolidate add points to the clusters at a good rate. It can be shown using the result in Appendix A.4 that the Explore phase gets at least one point from each of the $K$ underlying clusters within a bounded maximum number of queries. When the active scheme has finished running Explore and is running Consolidate, it can also be shown, using a generalization of the coupon collector's problem (Appendix A.4), that with high probability it will quickly get one new point from each cluster. Consolidate therefore adds points to clusters at a faster rate than Explore, which is validated by our experiments in Section 5. Note that this analysis is for balanced clusters, but a similar analysis with unbalanced clusters gives the same improvement factor.

Finally, we briefly address the case when the right number of clusters is not known to the clustering algorithm, so that $K$ is also unknown to the active learning scheme. In this case, only Explore is used while queries are allowed. Explore will keep discovering new clusters as fast as it can. When it has obtained all the clusters, it will not have any way of knowing this. However, from this point onwards, for every farthest-first point it draws from the dataset, it will always find a neighborhood that is must-linked to it. Hence, after discovering all the clusters, Explore will essentially consolidate the clusters too. However, when $K$ is known, it makes sense to invoke Consolidate since (1) it adds points to clusters at a faster rate than Explore, and (2) it picks random samples following the underlying data distribution, which have certain nice properties in terms of estimating good centroids (e.g., Chernoff bounds on the centroid estimates, as shown in (4.3)) that the samples obtained using farthest-first traversal need not have.
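To complement the Explore sketch above, here is a minimal Python sketch of the Consolidate phase described in Section 4.2, under the same illustrative assumptions (hypothetical `oracle(i, j)` callable and index-based interface); it is not the authors' implementation.

```python
import numpy as np

def consolidate(X, neighborhoods, max_queries, oracle, rng=None):
    """Sketch of the Consolidate phase: place random points into the existing
    neighborhoods using at most K-1 queries per point.  oracle(i, j) is the
    same hypothetical must-link/cannot-link callable used by Explore."""
    rng = rng or np.random.default_rng(0)
    queries = 0
    assigned = {i for nb in neighborhoods for i in nb}
    while queries < max_queries:
        candidates = [i for i in range(len(X)) if i not in assigned]
        if not candidates:
            break
        x = int(rng.choice(candidates))       # a random unassigned point
        # Sort neighborhoods by distance of their centroid from x, so that the
        # correct must-link neighborhood tends to be queried sooner.
        order = sorted(range(len(neighborhoods)),
                       key=lambda p: np.linalg.norm(
                           X[x] - np.mean(X[neighborhoods[p]], axis=0)))
        placed = False
        for p in order[:-1]:                  # at most K-1 queries for x
            if queries >= max_queries:
                return neighborhoods
            queries += 1
            if oracle(x, int(rng.choice(neighborhoods[p]))) == "must-link":
                neighborhoods[p].append(x)
                placed = True
                break
        if not placed:                        # all K-1 replies were cannot-link,
            neighborhoods[order[-1]].append(x)  # so x belongs to the last one
        assigned.add(x)
    return neighborhoods
```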

5 Experiments

In this section, we outline the details of our experiments.

5.1 Datasets In our experiments with high-dimensional text documents, we used datasets created from the 20 Newsgroups collection.³ It has messages collected from 20 different Usenet newsgroups, 1000 messages from each newsgroup. From the original dataset, a reduced dataset News-all20 was created by taking a random subsample of 100 documents from each of the 20 newsgroups; this subsample is a more difficult dataset to cluster than the original 20 Newsgroups, since each cluster has fewer documents. News-all20 has 2000 points in 16089 dimensions. By selecting 3 categories from the reduced dataset News-all20, two other datasets were created: News-sim3, which consists of 3 newsgroups on similar topics (comp.graphics, comp.os.ms-windows, comp.windows.x) with significant cluster overlap, and News-diff3, which consists of 3 newsgroups on different topics (alt.atheism, rec.sport.baseball, sci.space) with well-separated clusters. News-sim3 has 300 points in 3225 dimensions, while News-diff3 has 300 points in 3251 dimensions. Another dataset we used in our experiments is a subset of Classic3 [11] containing 400 documents: 100 Cranfield documents from aeronautical system papers, 100 Medline documents from medical journals, and 200 Cisi documents from information retrieval papers. This Classic3-subset dataset was specifically designed to create clusters of unequal size, and has 400 points in 2897 dimensions. Similarities between data points in the text datasets were computed using cosine similarity, following SPKMeans [11]. SPKMeans maximizes the average cosine similarity between points and cluster centroids, so that the objective function monotonically increases with every iteration till convergence. All the text datasets were pre-processed following the methodology of Dhillon et al. [11].

³ http://www.ai.mit.edu/people/jrennie/20Newsgroups

For experiments on low-dimensional data, we selected the UCI dataset Iris [5], which has 150 points in 4 dimensions. The Euclidean metric was used for computing distances between points in this dataset, following KMeans. In this case, the objective function, which is the average squared Euclidean distance between points and cluster centroids, decreases at every iteration till convergence. The Iris dataset was not pre-processed in any way.

5.2 Clustering evaluation We have used three metrics for cluster evaluation: normalized mutual information (NMI), pairwise F-measure, and the objective function.

Normalized mutual information (NMI) determines the amount of statistical information shared by the random variables representing the cluster assignments and the user-labeled class assignments of the data points. We computed NMI following the methodology of Strehl et al. [31]. NMI measures how closely the clustering algorithm could reconstruct the underlying label distribution in the data. If $X$ is the random variable denoting the cluster assignments of the points, and $Y$ is the random variable denoting the underlying class labels on the points, then the NMI measure is defined as

$$
\mathrm{NMI} = \frac{I(X; Y)}{\bigl(H(X) + H(Y)\bigr)/2}
$$

where $I(X;Y) = H(X) - H(X|Y)$ is the mutual information between the random variables $X$ and $Y$, $H(X)$ is the Shannon entropy of $X$, and $H(X|Y)$ is the conditional entropy of $X$ given $Y$.

Figure 4: Comparison of NMI values on News-sim3 (normalized mutual information vs. number of pairwise constraints; curves for PCKMeans, Seeded KMeans, Constrained KMeans, and their Active variants)

Pairwise F-measure is defined as the harmonic mean of pairwise precision and recall, the traditional information retrieval measures adapted for evaluating clustering by considering pairs of points: for every pair of points that do not have explicit constraints between them, the decision to cluster this pair into the same or different clusters is considered to be correct if it matches the underlying class labeling available for the points. Pairwise F-measure is defined as:

$$
\mathrm{Precision} = \frac{\#\text{PairsCorrectlyPredictedInSameCluster}}{\#\text{TotalPairsPredictedInSameCluster}}, \qquad
\mathrm{Recall} = \frac{\#\text{PairsCorrectlyPredictedInSameCluster}}{\#\text{TotalPairsInSameClass}}
$$

$$
\mathrm{F\text{-}measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
$$

Figure 5: Comparison of pairwise F-measure values on News-sim3 (pairwise F-measure vs. number of pairwise constraints; same algorithms as Figure 4)
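Both measures can be computed directly from the cluster assignments and the underlying class labels. The Python sketch below is our illustration of the definitions above (with hypothetical label sequences and constrained-pair list; not the authors' evaluation code): NMI uses the normalization given above, and the pairwise F-measure skips explicitly constrained pairs.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def nmi(cluster_labels, class_labels):
    """Normalized mutual information: I(X;Y) / ((H(X) + H(Y)) / 2)."""
    n = len(cluster_labels)
    def entropy(labels):
        return -sum((c / n) * np.log2(c / n) for c in Counter(labels).values())
    joint = Counter(zip(cluster_labels, class_labels))
    px, py = Counter(cluster_labels), Counter(class_labels)
    mi = sum((c / n) * np.log2((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in joint.items())
    return mi / ((entropy(cluster_labels) + entropy(class_labels)) / 2)

def pairwise_f_measure(cluster_labels, class_labels, constrained_pairs=()):
    """Harmonic mean of pairwise precision and recall, computed over all
    point pairs that had no explicit constraint between them."""
    skip = {frozenset(p) for p in constrained_pairs}
    tp = fp = fn = 0
    for i, j in combinations(range(len(cluster_labels)), 2):
        if frozenset((i, j)) in skip:
            continue
        same_cluster = cluster_labels[i] == cluster_labels[j]
        same_class = class_labels[i] == class_labels[j]
        tp += same_cluster and same_class      # pair correctly placed together
        fp += same_cluster and not same_class  # placed together, but different classes
        fn += same_class and not same_cluster  # same class, but split apart
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```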
