Learning Distance Functions - University Of Texas At Austin

Learning Distance Functions
Xin Sui
CS395T Visual Recognition and Search
The University of Texas at Austin

Outline
- Introduction
- Learning one Mahalanobis distance metric
- Learning multiple distance functions
- Learning one classifier-represented distance function
- Discussion Points

Distance Function vs. Distance Metric
- Distance metric: satisfies non-negativity, symmetry, and the triangle inequality
- Distance function: may not satisfy one or more of the requirements for a distance metric; more general than a distance metric

Constraints
- Pairwise constraints
  - Equivalence constraints: image i and image j are similar
  - Inequivalence constraints: image i and image j are not similar
  (Figure: red lines denote equivalence constraints, blue lines denote inequivalence constraints)
- Triplet constraints: image j is more similar to image i than image k is
Constraints are the supervised knowledge for distance learning methods.

Why not labels?
- Sometimes constraints are easier to obtain than labels
- Example: faces extracted from successive frames of a video, in roughly the same location, can be assumed to come from the same person

Why not labels?
- Sometimes constraints are easier to obtain than labels
- Distributed teaching: constraints are given by teachers who do not coordinate with each other
  (Figure: constraint sets given by teachers T1, T2, and T3)

Why not labels?
- Sometimes constraints are easier to obtain than labels
- Search engine logs: results clicked for the same query can be assumed more similar to each other than to results that were not clicked
  (Figure: clicked vs. not-clicked results imply "more similar" relations)

Problem
- Given a set of constraints
- Learn one or more distance functions over the input space that preserve the distance relations among the training data pairs

Importance
- Many machine learning algorithms rely heavily on the distance function over the input data patterns, e.g. kNN
- The learned functions can significantly improve performance in classification, clustering, and retrieval tasks, e.g. kNN classification, spectral clustering, content-based image retrieval (CBIR)

Outline
- Introduction
- Learning one Mahalanobis distance metric
  - Global methods
  - Local methods
- Learning multiple distance functions
- Learning one classifier-represented distance function
- Discussion Points

Parameterized Mahalanobis Distance Metric
d_A(x, y) = sqrt((x − y)^T A (x − y))
x, y: the feature vectors of two objects, for example a bag-of-words representation of an image

Parameterized Mahalanobis Distance Metric
To be a metric, A must be positive semi-definite

Parameterized Mahalanobis Distance Metric
It is equivalent to finding a rescaling of the data that replaces each point x with A^(1/2) x and then applying the standard Euclidean distance

Parameterized Mahalanobis Distance Metric
- If A = I, this reduces to the Euclidean distance
- If A is diagonal, this corresponds to learning a metric in which the different axes are given different "weights"
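
A minimal sketch (not from the slides) of the parameterized Mahalanobis distance and the equivalent rescaling x -> A^(1/2) x described above, assuming A is positive semi-definite; the function names are illustrative only.

    import numpy as np

    def mahalanobis(x, y, A):
        d = x - y
        return np.sqrt(d @ A @ d)        # d_A(x, y) = sqrt((x - y)^T A (x - y))

    def rescale(X, A):
        # A^(1/2) via eigendecomposition of the symmetric PSD matrix A
        vals, vecs = np.linalg.eigh(A)
        A_half = vecs @ np.diag(np.sqrt(np.maximum(vals, 0.0))) @ vecs.T
        return X @ A_half.T              # Euclidean distances on the rescaled rows equal d_A on X

With A = I this reduces to the Euclidean distance, and a diagonal A simply reweights the axes, matching the two cases above.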

Global Methods
- Try to satisfy all the constraints simultaneously
- Keep all the data points within the same class close, while separating all the data points from different classes

Distance Metric Learning, with Application to Clustering with Side-information [Xing et al., 2002]

A Graphical View
(a) Data distribution of the original dataset; (b) data scaled by the global metric
Keep all the data points within the same class close; separate all the data points from different classes
(figure from [Xing et al., 2002])

Pairwise Constraints
- A set of equivalence constraints S: pairs (x_i, x_j) known to be similar
- A set of inequivalence constraints D: pairs (x_i, x_j) known to be dissimilar

The Approach
- Formulated as a constrained convex programming problem
  - Minimize the distances between the data pairs in S
  - Subject to the data pairs in D being well separated (this constraint ensures that A does not collapse the dataset to a single point)
- Solved with an iterative gradient ascent algorithm
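
The slide above only outlines the optimization; the following is a rough sketch of the diagonal-A case, under the assumption (as in [Xing et al., 2002]) that minimizing sum_S ||x_i − x_j||_A^2 − log(sum_D ||x_i − x_j||_A) over a non-negative diagonal A approximates the constrained problem. It is illustrative only, not the authors' actual implementation.

    import numpy as np

    def learn_diag_metric(X, S, D, lr=0.01, iters=1000):
        """X: (n, d) data; S, D: lists of (i, j) index pairs (similar / dissimilar)."""
        d = X.shape[1]
        a = np.ones(d)                                   # diagonal entries of A
        dS = np.array([X[i] - X[j] for i, j in S])       # differences over similar pairs
        dD = np.array([X[i] - X[j] for i, j in D])       # differences over dissimilar pairs
        for _ in range(iters):
            grad_sim = (dS ** 2).sum(axis=0)             # gradient of the similar-pair term
            dist_D = np.sqrt((dD ** 2) @ a + 1e-12)      # Mahalanobis distances of dissimilar pairs
            grad_dis = ((dD ** 2) / (2 * dist_D[:, None])).sum(axis=0) / dist_D.sum()
            a -= lr * (grad_sim - grad_dis)              # descent step; step size may need tuning
            a = np.maximum(a, 0.0)                       # project back onto a >= 0
        return np.diag(a)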

Another example
(a) Original data; (b) rescaling by the learned diagonal A; (c) rescaling by the learned full A
(figure from [Xing et al., 2002])

RCA: Learning a Mahalanobis Metric from Equivalence Constraints [Bar-Hillel, et al, 2005]

RCA (Relevant Component Analysis): Basic Ideas
- Changes the feature space by assigning large weights to "relevant dimensions" and low weights to "irrelevant dimensions"
- These "relevant dimensions" are estimated using equivalence constraints

Another view of equivalence constraints: chunklets
- Chunklets are formed by applying transitive closure to the equivalence constraints
- Estimate the within-class covariance from the chunklets
  - Dimensions corresponding to large within-chunklet covariance are not relevant
  - Dimensions corresponding to small within-chunklet covariance are relevant
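
A small helper sketch (illustrative, not from the paper) that forms chunklets from pairwise equivalence constraints by taking the transitive closure with union-find, as described above.

    def chunklets_from_constraints(n_points, equiv_pairs):
        parent = list(range(n_points))
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]            # path compression
                a = parent[a]
            return a
        for i, j in equiv_pairs:
            parent[find(i)] = find(j)                    # merge the two components
        groups = {}
        for i in range(n_points):
            groups.setdefault(find(i), []).append(i)
        return [idx for idx in groups.values() if len(idx) > 1]   # keep true chunklets only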

Synthetic Gaussian data
(a) The fully labeled data set with 3 classes; (b) the same data unlabeled; the class structure is less evident; (c) the set of chunklets provided to the RCA algorithm; (d) the centered chunklets and their empirical covariance; (e) the RCA transformation applied to the centered chunklets; (f) the original data after applying the RCA transformation
(figure from [Bar-Hillel, et al, 2005])

RCA Algorithm
- Sum of within-chunklet covariance matrices for p points in k chunklets:
  C = (1/p) Σ_{j=1..k} Σ_{i=1..n_j} (x_ji − m_j)(x_ji − m_j)^T,  where chunklet j = {x_ji}_{i=1..n_j} with mean m_j
- Compute the whitening transformation W = C^(-1/2) associated with C, and apply it to the data points: X_new = W X
  (The whitening transformation W assigns lower weights to directions of large variability)
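
A minimal sketch of the computation on this slide, assuming chunklets are given as lists of row indices into X (e.g. the output of the union-find helper above); it is a plain restatement of the formula, not the authors' code.

    import numpy as np

    def rca_whitening(X, chunklets):
        d = X.shape[1]
        p = sum(len(idx) for idx in chunklets)           # total number of chunklet points
        C = np.zeros((d, d))
        for idx in chunklets:
            Xj = X[idx]
            centered = Xj - Xj.mean(axis=0)              # subtract the chunklet mean m_j
            C += centered.T @ centered                   # within-chunklet scatter
        C /= p
        vals, vecs = np.linalg.eigh(C)                   # whitening transformation W = C^(-1/2)
        W = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
        return X @ W.T                                   # X_new = W X, applied row by row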

Applying RCA to faces
Top: facial images of two subjects under different lighting conditions. Bottom: the same images from the top row after applying PCA and RCA and then reconstructing the images.
RCA dramatically reduces the effect of different lighting conditions, and the reconstructed images of each person look very similar to each other.
[Bar-Hillel, et al, 2005]

Comparing Xing's method and RCA
Xing's method:
- Uses both equivalence and inequivalence constraints
- The iterative gradient ascent algorithm leads to a high computational load and is sensitive to parameter tuning
- Does not explicitly exploit the transitivity property of positive equivalence constraints
RCA:
- Uses only equivalence constraints
- Explicitly exploits the transitivity property of positive equivalence constraints
- Low computational load
- Empirically shown to be similar to or better than Xing's method on UCI data

Problems with Global Methods
- Satisfying some constraints may conflict with satisfying other constraints

Multimodal data distributions
(a) Data distribution of the original dataset; (b) data scaled by the global metric
Multimodal data distributions prevent global distance metrics from simultaneously satisfying constraints on within-class compactness and between-class separability.
[Yang, et al, AAAI, 2006]

Local Methods
- Do not try to satisfy all the constraints, but only the local constraints

LMNN: Large Margin Nearest Neighbor Based Distance Metric Learning [Weinberger et al., 2005]

K-Nearest Neighbor Classification
We only care about the nearest k neighbors

LMNN
Learns a Mahalanobis distance metric which:
- Enforces that the k nearest neighbors of each input belong to the same class
- Enforces that examples from different classes are separated by a large margin

Approach
- Formulated as an optimization problem
- Solved using a semi-definite programming method

Cost Function
Distance function: D(x_i, x_j) = ||L(x_i − x_j)||^2
Another form of the Mahalanobis distance: D(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j), with M = L^T L

Cost Function
- Target neighbors: identified as the k nearest neighbors, determined by Euclidean distance, that share the same label as the input
- (Illustration: when k = 2, the target-neighbor indicator η_ij is 1 for the two target neighbors of each input and 0 elsewhere)

Cost Function
The pull term penalizes large distances between inputs and their target neighbors; in other words, it makes similar neighbors close

Cost Function
ε(L) = Σ_ij η_ij ||L(x_i − x_j)||^2 + c Σ_ijl η_ij (1 − y_il) [ 1 + ||L(x_i − x_j)||^2 − ||L(x_i − x_l)||^2 ]_+

Cost Function (annotated)
- η_ij is equal to 1 for an input x_i and its target neighbors x_j
- y_il indicates whether x_i and x_l have the same label, so (1 − y_il) is equal to 1 for an input and neighbors having different labels
- ||L(x_i − x_j)||^2 is the distance between an input and a target neighbor
- ||L(x_i − x_l)||^2 is the distance between the input and neighbors with different labels

Cost Function
Differently labeled neighbors lie outside the smaller radius with a margin of at least one unit of distance
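
To make the two terms concrete, here is a small sketch that evaluates the LMNN cost for a fixed linear map L; it is a brute-force restatement of the formula above, not the semi-definite program solved in [Weinberger et al., 2005], and the target_neighbors structure is an assumed input.

    import numpy as np

    def lmnn_cost(L, X, labels, target_neighbors, c=1.0):
        """L: (d, d) map; X: (n, d) inputs; labels: (n,) classes;
        target_neighbors: dict mapping i -> indices of its k target neighbors."""
        Z = X @ L.T                                      # transformed inputs
        pull, push = 0.0, 0.0
        for i, neighbors in target_neighbors.items():
            for j in neighbors:
                d_ij = np.sum((Z[i] - Z[j]) ** 2)        # distance to a target neighbor
                pull += d_ij
                for l in range(len(X)):
                    if labels[l] == labels[i]:
                        continue                         # (1 - y_il) = 0 for same-label points
                    d_il = np.sum((Z[i] - Z[l]) ** 2)    # distance to a differently labeled point
                    push += max(0.0, 1.0 + d_ij - d_il)  # hinge: margin of one unit of distance
        return pull + c * push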

Test on Face Recognition
Images from the AT&T face recognition database, kNN classification (k = 3)
- Top row: an image correctly recognized with Mahalanobis distances, but not with Euclidean distances
- Middle row: correct match among the k = 3 nearest neighbors according to Mahalanobis distance, but not Euclidean distance
- Bottom row: incorrect match among the k = 3 nearest neighbors according to Euclidean distance, but not Mahalanobis distance
[Weinberger et al., 2005]

ILMNN: An Invariant Large Margin Nearest Neighbor Classifier [Mudigonda, et al, 2007]

Transformation Invariance
An image remains the same object after a rotation transformation or a thickness transformation; when doing classification, the classifier needs to regard the two images as the same image.
(Figure from [Simard et al., 1998])

ILMNN
- An extension to LMNN [Weinberger et al., 2005]
- Adds regularization to LMNN to avoid overfitting
- Incorporates invariance using polynomial transformations (such as Euclidean, similarity, and affine transformations, commonly used in computer vision)

The green diamond is the test point. (a) Trajectories defined by rotating the points by an angle −5 ≤ θ ≤ 5; (b) the mapped trajectories after learning
[Mudigonda, et al, 2007]

Outline
- Introduction
- Learning one Mahalanobis distance metric
- Learning multiple distance functions
- Learning one classifier-represented distance function
- Discussion Points

Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification [Frome, et al, 2007]
The slides are adapted from Frome's talk at ICCV 2007 (http://www.cs.berkeley.edu/~afrome/papers/iccv2007_talk.pdf)

Globally-Consistent Local Distance Functions [Frome, et al, 2007]
- Previous methods learn only one distance function for all images, while this method learns one distance function for each image
- From this perspective, it is a local distance function learning method, while all the previous methods are global

Using triplet constraints

Patch-based features
- Different images may have different numbers of features
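
Because every image keeps its own weights over its own patch features, a per-focal-image learner can be sketched roughly as follows. This is only a schematic reading of the slides, not the authors' max-margin formulation; elem_dists (elementary patch-to-image distances) and the triplet format are assumptions.

    import numpy as np

    def learn_focal_weights(elem_dists, triplets, lr=0.1, epochs=50):
        """elem_dists: dict image id -> length-m vector of elementary distances
        from the focal image; triplets: (j, k) pairs meaning image j is more
        similar to the focal image than image k."""
        m = len(next(iter(elem_dists.values())))
        w = np.zeros(m)                                  # one weight per patch feature of the focal image
        for _ in range(epochs):
            for j, k in triplets:
                d_sim, d_dis = elem_dists[j], elem_dists[k]
                if w @ d_dis < w @ d_sim + 1.0:          # violated: want w.d_dis >= w.d_sim + 1
                    w += lr * (d_dis - d_sim)            # subgradient step on the hinge loss
                w = np.maximum(w, 0.0)                   # keep the distance non-negative
        return w                                         # learned distance: D(I) = w . elem_dists[I]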

(Figures from [Frome, et al, 2007])

Good Results

Bad Results

Summary
- Extremely local, with more ability to learn a good distance function for a complex feature space
- Too many weights to learn
- Too many constraints

Outline
- Introduction
- Learning one Mahalanobis distance metric
- Learning multiple distance functions
- Learning one classifier-represented distance function
- Discussion Points

DistBoost
T. Hertz, A. Bar-Hillel and D. Weinshall, Learning Distance Functions for Image Retrieval, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004 [Hertz, et al, 2004]

DistBoost
- The distance function maps a pair of inputs to a value in [0, 1] and can be seen as a binary classifier (learned with AdaBoost)
- The constraints are the labeled training examples for the classifier
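
A minimal sketch of this "distance as a binary classifier" view, not the actual DistBoost algorithm of [Hertz, et al, 2004]: constrained pairs become labeled training examples, a boosted classifier is trained on a simple pair representation (here the element-wise difference |x_i − x_j|, an assumption), and its predicted probability of "dissimilar" serves as a distance in [0, 1].

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier

    def train_pair_classifier(X, similar_pairs, dissimilar_pairs):
        feats, labels = [], []
        for i, j in similar_pairs:
            feats.append(np.abs(X[i] - X[j])); labels.append(0)   # equivalence constraint
        for i, j in dissimilar_pairs:
            feats.append(np.abs(X[i] - X[j])); labels.append(1)   # inequivalence constraint
        clf = AdaBoostClassifier(n_estimators=50)
        clf.fit(np.array(feats), np.array(labels))
        return clf

    def learned_distance(clf, x, y):
        # probability of the "dissimilar" class, a value in [0, 1]
        return clf.predict_proba(np.abs(x - y).reshape(1, -1))[0, 1]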

(Figure from [Hertz, PhD Thesis, 2006])

Results
Each row presents a query image and its first 5 nearest neighbors, comparing DistBoost and the normalized L1 CCV distance.

Summary
- Another view of distance function learning
- A global method, since it tries to satisfy all the constraints
- Can learn non-linear distance functions

Discussion Points
- Currently most of the work focuses on learning linear distance functions; how can we learn non-linear distance functions?
- Is learning one distance function for every image really good? Will it lead to overfitting? Should we learn a higher-level distance function?
- The number of triplet constraints is huge for [Frome, et al, 2007]; how can the triplet selection method be improved?

References
[Hertz, et al, 2004] T. Hertz, A. Bar-Hillel and D. Weinshall. Learning Distance Functions for Image Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004.
[Hertz, PhD Thesis, 2006] T. Hertz. Learning Distance Functions: Algorithms and Applications. PhD thesis, Hebrew University, 2006.
[Bar-Hillel, et al, 2005] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis Metric from Equivalence Constraints. Journal of Machine Learning Research (JMLR), 2005.
[Frome, et al, 2007] A. Frome, Y. Singer, F. Sha, J. Malik. Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007.
[Mudigonda, et al, 2007] P. Mudigonda, P. Torr, and A. Zisserman. Invariant Large Margin Nearest Neighbor Classifier. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2007.
[Yang, et al, 2006] L. Yang. Distance Metric Learning: A Comprehensive Survey. Michigan State University, 2006.
[Yang, et al, AAAI, 2006] L. Yang, R. Jin, R. Sukthankar, Y. Liu. An Efficient Algorithm for Local Distance Metric Learning. In Proceedings of AAAI (oral presentation), 2006.
[Weinberger et al., 2005] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification. In NIPS, 2005.
[Xing et al., 2002] E. Xing, A. Ng, and M. Jordan. Distance Metric Learning with Application to Clustering with Side-Information. In NIPS, 2002.
[Simard et al., 1998] P. Simard, Y. LeCun, J. Denker, and B. Victorri. Transformation Invariance in Pattern Recognition, Tangent Distance and Tangent Propagation. In G. Orr and K. Müller, editors, Neural Networks: Tricks of the Trade. Springer, 1998.
