Vignette: Reimagining The Analog Photo Album

Transcription

Vignette: Reimagining the Analog Photo AlbumDavid Eng, Andrew Lim, Pavitra Rengarajan1 Abstractprincipal components to consider. We letX {x(1) , x(2) , ., x(n) } be the training set of face images,where x(i) Rd .Although the smartphone has emerged as the mostconvenient device on which to capture photos, it lacks thetools to effectively view and organize these photos. Giventhe casual nature of smartphone photography, we propose asystem that can streamline and simplify the post-captureprocess. We propose Vignette, a content managementsystem for smartphone photography that improves thevisual story by automating the organization of a user'sphoto album. We hope to provide dynamic organization ofphoto albums that supports user queries including requestsfor images of a particular person or genre. The main taskwe consider is clustering photos of a particular individual.1. Compute the mean of X.µ 1nPni 1x(i)2. Compute the covariance matrix of X.P(i) µ)(x(i) µ)TS n1 ni 1 (x3. Solve for the eigenvalues λ(i) and eigenvectors v (i) ofX.Sv (i) λ(i) v (i) , i 1, 2, ., n2 IntroductionThe basic structure of the system involves a pipeline thatfirst sorts a photo album into images those that containspeople and those do not. Further processing is then carriedout on the photos of people to cluster the images intosub-albums with photos of a particular person. This wouldallow the user to perform some basic queries over what wasa previously untagged set of photos.4. Choose the eigenvectors that correspond to the klargest eigenvalues.5. We project the test images into the PCA subspace anduse the features in this reduced dimensional space totrain a multiclass SVM.5.2 FisherfacesFor the purposes of comparison to PCA using the Eigenfacesalgorithm, we consider a second dimensionality reductiontechnique based on linear discriminant analysis (LDA). Wenow describe the Fisherface algorithm. Let us assume thatour training set contains images of n different people. Welet X {X1 , X2 , ., Xn }, where Xi is a set of images ofperson i. We compute the total mean of all the images:PPµ Pn 1 Xi ni 1x Xi x.Figure 1: Photo Organization Pipelinei 1We consider a pipeline that involves both unsupervised andsupervised learning. Early on, the system has no notion ofany labels. Thus, on the arrival of a new image, we performunsupervised clustering to add it to either an existingcluster or a new cluster. At some point, however, there maybe a sufficient number of different clusters with high enoughdensity that we could use as labeled training data for asupervised facial recognition algorithm.We also compute a between-class scatter matrix defined as:PTSB ni 1 Xi (µi µ)(µi µ)and a within-class scatter matrix defined as:PPTSW ni 1x Xi (x µi )(x µi )The Fisherface algorithm then solves the followingoptimization to compute a projection Wopt that maximizesthe class separability.4 Face DetectionWopt arg maxWAt the moment, we have focused more on the facialrecognition and clustering task. As our face detector, we usea Haar feature-based cascade classifier trained on variousfacial features including the position and shapes of eyes, thenose, and facial outline. W T SB W W T SW W [w1 w2 .wk ]The xi ’s correspond to the generalized eigenvectors of SBand SW that correspond to the k largest eigenvalues. Wenote that there can be at most n 1 nonzero generalizedeigenvalues. Thus, we have an upper bound on k of n 1.5 Feature ExtractionWe note, however, that the within-class scatter matrix Swwill always be singular. Thus, we perform PCA andreformulate the optimization problem as:5.1 EigenfacesWe consider the Eigenface algorithm, which performs aPrincipal Component Analysis (PCA) on the training set offace images to generate a set of basis vectors. Given a pixelgrid with a face image, the Eigenface algorithm models facesas a composition of different Eigenfaces, which we discuss indetail below. We now describe the Eigenface algorithm,which is parametrized by k where k is the number ofWP CA arg maxW W T ST W WF LD arg maxWT W T WPCA SB WP CA W T W T WPS WP CA W CA WThe transformation matrix W is given by:W WFTLD WPTCA1

5.3 Gabor Waveletsseveral manually annotated or tagged photos.To provide a computationally inexpensive means to extractrelevant features from our image, we applied various Gaborwavelets, a selective filter for both scale and orientation ofthe face. Since it convolves a Gaussian kernel and sinusoidFFT, the filter is parameterized by the Gaussian σ, sinusoidphase φ, orientation w, and wavelength λ.We took a variety of approaches to clustering these filteredimages. Let us assume that our training set contains nimages of different people.6.1 k-means Clustering6.1.1 Model Selection with CV using AICOne metric we consider for determining the number ofclusters (used as a parameter) for k-means clustering in theset of images is a modified form of the Akaike InformationCriterion (AIC). We note that the reconstruction cost is amonotonically decreasing function in k, which is minimizedwhen k n. In other words, if we minimize reconstructioncost, the optimal clustering will place each image in its owncluster, which is clearly not desirable. Using AIC, however,we can impose a penalty for each additional cluster topenalize more complex models. Cross-validation using AICoptimizes:1. Generate g Gabor kernels by varying the values ofσ, φ, w, and λ.2. For each image 1 to n: Represent the image as a list ofg convolutions of each Gabor kernel with the originalimage. To reduce the feature space, we also attemptedto represent the image as a list of g means andvariances, based on least squared error for simplicity.k arg mink RC(k) λkWe note that a larger value of λ will favor solutions withfewer clusters. Using the AIC metric, we have that λ 2M ,where M is the number of features used to represent animage. For our application, however, we have found thatexperimentally using λ 0.05M yields a more appropriatepenalty term due to the larger feature space R10000 used torepresent the pixel grid of an image.3. During k-means clustering, centroids take on thedimensionality of the filtered images.6.1.2 Model Selection with Tuned RegularizationTermSince the cost function incurred by k-means monotonicallydecreases with the number of clusters, we append thefollowing tuned regularization term which varies with thenumber of clusters:R(k) θT φ(k) f or φ(k) [1 k k2 ].Figure 2: Gabor Filters & Feature ExtractionTo tune the parameters θ for our regularization term, wefirst selected 100 random samples of 10 images from ourdataset of images. Therefore, the correct number of clustersin each of these 100 examples ranged from 1 to 10. Then, wedefined the following objective function to minimize bystochastic gradient descent:P2211minθ mi 1 2 arg mink (C(k) R(k)) ki ) 2 λ θ 25.4 Neural NetworksIn addition to the more engineered feature extractiontechniques discussed, recent work has shown the validity ofapplying the Deep Learning framework to vision tasks. Wenow consider how the Deep Learning framework can beapplied to the face recognition problem in Facebook'sDeepFace system. DeepFace feature extraction pipelineconsists of two phases: an alignment phase and arepresentation phase. The alignment phase involves explicit3D modeling of the face based on fiducial points in order towarp an input facial image into a cropped 3D frontal mode,allowing DeepFace to learn from raw pixel RGB values. Thenetwork architecture of DeepFace is nine-layers deep andinvolves more than 120 million parameters. The first threelayers of the neural network are used to extract low-levelfeatures such as edges and texture. The middle layers arelocally connected and apply various filters to differentregions based on local patterns. The final layers of theneural network are fully connected and are able to capturecorrelations between features in distant parts of the images.The output of running an image through the neural networkis then used as the feature representation [6].However, the presence of the argmin resulted in adiscontinuous function for which the gradient could not becomputed. Therefore, we reformulated our objectivefunction to a continuous function minimized at the same θ:P2211minθ mi 1 2 (mink [(C(k) R(k)) (C(ki ) R(ki ))] 2 λ θ 2With this objective function, we applied stochastic gradientdescent on the set of examples to tune our parameters θ.6.1.3 Unsupervised Clustering with EigenfacesAs an unsupervised clustering approach, we considerk-means with Eigenfaces. We first use the PCA algorithmknown as Eigenfaces, described in Section 5.1, to extractprincipal components from a set of examples images that areindependent of the actual images we wish to cluster. In ourapplication, we use 7000 images from the Labeled Faces inthe Wild face database to extract the eigenfaces that can beconsidered as representative of the space of all possiblefaces. We then project the face images that we wish tocluster into the space specified by the extracted eigenfaces.6 Unsupervised ClusteringIn order to provide the best user experience, we will beginwith unsupervised facial clustering techniques in order toprovide automatic photo organization without requiring2

system, has accurately clustered the previous stream ofimages, we can use the cluster assignments as labels tocreate a training set for supervised learning. We haveconsidered two canonical supervised face recognitionalgorithms, namely Eigenfaces and Fisherfaces. ForEigenfaces, we use an SVM with eigenface features, and forFisherfaces, we extract fisherfaces features and run nearestneighbors.Thus, each image is reduced to a feature vector of weightsrepresenting the contribution of each eigenface to theparticular image. We can then run the k-means algorithmon the images using the reduced dimension feature vectors.6.2 LBPWe also considered an approach using the local binarypatterns histograms (LBPH) algorithm, which extracts localfeatures of the image and is rooted in two-dimensionaltexture analysis.8 Results6.2.1 ModelWe now describe the model of the LBPH algorithm. Thealgorithm functions by comparing each pixel with itsneighborhood; if the center pixel'intensity is greater than orequal to that of its neighbors, then denote this relationshipwith a 0. Otherwise, denote it with a 1. More formally, thiscan be written as follows:P 1 PLBP (xc , yc ) Pp 0 2 s(ip ic ),8.1 Model SelectionWe first begin by considering the efficacy of model selectionfor k-means clustering using AIC.where (xc , yc ) is the central pixel with intensity ic , with inbeing the intensity of the neighbor pixel. Then, s would bethe sign function defined as follows: 1if x 0s(x) 0otherwiseFigure 3: k-means Cross-Validation Cost with AICThe LBP image is divided into m local regions (in ourimplementation, the LBP image is divided into 8x8 localregions), and a histogram is extracted from each; thespatially enhanced feature vector is obtained byconcatenating the local histograms.We find that running cross-validation with the modified costfunction based on AIC generates representative models forthe number of clusters in the training image set. We run thecross-validation algorithm described in Section 6.1.1 on twotraining sets. The training sets are generated from randomlysampling 70 percent of the AT&T and Yale face datasets.The AT&T dataset consists of 40 subjects and the Yaledataset consists of 15 subjects. Model selection using themodified cost function based on AIC estimates 36 subjectsfor the AT&T dataset and 18 subjects for the Yale dataset.We thus find that the AIC provides a reasonable and simpleheuristic for initially tuning the unsupervised clustering ofthe initial photo album.6.2.2 AlgorithmWe employ the following iterative algorithm:1. The first face image is used to initialize the firstcluster.2. We manually preset a threshold for a confidence valueat which a new cluster is created.3. Subsequent images are then run through the facerecognizer. If the confidence value is greater than ourthreshold, we instantiate a new cluster, otherwise weupdate the existing clusters.8.2 Unsupervised ClusteringTo gain a better understanding of the efficacy of the LBPHalgorithm, we consider two primary metrics: homogeneity,which describes the purity of a cluster, and completeness,which describes the purity of a class. We first consider thehomogeneity and completeness scores when performing 30trials of entirely unsupervised learning (with a training setof size 1) and testing on 50, 75, 100, 150, and 200 imagesfrom the AT&T Face Dataset.4. The images within each cluster that obtained thehighest confidence are considered representativesamples for the clusters and are used to re-initializethe face recognizer model.5. The algorithm repeats steps 2-5 for a few iterationsuntil convergence.Since the LBP algorithm extracts local features, we notethat it is quite robust against monotonic gray scaletransforms and thus limits the effects of confounding factorssuch as lighting.7 Supervised RecognitionWe imagine that at some point, once the system hasclustered a fair number of face images, the previouslyunsupervised task of facial clustering could be transitionedinto one of supervised facial recognition. Assuming that theFigure 4: LBP Evaluation with Train Size 1 onAT&T Face Dataset3

k-means clustering with eigenfaces, we see that thehomogeneity and completeness scores are slightly higherwith LBP for the AT&T and Yale Face Datasets, likely dueto the algorithm's robustness against monotonic grayscaletransformations.When performing entirely unsupervised clustering, we wereable to achieve a high homogeneity score of 0.873 andcompleteness score of 0.825. We now consider the option inwhich we have the user correctly tag a small subset ofphotos; when performing 30 trials of LBP with a trainingset of size 25 correctly tagged images and testing on 50, 75,100, 150, and 200 different images from the AT&T FaceDataset, we obtain the homogeneity and completenessscores pictured in the plot below.8.3 Supervised ClusteringWe first evaluate the eigenfaces algorithm. From the AT&TFace Database of 400 images containing 40 differentsubjects, we uniformly sample 70% of the images toconstruct our training data set and we test on the remaining30% of the images. Preliminary results using a Gaussiankernel yield a precision of 96% and a recall of 93%. Weperform a similar evaluation of Fisherfaces as thatperformed on Eigenfaces. From the AT&T Face Database of400 images containing 40 different subjects, we uniformlysample 70% of the images to construct our training data setand we test on the remaining 30% of the images.Preliminary results yield a classification accuracy of 89%.Figure 5: LBP Evaluation with Train Size 25 onAT&T Face DatasetWe see a general increase in LBP efficacy with a largerdataset, after which the scores seem to plateau. Asexpected, when correctly tagging 25 images, LBP performsachieves higher homogeneity and completeness scores; wewere able to achieve a high homogeneity score of 0.914 andcompleteness score of 0.877.Figure 7: 4 Eigenfaces & 4 Fisherfaces generatedfrom AT&T Face DatabaseAs a means of unsupervised learning, we consider how thek-means algorithm performs on the Yale face dataset usingdifferent features.For the supervised task of face recognition, we consideredEigenfaces and Fisherfaces. The experimental setupconsidered the AT&T Face Dataset and compared theEigenface and Fisherface features on a hold out set thatcomprised a random sampling of 30 percent of the dataset.Figure 6: k-means Baseline vs. Eigenface FeaturesWe assume that during the model selection phase that weare able to determine the correct number of subjects in thedataset. The results from the cross-validation experimentwith AIC show that this assumption is not unreasonable.We then consider using k-means with Eigenface features.Using Eigenfaces, we are able to improve the clustering to amaximum homogeneity of 0.799 and completeness of 0. The400 Eigenface features selected are representative of theLFW dataset and are assumed to generalize to the facialimages that comprise the Yale dataset.When we compare the LBP algorithm with unsupervisedFigure 8: Eigenfaces vs. Fisherfaces ComparisonWe note that the Fisherface accuracy plateaus after 40features. This results from the fact that the AT&T datasetcontains images of only 40 unique subjects and Fisherfacesuses a number of features that is at most the number ofdistinct labels. We decided to consider the Eigenface SVMfor our supervised experiments since its high-mark accuracy4

of confidence when evaluating whether a new photo belongsto an existing cluster. In our SVM, we note that the penaltyparameter term C is weighted such that the penalty term isinversely proportional to class frequencies.of 94.167% is higher than that of Fisherface, which is89.1667%. To explore whether it would be possible totransition to supervised learning with inaccurate clusters,we randomly mislabel 10 percent of the dataset.9 LiteratureTo date, Facebook's DeepFace algorithm, as described inSection 5.4, is one of the most accurate face recognitionalgorithms, achieving 97.35% accuracy on the Labeled Facesin the Wild (LFW) dataset [7]. Prior to Facebook'sDeepFace algorithm, the most successful system using alarge labeled face dataset adapted a joint generativeBayesian model learned on a dataset containing 99,773images from 2,995 different subjects to the LFW imagedomain [6]. Another approach involves the use of amulti-task deep convolutional neural network whichsimultaneously learns the face-nonface decision, the facepose estimation problem, and the facial landmarklocalization problem (locating major features or landmarksof a face) [2]. Multi-task learning was applied to the neuralnetwork using a shared representation for the differentproblems by learning multiple targets and making themshare the common lower layers. On the unsupervised front,a state-of-the-art algorithm that has been used for facialclustering is Over-Complete Local Binary Patterns(OCLBP), a multi-scale modified version of the LBPalgorithm [4]. While LBP capitalizes on the locality of itsfeature extraction, OCLBP is computed with overlappingblocks and improves the robustness of classification systemsby using richer descriptors.Figure 9: SVM with Eigenface Features on NoisyDatasetWe note that 10 percent noise is less than what would begenerated by the unsupervised clustering algorithms weconsidered. Thus, it offers a reasonable baseline to considerwhether supervised recognition on noisy clusters is feasible.The result of training on a randomly sampled set of 70percent of the training data set and testing on theremaining 30 percent peaks at about an F1-Score of 60percent. The F1-Score is the harmonic mean of precisionand recall and we use it as a measure of a test's accuracy.From this result, we conclude that with the existingtechniques considered, we cannot perform supervisedrecognition on noisy data. Alternatively, the user couldprovide the necessary feedback to correct the clusters.10 Future WorkFor the purposes of this project, we have focused primarilyon the task of clustering photos that contain people; in thefuture, we would love to expand on this work by supportinguser queries over non-people photos as well and clusteringon criteria such as genre. We are also interested ininvestigating the tradeoffs between k-means clustering andLBP clustering for online unsupervised learning, using theSVM as an additional measure of confidence. Furthermore,we would like to attempt to extract other features includingedges or corners. Lastly, we would like to experiment withlocal PCA and local ICA as well.To measure the performance of the SVM trained onEigenface features, we trained on a randomly sampled set of70 percent of the Yale data set and tested on the remaining30 percent. We varied the number of Eigenface

process. We propose Vignette, a content management system for smartphone photography that improves the visual story by automating the organization of a user's photo album. We hope to provide dynamic organization of photo albums that supports user queries including requests