Face Recognition By Humans

To appear in Face Recognition: Models and Mechanisms; Academic Press

Pawan Sinha*, Benjamin J. Balas, Yuri Ostrovsky, Richard Russell
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139
*Corresponding author: psinha@mit.edu

The human visual system is remarkably proficient at the task of identifying faces, even under severely degraded viewing conditions. A grand quest in computer vision is to design automated systems that can match, and eventually exceed, this level of performance. In order to accomplish this goal, a better understanding of the human brain's face processing strategies is likely to be helpful. With this motivation, here we review four key aspects of human face perception performance: 1. tolerance for degradations, 2. the relative contribution of geometric and photometric cues, 3. the development of face perception skills, and 4. biologically inspired approaches for representing facial images. Taken together, the results provide strong constraints and guidelines for computational models of face recognition.

Introduction

The events of September 11, 2001, in the United States compellingly highlighted the need for systems that can identify individuals with known terrorist links. In rapid succession, three major international airports, Fresno, St. Petersburg and Logan, began testing face recognition systems. While such deployment raises complicated issues of privacy invasion, of even greater immediate concern is whether the technology is up to the task requirements.

Real-world tests of automated face-recognition systems have not yielded encouraging results. For instance, face recognition software at the Palm Beach International Airport, when tested on fifteen volunteers and a database of 250 pictures, had a success rate of less than fifty per cent and nearly fifty false alarms per five thousand passengers (translating to two to three false alarms per hour per checkpoint). Having to respond to a terror alarm every twenty minutes would, of course, be very disruptive for airport operations. Furthermore, variations such as eyeglasses, small facial rotations and lighting changes proved problematic for the system. Many other such tests have yielded similar results.

The primary stumbling block in creating effective face recognition systems is that we do not know how to quantify similarity between two facial images in a perceptually meaningful manner. Figure 1 illustrates this issue. Images 1 and 3 show the same individual from the front and oblique viewpoints, while image 2 shows a different person from the front. Conventional measures of image similarity (such as the Minkowski metrics (Duda and Hart, 1973)) would rate images 1 and 2 to be more similar than images 1 and 3. In other words, they fail to generalize across important and commonplace transformations. Other transforms that lead to similar difficulties include lighting variations, aging and expression changes. Clearly, similarity needs to be computed over attributes more complex than raw pixel values. To the extent that the human brain appears to have figured out which facial attributes are important for subserving robust recognition, it makes sense to turn to neuroscience for inspiration and clues.

Figure 1. Most conventional measures of image similarity would declare images 1 and 2 to be more similar than images 1 and 3, even though both members of the latter pair, but not the former, are derived from the same individual. This example highlights the challenge inherent in the face recognition task.
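To make this failure mode concrete, here is a minimal illustrative sketch (our addition, not from the chapter) of a Minkowski (L^p) distance between images. The random arrays are hypothetical stand-ins for the three photographs of figure 1; only the distance function itself corresponds to the metric the text refers to.

```python
import numpy as np

def minkowski_distance(img_a: np.ndarray, img_b: np.ndarray, p: float = 2.0) -> float:
    """L^p distance between two equal-sized grayscale images:
    (sum |a - b|^p)^(1/p). p=2 is Euclidean, p=1 is city-block."""
    diff = np.abs(img_a.astype(float) - img_b.astype(float))
    return float((diff ** p).sum() ** (1.0 / p))

# Hypothetical stand-ins for the three images of figure 1: person A frontal,
# person B frontal (globally similar to A's frontal view), and person A
# oblique (same identity, but pixels displaced by the viewpoint change).
rng = np.random.default_rng(0)
frontal_a = rng.random((64, 64))
frontal_b = frontal_a + 0.05 * rng.standard_normal((64, 64))  # different person, same pose
oblique_a = np.roll(frontal_a, shift=8, axis=1)               # same person, shifted view

# The pixel-level metric judges the two different people to be more
# similar than the two views of the same person.
print(minkowski_distance(frontal_a, frontal_b))  # small distance
print(minkowski_distance(frontal_a, oblique_a))  # large distance
```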

Work in the neuroscience of face perception can influence research on machine vision systems in two ways. First, studies of the limits of human face recognition abilities provide benchmarks against which to evaluate artificial systems. Second, studies characterizing the response properties of neurons in the early stages of the visual pathway can guide strategies for image preprocessing in the front-ends of machine vision systems. For instance, many systems use a wavelet representation of the image that corresponds to the multi-scale Gabor-like receptive fields found in the primary visual cortex (DeAngelis et al., 1993; Lee, 1996). We describe these and related schemes in greater detail later. However, beyond these early stages, it has been difficult to discern any direct connections between biological and artificial face-recognition systems. This is perhaps due to the difficulty in translating psychological findings into concrete computational prescriptions.

A case in point is an idea that several psychologists have emphasized: that facial configuration plays an important role in human judgments of identity (Bruce and Young, 1998; Collishaw and Hole, 2000). However, the experiments so far have not yielded a precise specification of what is meant by 'configuration' beyond the general notion that it refers to the relative placement of the different facial features. This makes it difficult to adopt this idea in the computational arena, especially when the option of using individual facial features such as eyes, noses and mouths is so much easier to describe and implement. Thus, several current systems for face recognition, and also for the related task of facial composite generation (creating a likeness from a witness description), are based on a piecemeal approach.

As an illustration of the problems associated with the piecemeal approach, consider the facial composite generation task. The dominant paradigm for having a witness describe a suspect's face to a police officer involves having him/her pick out the best matching features from a large collection of images of disembodied features. Putting these together yields a putative likeness of the suspect. The mismatch between this piecemeal strategy and the more holistic facial encoding scheme that may actually be used by the brain can lead to problems in the quality of reconstructions, as shown in figure 2. In order to create these faces, we enlisted the help of an individual who had several years of experience with the IdentiKit system, and had assisted the police department on multiple occasions in creating suspect likenesses. We gave him photographs of fourteen celebrities and requested him to put together IdentiKit reconstructions. There were no strict time constraints (the reconstructions were generated over two weeks) and the IdentiKit operator did not have to rely on verbal descriptions; he could directly consult the images we had provided him. In short, these reconstructions were generated under ideal operating conditions.

Figure 2. Four facial composites generated by an IdentiKit operator at the authors' request. The individuals depicted are all famous celebrities. The operator was given photographs of the celebrities and was asked to create the best likenesses using the kit of features in the system. Most observers are unable to recognize the people depicted, highlighting the problems of using a piecemeal approach in constructing and recognizing faces. The celebrities shown are, from left to right: Bill Cosby, Tom Cruise, Ronald Reagan and Michael Jordan.

The wide gulf between the face recognition performance of humans and machines suggests that there is much to be gained by improving the communication between human vision researchers on the one hand, and computer vision scientists on the other. This paper is a small step in that direction.

What aspects of human face recognition performance might be of greatest relevance to the work of computer vision scientists? The most obvious one is simply a characterization of performance that could serve as a benchmark. In particular, we would like to know how human recognition performance changes as a function of image quality degradations that are common in everyday viewing conditions, and that machine vision systems are required to be tolerant to. Beyond characterizing performance, it would be instructive to know about the kinds of facial cues that the human visual system uses in order to achieve its impressive abilities. This provides a tentative prescription for the design of machine-based face recognition systems. Coupled with this problem of system design is the issue of the balance between hard-wiring and on-line learning. Does the human visual system rely more on innately specified strategies for face processing, or is this a skill that emerges via learning? Finally, a computer vision scientist would be interested in knowing whether

there are any direct clues regarding face representations that could be derived from a study of biological systems. With these considerations in mind, we shall explore the following four fundamental questions in the domain of human vision.

1. What are the limits of human face recognition abilities, in terms of the minimum image resolution needed for a specified level of recognition performance?
2. What are some important cues that the human visual system relies upon for judgments of identity?
3. What is the timeline of development of face recognition skills? Are these skills innate or learned?
4. What are some biologically plausible face representation strategies?

1. What are the limits of human face recognition skills?

The human visual system (HVS) often serves as an informal standard for evaluating machine vision approaches. However, this standard is rarely applied in any systematic way. In order to be able to use the human visual system as a useful standard to strive towards, we need to first have a comprehensive characterization of its capabilities.

In order to characterize the HVS's face recognition capabilities, we shall describe experiments that address two issues: 1. how does human performance change as a function of image resolution? and 2. what are the relative contributions of internal and external features at different resolutions? Let us briefly consider why these two questions are worthy subjects of study.

The decision to examine recognition performance in images with limited resolution is motivated by both ecological and pragmatic considerations. In the natural environment, the brain is typically required to recognize objects when they are at a distance or viewed under sub-optimal conditions. In fact, the very survival of an animal may depend on its ability to use its recognition machinery as an early-warning system that can operate reliably with limited stimulus information. Therefore, by better capturing real-world viewing conditions, degraded images are well suited to help us understand the brain's recognition strategies.

Many automated vision systems, too, need to have the ability to interpret degraded images. For instance, images derived from present-day security equipment are often of poor resolution, due both to hardware limitations and large viewing distances. Figure 3 is a case in point. It shows a frame from a video sequence of Mohammad Atta, one of the perpetrators of the World Trade Center attacks, at a Maine airport on the morning of September 11, 2001. As the inset shows, the resolution in the face region is quite poor. For security systems to be effective, they need to be able to recognize suspected terrorists from such surveillance videos. This provides strong pragmatic motivation for our work. In order to understand how the human visual system interprets such images, and how a machine-based system could do the same, it is imperative that we study face recognition with such degraded images.

Fig. 3. A frame from a surveillance video showing Mohammad Atta at an airport in Maine on the morning of the 11th of September, 2001. As the inset shows, the resolution available in the face region is very limited. Understanding the recognition of faces under such conditions remains an open challenge and motivates the work reported here.

Furthermore, impoverished images serve as 'minimalist' stimuli, which, by dispensing with unnecessary detail, can potentially simplify our quest to identify aspects of object information that the brain preferentially encodes.

The decision to focus on the two types of facial feature sets, internal and external, is motivated by the marked disparity that exists in their use by current machine-based face analysis systems. It is typically assumed that internal features (eyes, nose and mouth) are the critical constituents of a face, and that the external features (hair and jaw-line) are too variable to be practically useful. It is interesting to ask whether the human visual system also employs a similar criterion in its use of the two types of features. Many interesting questions remain unanswered. Precisely how does face identification performance change as a function of image resolution? Does the relative importance of facial features change across different resolutions? Does featural saliency become proportional to featural size, favoring more global, external features like hair and jaw-line? Or are we still better at identifying familiar faces from internal features like the eyes, nose, and mouth? Even if we prefer internal features, does additional information from external features facilitate recognition? Our experiments were designed to address these open questions by assessing face recognition performance across various resolutions and by investigating the contributions of internal and external features.

Considering the importance of these issues, it is not surprising that a rich body of research has accumulated over the past few decades. Pioneering work on face recognition with low-resolution imagery was done by Harmon and Julesz (1973a, 1973b). Working with block-averaged images of familiar faces of the kind shown in figure 4, they found high recognition accuracies even with images containing just 16x16 blocks. However, this high level of performance could have been due, at least in part, to the fact that subjects were told which of a small set of people they were going to be shown in the experiment. More recent studies too have suffered from this problem. For instance, Bachmann (1991) and Costen et al. (1996) used six high-resolution photographs during the 'training' session and low-resolution versions of the same during the test sessions. The prior priming of subjects about the stimulus set, and the use of the same base photographs across the training and test sessions, render these experiments somewhat non-representative of real-world recognition situations. Also, the studies so far have not performed some important comparisons.

Specifically, it is not known how performance differs across various image resolutions when subjects are presented full faces versus when they are shown the internal features alone.

Figure 4. Images such as the one shown here have been used by several researchers to assess the limits of human face identification processes. (After Harmon and Julesz, 1973)

Our experiments on face recognition were designed to build upon the work reviewed above and to correct some of its weaknesses. Here, we describe an experimental study with two goals: 1. assessing performance as a function of image resolution, and 2. determining performance with internal features alone versus full faces.

The experimental paradigm we used required subjects to recognize celebrity facial images blurred by varying amounts (a sample set is shown in figure 5). We used 36 color face images and subjected each to a series of blurs. Subjects were shown the blurred sets, beginning with the highest level of blur and proceeding to the zero-blur condition. We also created two other stimulus sets. The first of these contained the individual facial features (eyes, nose and mouth) placed side by side, while the second had the internal features in their original spatial configuration. Three mutually exclusive groups of subjects were tested on the three conditions. In all these experiments, subjects were not given any information about which celebrities they would be shown during the tests. Chance-level performance was, therefore, close to zero.

Figure 5. Unlike current machine-based systems, human observers are able to handle significant degradations in face images. For instance, subjects are able to recognize more than half of all famous faces shown to them at the resolution depicted here. The individuals shown, from left to right, are: Prince Charles, Woody Allen, Bill Clinton, Saddam Hussein, Richard Nixon and Princess Diana.
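For readers who wish to reproduce this style of stimulus, here is a short sketch (our addition, under simplifying assumptions) of the two degradation methods discussed above: Harmon and Julesz-style block averaging, and the Gaussian blurring used to produce graded-resolution images. The block counts and blur widths are illustrative, not the exact experimental parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def block_average(img: np.ndarray, blocks: int) -> np.ndarray:
    """Harmon & Julesz-style mosaic: divide the image into a blocks x blocks
    grid and replace each cell with its mean intensity (the image is first
    cropped to a multiple of the grid size)."""
    h, w = img.shape[0] // blocks, img.shape[1] // blocks
    img = img[: h * blocks, : w * blocks].astype(float)
    means = img.reshape(blocks, h, blocks, w).mean(axis=(1, 3))
    return np.kron(means, np.ones((h, w)))  # re-expand the grid for display

def gaussian_blur(img: np.ndarray, sigma: float) -> np.ndarray:
    """Low-pass degradation; larger sigma means lower effective resolution."""
    return gaussian_filter(img.astype(float), sigma=sigma)

# Illustrative usage on a random stand-in for a face photograph:
face = np.random.default_rng(1).random((128, 128))
mosaic = block_average(face, blocks=16)   # roughly the 16x16-block regime
blurred = gaussian_blur(face, sigma=4.0)  # one level of a graded blur series
```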

Results

Figure 6 shows results from the different conditions. It is interesting to note that in the full-face condition, subjects can recognize more than half of the faces at image resolutions of merely 7x10 pixels. Recognition reaches almost ceiling level at a resolution of 19x27 pixels.

Figure 6. Recognition performance with internal features (with and without configural cues). Performance obtained with whole-head images is also included for comparison.

Performance of subjects with the other two stimulus sets is quite poor, even with relatively small amounts of blur. This clearly demonstrates the perceptual importance of the overall head configuration for face recognition. The internal features on their own, and even their mutual configuration, are insufficient to account for the impressive recognition performance of subjects with full-face images at high blur levels. This result suggests that feature-based approaches to recognition are likely to be less robust than those based on the overall head configuration. Figure 7 shows an image that underscores the importance of overall head shape in determining identity.

Figure 7. Although this image appears to be a fairly run-of-the-mill picture of Bill Clinton and Al Gore, a closer inspection reveals that both men have been digitally given identical inner face features and their mutual configuration. Only the external features are different. It appears,

therefore, that the human visual system makes strong use of the overall head shape in order todetermine facial identity. (From Sinha and Poggio, 1996)In summary, these experimental results lead us to some very interesting inferences about facerecognition:1. A high-level of face recognition performance can be obtained even with resolutions as low as12x14 pixels. The cues to identity must necessarily include those that can survive across massiveimage degradations. However, it is worth bearing in mind that the data we have reported herecome from famous face recognition tasks. The results may be somewhat different for unfamiliarfaces.2. Details of the internal features, on their own are insufficient for subserving a high-level ofrecognition performance.3. Even the mutual spatial configuration of the internal features is inadequate to explain theobserved recognition performance, especially when the inputs are degraded. It appears that,unlike the belief in computer vision, where internal features are of paramount importance, thehuman visual system relies on the relationship between external head geometry and internalfeatures. Additional experiments (Jarudi and Sinha, 2005) reveal a highly non-linear cuecombination strategy for merging information from internal and external features. In effect, evenwhen recognition performance with internal and external cues by themselves is statisticallyindistinguishable from chance, performance with the two together is very robust.These findings are not merely interesting facts about the human visual system; rather, they help inour quest to determine the nature of facial attributes that can subserve robust recognition. In asimilar vein of understanding the recognition of ‘impoverished’ faces, it might also be useful toanalyze of the work of minimalist portrait artists, especially caricaturists, who are able to capturevivid likenesses using very few strokes. Investigating which facial cues are preserved or enhancedin such simplified depictions can yield valuable insights about the significance of different facialattributes.We next turn to an exploration of cues for recognition in more conventional settings – highquality images. Even here, it turns out, the human visual system yields some very informativedata.2. What cues do humans use for face recognition?Beyond the difficulties presented by sub-optimal viewing conditions, as reviewed above, a keychallenge for face recognition systems comes from the overall similarity of faces. Unlike manyother classes of objects, faces share the same overall configuration of parts and the same scale.The convolutions in depth of the face surface vary little from one face to the next, and for themost part, the reflectance distributions across different faces are also very similar. The result ofthis similarity is that faces seemingly contain few distinctive features that allow for easydifferentiation. Despite this fundamental difficulty, human face recognition is fast and accurate.This means that we must be able to make use of some reliable cues to face identity, and in thissection we will consider what these cues could be.For a cue to be useful for recognition, it must differ between faces. Though this point is obvious,it leads us to the question, what kinds of visual differences could there be between faces? Theobjects of visual perception are surfaces. At the most basic level, there are three variables that gointo determining the visual appearance of a surface 1) the light that is incident on the surface 2)8

Illumination has a large effect on the image-level appearance of a face, a fact well known to artists and machine vision researchers. Indeed, when humans are asked to match different images of the same face, performance is worse when the two images to be matched are illuminated differently (Braje et al., 1998; Hill and Bruce, 1996), although the decline in performance is not as sharp as for machine algorithms. However, humans are highly accurate at naming familiar faces under different illumination (Moses et al., 1994). This finding fits our informal experience, in which we are able to recognize people under a wide variety of lighting conditions, and do not experience the identity of a person as changing when we walk with them, say, from indoor to outdoor lighting. There are certain special conditions under which illumination can have a dramatic impact on assessments of identity. Lighting from below is a case in point. However, this is a rare occurrence; under natural conditions, including artificial lighting, faces are almost never lit from below. Consistent with the notion that the representation of facial identity includes this statistical regularity, faces look odd when lit from below, and several researchers have found that face recognition is impaired by bottom lighting (Enns and Shore, 1997; Hill and Bruce, 1996; Johnston et al., 1992; McMullen et al., 2000). Overall, these findings are consistent with the notion that the human visual system does make some use of lighting regularities for recognizing faces.

The two other cues that could potentially be useful for face recognition are the shape and reflectance properties of the face surface. We will use the term 'pigmentation' here to refer to the surface reflectance properties of the face. Shape cues include boundaries, junctions, and intersections, as well as any other cue that gives information about the location of a surface in space, such as stereo disparity, shape from shading, and motion parallax. Pigmentation cues include albedo, hue, texture, translucence, specularity, and any other property of surface reflectance. 'Second-order relations', or the distances between facial features such as the eyes and mouth, are a subclass of shape. However, the relative reflectance of those features or parts of features, such as the relative darkness of parts of the eye, or of the entire eye region and the cheek, is a subclass of pigmentation. This division of cues into two classes is admittedly not perfect. For example, should a border defined by a luminance gradient, such as the junction of the iris and sclera, be classified as a shape or a pigmentation cue? Because faces share a common configuration of parts, we classify such borders as shape cues when they are common to all faces (e.g. the iris-sclera border), but as pigmentation cues when they are unique to an individual (e.g. moles and freckles). It should also be noted that this is not an image-level description. For example, a particular luminance contour could not be classified as caused by shape or by pigmentation from local contrast alone. Although the classification cannot completely separate shape and pigmentation cues, it does separate the vast majority of cues.
We believe that dividing cues for recognition into shape and pigmentation is a useful distinction to draw.

Indeed, this division has been used to investigate human recognition of non-face objects, although in that literature pigmentation has typically been referred to as 'color' or 'surface'. Much of this work has compared the human ability to recognize objects from line drawings versus images containing pigmentation cues, such as photographs. The assumption here is that line drawings contain shape cues but not pigmentation cues, and hence the ability to recognize an object from a line drawing indicates reliance on shape cues. In particular, these studies have found recognition of line drawings to be as good (Biederman and Ju, 1988; Davidoff and Ostergaard, 1988; Ostergaard and Davidoff, 1985), or almost as good (Humphrey, 1994; Price and Humphreys, 1989; Wurm, 1993), as

recognition of photographs. On the basis of these and similar studies, there is a consensus that, in most cases, pigmentation is less important than shape for object recognition (Palmer, 1999; Tanaka et al., 2001; Ullman, 1996).

In the face recognition literature, too, there is a broadly held implicit belief that shape cues are more important than pigmentation cues. Driven by this assumption, line drawings and other non-pigmented stimuli are commonly used in experimental investigations of face recognition. Similarly, many models of human face recognition use only shape cues, such as distances, as the featural inputs. This only makes sense if it is assumed that the pigmentation cues being omitted are unimportant. Also, there are many more experimental studies investigating specific aspects of shape than of pigmentation, suggesting that the research community is less aware of pigmentation as a relevant component of face representations. However, this assumption turns out to be false. In the rest of this section, we will review evidence that both shape and pigmentation cues are important for face recognition.

First, we briefly review the evidence that shape cues alone can support recognition. Specifically, we can recognize representations of faces that have no variation in pigmentation, and hence no useful pigmentation cues. Many statues have no useful pigmentation cues to identity because they are composed of a single material, such as marble or bronze, yet are recognizable as representations of a specific individual. Systematic studies with 3D laser-scanned faces that similarly lack variation in pigmentation have found that recognition can proceed with shape cues alone (Bruce et al., 1991; Hill and Bruce, 1996; Liu et al., 1999). Clearly, shape cues are important, and sometimes sufficient, for face recognition.

However, the ability to recognize a face in the absence of pigmentation cues does not mean that such cues are not used under normal conditions. To consider a rough analogy, the fact that we can recognize a face from a view of only the left side does not imply that the right side is not also relevant to recognition. There is reason to believe that pigmentation may also be important for face recognition. Unlike other objects, faces are much more difficult to recognize from a line drawing than from a photograph (Bruce et al., 1992; Davies et al., 1978; Leder, 1999; Rhodes et al., 1987), suggesting that the pigmentation cues thrown away by such a representation may well be important.

Recently, some researchers have attempted to directly compare the relative importance of shape and pigmentation cues for face recognition. The experimental challenge here is to create a stimulus face that appears naturalistic, yet lacks either useful shape or useful pigmentation cues. The basic approach, first taken by O'Toole and colleagues (O'Toole et al., 1999), is to create one set of face representations with the shape of a particular face and the average pigmentation of a large group of faces, and a separate set of face representations with the pigmentation of an actual face and the average shape of many faces. The rationale is that to distinguish among the faces from the set that all have the same pigmentation, subjects must use shape cues, and vice versa for the set of faces that all have the same shape. In O'Toole's study, the faces were created by averaging separately the shape and reflectance data from laser scans.
The shape data from individual faces were combined with the averaged pigmentation to create the set of faces that differ only in shape; the complementary procedure, combining individual pigmentation with the averaged shape, created the set of faces that differ only in pigmentation. Subjects were trained with one or the other set of face images, and were subsequently tested for memory. Recognition performance was about equal with the shape and pigmentation sets, suggesting that both cues are important for recognition.

A question that arises when comparing the utility of different cues is whether the relative performance is due to differences in image similarity or differences in processing. One way to

address this is to equate the similarity of a set of images that differ by only one or the other cue. If the sets of images are equated for similarity, differences in recognition performance with the sets can be attributed to differences in processing. We investigated this question in our lab (Russell et al., 2004) with sets of faces that differed by only one or the other cue and were equated for image similarity.
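To summarize the logic of this stimulus construction, here is a schematic sketch (our addition, not the actual pipeline of O'Toole et al. or Russell et al.), under the strong assumption that each face has already been put into dense correspondence, so that it can be represented as a shape vector of landmark coordinates plus a shape-normalized texture (pigmentation) vector; the correspondence and rendering steps are omitted. The pairwise-distance function is one simple way such sets could be checked for matched similarity, not necessarily the published procedure.

```python
import numpy as np

def cue_isolated_sets(shapes: np.ndarray, textures: np.ndarray):
    """shapes:   (N, S) landmark-coordinate vectors (the shape cue).
    textures: (N, T) shape-normalized reflectance vectors (pigmentation).
    Returns a 'shape-only' set (individual shape + mean pigmentation) and
    a 'pigmentation-only' set (mean shape + individual pigmentation)."""
    mean_shape, mean_texture = shapes.mean(axis=0), textures.mean(axis=0)
    shape_only = [(s, mean_texture) for s in shapes]
    pigment_only = [(mean_shape, t) for t in textures]
    return shape_only, pigment_only

def mean_pairwise_distance(vecs: np.ndarray) -> float:
    """Average Euclidean distance over all pairs: a crude index of how
    physically similar the members of a stimulus set are to one another."""
    n = len(vecs)
    return float(np.mean([np.linalg.norm(vecs[i] - vecs[j])
                          for i in range(n) for j in range(i + 1, n)]))

# Toy usage with random stand-ins for 20 registered faces. Within each
# constructed set only one cue varies, so set-internal similarity can be
# computed on that cue alone and compared across the two sets.
rng = np.random.default_rng(2)
shapes, textures = rng.random((20, 160)), rng.random((20, 4096))
shape_set, pigment_set = cue_isolated_sets(shapes, textures)
print(mean_pairwise_distance(shapes), mean_pairwise_distance(textures))
```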
