Pattern Recognition - University of California, Irvine

Transcription

Pattern Recognition

"A pattern is essentially an arrangement. It is characterized by the order of the elements of which it is made, rather than by the intrinsic nature of these elements." - Norbert Wiener

What is Pattern Recognition?
- Pattern recognition (PR) is the scientific discipline that concerns the description and classification (recognition) of patterns (objects)
- PR techniques are an important component of intelligent systems and are used for many application domains
  - Decision making
  - Object and pattern classification

What is Pattern Recognition? Definitions from the literature
- "The assignment of a physical object or event to one of several pre-specified categories" - Duda and Hart
- "A problem of estimating density functions in a high-dimensional space and dividing the space into the regions of categories or classes" - Fukunaga
- "Given some examples of complex signals and the correct decisions for them, make decisions automatically for a stream of future examples" - Ripley
- "The science that concerns the description or classification (recognition) of measurements" - Schalkoff
- "The process of giving names ω to observations x" - Schürmann
- Pattern recognition is concerned with answering the question "What is this?" - Morse

Pattern Recognition and Related Areas
- Image processing
- Video processing
- Speech/audio processing
- Natural language processing
- Machine learning
- Neural networks
- Database engineering
- Bioinformatics
- Much more ...

Machine Perception
- Build a machine that can recognize patterns:
  - Speech recognition
  - Fingerprint identification
  - OCR (Optical Character Recognition)
  - DNA sequence identification

A "Toy" Example: Fish Classification
- "Sort incoming fish on a conveyor according to species using optical sensing"
- Species: sea bass, salmon

- Problem analysis
  - Set up a camera and take some sample images to extract features
    - Length
    - Lightness
    - Width
    - Number and shape of fins
    - Position of the mouth, etc.
  - This is the set of all suggested features to explore for use in our classifier!

- Preprocessing
  - Use a segmentation operation to isolate the fish from one another and from the background
  - Extract one fish for the next step
- Feature extraction
  - Measuring certain features of the fish to be classified
  - One of the most critical steps in pattern recognition system design
- Classification
  - Select the length of the fish as a possible feature for discrimination (see the sketch below)
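To make this last step concrete, a single-feature classifier can be sketched as a threshold on length. A minimal Python sketch; the measurements and the threshold value are hypothetical, chosen only to illustrate the idea:

```python
import numpy as np

# Hypothetical length measurements (cm) extracted from segmented fish images
lengths = np.array([34.1, 52.7, 41.3, 60.2, 38.9])

# Hypothetical decision threshold: call a fish "sea bass" if it is longer
# than the threshold, "salmon" otherwise
THRESHOLD = 45.0

def classify_by_length(length):
    """Classify a single fish from its length feature alone."""
    return "sea bass" if length > THRESHOLD else "salmon"

for l in lengths:
    print(f"length={l:5.1f} cm -> {classify_by_length(l)}")
```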


Adopt the lightness and add the width of the fish:
$$\mathbf{x}^T = [x_1, x_2] \quad (x_1 = \text{lightness},\; x_2 = \text{width})$$
(figure: fish samples in the lightness-width plane)

- We might add other features that are not correlated with the ones we already have. A precaution should be taken not to reduce the performance by adding such "noisy features"
- Ideally, the best decision boundary should be the one which provides optimal performance, such as in the following figure:

Pattern Recognition Systems
- Sensing
  - Use of a transducer (camera or microphone)
  - The PR system depends on the bandwidth, resolution, sensitivity, distortion, etc. of the transducer
- Segmentation and grouping
  - Patterns should be well separated and should not overlap


- Feature extraction
  - Discriminative features
  - Features invariant with respect to translation, rotation, and scale
- Classification
  - Use the feature vector provided by the feature extractor to assign the object to a category
- Post-processing
  - Exploit context-dependent information other than from the target pattern itself to improve performance

The Design Cycle
- Data collection
- Feature choice
- Model choice
- Training
- Evaluation
- Computational complexity


- Data collection
  - How do we know when we have collected an adequately large and representative set of examples for training and testing the system?
- Feature choice
  - Depends on the characteristics of the problem domain. Features should be simple to extract, invariant to irrelevant transformations, and insensitive to noise.

- Model choice
  - If we are unsatisfied with the performance of our fish classifier, we may want to jump to another class of model
- Training
  - Use the data to determine the classifier; there are many different procedures for training classifiers and choosing models

- Evaluation
  - Measure the error rate for:
    - Different feature sets
    - Different training methods
    - Different training and test data sets
- Computational complexity
  - What is the trade-off between computational ease and performance? (How does an algorithm scale as a function of the number of features, patterns, or categories?)

Supervised & Unsupervised Learning
- Supervised learning
  - A teacher provides a category label or cost for each pattern in the training set (i.e., ground truth based on experts' knowledge)
- Unsupervised learning
  - The system forms clusters or "natural groupings" of the input patterns (see the sketch below)
- Semi-supervised learning
  - Use both labeled and unlabeled patterns to reduce the labeling cost
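As a concrete instance of unsupervised grouping, a few iterations of k-means clustering can be sketched in numpy. k-means is one classical way to form such groupings, though it is not named on the slide; the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical unlabeled groups of 2-D patterns
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(3.0, 0.5, (50, 2))])

k = 2
centers = X[rng.choice(len(X), k, replace=False)]  # initial centers from data
for _ in range(10):
    # assign each pattern to the nearest center (Euclidean distance)
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    # move each center to the mean of its assigned patterns
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])

print("cluster centers:\n", centers)
```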

Approaches for PR
- Statistical (StatPR)
  - Patterns are classified based on an underlying statistical model of the features
  - The statistical model is defined by a family of class-conditional probability density functions Pr(x | c_i) (probability of feature vector x given class c_i)
- Neural (NeurPR)
  - Classification is based on the response of a network of processing units (neurons) to an input stimulus (pattern)
  - "Knowledge" is stored in the connectivity and strength of the synaptic weights
  - NeurPR is a trainable, non-algorithmic, black-box strategy
  - NeurPR is very attractive since it requires minimal a priori knowledge; with enough layers and neurons, an ANN can create any complex decision region
- Syntactic (SyntPR)
  - Patterns are classified based on measures of structural similarity
  - "Knowledge" is represented by means of formal grammars or relational descriptions (graphs)
  - SyntPR is used not only for classification, but also for description
  - Typically, SyntPR approaches formulate hierarchical descriptions of complex patterns built up from simpler subpatterns

From: Ricardo Gutierrez-Osuna, Texas A&M University


Decision-Theoretic Methods
- Decision (discriminant) functions
  Given a pattern vector $\mathbf{x} = (x_1, x_2, \dots, x_n)$, $\mathbf{x} \in R^n$, and pattern classes $\omega_1, \omega_2, \dots, \omega_W$, find $W$ decision functions $d_1(\mathbf{x}), d_2(\mathbf{x}), \dots, d_W(\mathbf{x})$ such that, if $\mathbf{x} \in \omega_i$, then
  $$d_i(\mathbf{x}) > d_j(\mathbf{x}), \quad j = 1, 2, \dots, W;\; j \ne i$$
- Decision boundary
  $$d_i(\mathbf{x}) - d_j(\mathbf{x}) = 0$$
  For two classes $\omega_i, \omega_j$, define $d_{ij}(\mathbf{x}) = d_i(\mathbf{x}) - d_j(\mathbf{x})$: then $d_{ij}(\mathbf{x}) > 0$ for $\mathbf{x} \in \omega_i$ and $d_{ij}(\mathbf{x}) < 0$ for $\mathbf{x} \in \omega_j$ (see the sketch below)
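A minimal Python sketch of this decision rule; the two linear decision functions below are hypothetical examples, not from the slides:

```python
import numpy as np

def classify(x, decision_functions):
    """Assign x to the class whose decision function is largest,
    following the rule d_i(x) > d_j(x) for all j != i."""
    scores = [d(x) for d in decision_functions]
    return int(np.argmax(scores)) + 1  # class index: omega_1 .. omega_W

# Two hypothetical linear decision functions, for illustration only
d1 = lambda x: 2.0 * x[0] + 1.0 * x[1] - 3.0
d2 = lambda x: -1.0 * x[0] + 0.5 * x[1] + 1.0

x = np.array([1.5, 0.5])
print("x assigned to class omega_%d" % classify(x, [d1, d2]))
# The decision boundary between the two classes is d1(x) - d2(x) = 0
```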

Decision-Theoretic Methods
- Matching: each class is represented by a prototype pattern vector; a predefined metric is needed
  - Minimum distance classifier
  - Matching by correlation
- Optimal statistical classifiers
  - Bayes classifier
- Neural networks
  - Perceptrons
  - Layers
  - Training

Matching
- Minimum distance classifier (see the sketch below)
  - The prototype is defined as the mean vector of the class:
    $$\mathbf{m}_j = \frac{1}{N_j} \sum_{\mathbf{x} \in \omega_j} \mathbf{x}, \quad j = 1, 2, \dots, W$$
  - Compare the Euclidean distances:
    $$D_j(\mathbf{x}) = \|\mathbf{x} - \mathbf{m}_j\|, \quad j = 1, 2, \dots, W$$
  - Decision function:
    $$d_j(\mathbf{x}) = \mathbf{x}^T \mathbf{m}_j - \frac{1}{2}\mathbf{m}_j^T \mathbf{m}_j, \quad j = 1, 2, \dots, W$$
  - Decision boundary between two classes:
    $$d_{ij}(\mathbf{x}) = d_i(\mathbf{x}) - d_j(\mathbf{x}) = \mathbf{x}^T(\mathbf{m}_i - \mathbf{m}_j) - \frac{1}{2}(\mathbf{m}_i - \mathbf{m}_j)^T(\mathbf{m}_i + \mathbf{m}_j) = 0$$
  - The surface is the perpendicular bisector of the line segment joining m_i and m_j. For n = 2 it is a line, for n = 3 it is a plane, and for n > 3 it is a hyperplane.
  - Performance depends on the separation of the class means relative to the spread of each class
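A minimal numpy sketch of the minimum distance classifier defined above, using the decision function d_j(x) = x^T m_j - ½ m_j^T m_j; the class labels and data are hypothetical:

```python
import numpy as np

class MinimumDistanceClassifier:
    """Each class is represented by its mean vector m_j;
    d_j(x) = x^T m_j - 0.5 * m_j^T m_j (equivalent to nearest mean)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Evaluate d_j(x) for every class and take the largest
        d = X @ self.means_.T - 0.5 * (self.means_ ** 2).sum(axis=1)
        return self.classes_[np.argmax(d, axis=1)]

# Hypothetical 2-D training data for two classes
X = np.array([[0.0, 0.1], [0.2, -0.1], [3.0, 3.1], [2.8, 2.9]])
y = np.array([0, 0, 1, 1])
clf = MinimumDistanceClassifier().fit(X, y)
print(clf.predict(np.array([[0.1, 0.0], [2.9, 3.0]])))  # expected: [0 1]
```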

Matching
- Matching by correlation
  - Find matches of a subimage w(x,y) of size J×K within the image f(x,y) of size M×N:
    $$c(x, y) = \sum_s \sum_t f(s, t)\, w(x + s, y + t), \quad x = 0, 1, \dots, M - 1;\; y = 0, 1, \dots, N - 1$$
  - A relative (normalized) correlation coefficient is preferred:
    $$\gamma(x, y) = \frac{\sum_s \sum_t [f(s, t) - \bar{f}(s, t)]\,[w(x + s, y + t) - \bar{w}]}{\left\{ \sum_s \sum_t [f(s, t) - \bar{f}(s, t)]^2 \; \sum_s \sum_t [w(x + s, y + t) - \bar{w}]^2 \right\}^{1/2}}$$
  - Vulnerable to scale changes or rotation changes
  - The non-normalized version can be realized in the FFT domain as well
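A direct spatial-domain numpy sketch of the normalized coefficient γ(x, y); a practical implementation would more likely use the FFT route mentioned above. The image and template are synthetic:

```python
import numpy as np

def normalized_correlation(f, w):
    """Slide template w over image f and return the correlation
    coefficient gamma(x, y) at every valid offset (no padding)."""
    J, K = w.shape
    M, N = f.shape
    wc = w - w.mean()
    gamma = np.zeros((M - J + 1, N - K + 1))
    for x in range(M - J + 1):
        for y in range(N - K + 1):
            patch = f[x:x + J, y:y + K]
            pc = patch - patch.mean()
            denom = np.sqrt((pc ** 2).sum() * (wc ** 2).sum())
            gamma[x, y] = (pc * wc).sum() / denom if denom > 0 else 0.0
    return gamma

# Synthetic image containing an exact copy of the template
f = np.zeros((8, 8))
w = np.arange(9, dtype=float).reshape(3, 3)
f[2:5, 3:6] = w
g = normalized_correlation(f, w)
print(np.unravel_index(np.argmax(g), g.shape))  # -> (2, 3), the match location
```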

Optimal Statistical Classifiers
- Optimal in the sense that the classifier yields the lowest probability of committing a classification error
  - Probability that x comes from ω_i: $p(\omega_i \mid \mathbf{x})$
  - If x is (wrongly) assigned to ω_j: loss $L_{ij}$
  - Conditional averaged risk (loss):
    $$r_j(\mathbf{x}) = \sum_{k=1}^{W} L_{kj}\, p(\omega_k \mid \mathbf{x})$$
  - Using Bayes' rule $p(A \mid B) = p(A)\,p(B \mid A)/p(B)$ with $p(\mathbf{x} \mid \omega_k)$, the pdf of the patterns from class $\omega_k$:
    $$r_j(\mathbf{x}) = \frac{1}{p(\mathbf{x})} \sum_{k=1}^{W} L_{kj}\, p(\mathbf{x} \mid \omega_k)\, P(\omega_k), \quad \text{where } P(\omega_k) \text{ is the probability of occurrence of class } \omega_k$$
  - The Bayes classifier minimizes the conditional averaged risk, i.e., it assigns a new pattern x to class ω_i if $r_i(\mathbf{x}) < r_j(\mathbf{x})$:
    $$\sum_{k=1}^{W} L_{ki}\, p(\mathbf{x} \mid \omega_k)\, P(\omega_k) < \sum_{q=1}^{W} L_{qj}\, p(\mathbf{x} \mid \omega_q)\, P(\omega_q)$$

Bayes classifier (cont.)
- The loss for a correct decision is 0, the loss for any incorrect decision is 1. Then:
  $$L_{ij} = 1 - \delta_{ij}, \quad \text{where } \delta_{ij} = 1 \text{ if } i = j \text{ and } \delta_{ij} = 0 \text{ if } i \ne j$$
  $$r_j(\mathbf{x}) = \sum_{k=1}^{W} (1 - \delta_{kj})\, p(\mathbf{x} \mid \omega_k)\, P(\omega_k) = p(\mathbf{x}) - p(\mathbf{x} \mid \omega_j)\, P(\omega_j)$$
  Decision: $\mathbf{x} \in \omega_i$ if $p(\mathbf{x}) - p(\mathbf{x} \mid \omega_i) P(\omega_i) < p(\mathbf{x}) - p(\mathbf{x} \mid \omega_j) P(\omega_j)$
  Equivalently: $p(\mathbf{x} \mid \omega_i) P(\omega_i) > p(\mathbf{x} \mid \omega_j) P(\omega_j)$
  Decision function: $d_j(\mathbf{x}) = p(\mathbf{x} \mid \omega_j)\, P(\omega_j), \quad j = 1, 2, \dots, W$
- Two probabilities are needed. While P(ω_j) is easy to estimate, p(x | ω_j) requires multivariate probability methods for its estimation, which is too complicated to use in practice. Instead, analytical expressions (models) of the pdf are used, and the necessary parameters are estimated from sample patterns of each class.

Bayes classifier (cont.)
- B.C. for Gaussian pattern classes
  - Consider the 1-D case (n = 1) with two classes (W = 2) governed by the Gaussian densities N(m_1, σ_1) and N(m_2, σ_2):
    $$d_j(x) = p(x \mid \omega_j)\, P(\omega_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{(x - m_j)^2}{2\sigma_j^2}}\, P(\omega_j), \quad j = 1, 2$$
  - For the n-dimensional case, the covariance matrix takes the place of the simple variance:
    $$p(\mathbf{x} \mid \omega_j) = \frac{1}{(2\pi)^{n/2}\, |C_j|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x} - \mathbf{m}_j)^T C_j^{-1} (\mathbf{x} - \mathbf{m}_j)}$$
    where $\mathbf{m}_j = E_j\{\mathbf{x}\}$, $C_j = E_j\{(\mathbf{x} - \mathbf{m}_j)(\mathbf{x} - \mathbf{m}_j)^T\}$, and $E_j\{\cdot\}$ denotes the expected value over class $\omega_j$
  - Sample approximations of $E_j\{\cdot\}$:
    $$\mathbf{m}_j \approx \frac{1}{N_j} \sum_{\mathbf{x} \in \omega_j} \mathbf{x}; \qquad C_j \approx \frac{1}{N_j} \sum_{\mathbf{x} \in \omega_j} \mathbf{x}\mathbf{x}^T - \mathbf{m}_j \mathbf{m}_j^T$$

Bayes classifier (cont.)
- B.C. for Gaussian pattern classes
  - The exponential form allows working with natural logarithms, since the logarithm is a monotonically increasing function that preserves the numerical order:
    $$d_j(\mathbf{x}) = \ln[p(\mathbf{x} \mid \omega_j)\, P(\omega_j)] = \ln p(\mathbf{x} \mid \omega_j) + \ln P(\omega_j)$$
    $$d_j(\mathbf{x}) = \ln P(\omega_j) - \frac{n}{2}\ln 2\pi - \frac{1}{2}\ln|C_j| - \frac{1}{2}(\mathbf{x} - \mathbf{m}_j)^T C_j^{-1} (\mathbf{x} - \mathbf{m}_j)$$
    or, dropping the term that is the same for all classes,
    $$d_j(\mathbf{x}) = \ln P(\omega_j) - \frac{1}{2}\ln|C_j| - \frac{1}{2}(\mathbf{x} - \mathbf{m}_j)^T C_j^{-1} (\mathbf{x} - \mathbf{m}_j)$$
  - The decision functions in n-D space are hyperquadrics (quadratic functions in n-D space)
  - Simplifications:
    - If all covariance matrices are equal, the decision functions become linear (hyperplanes)
    - If C = I and P(ω_j) = 1/W, the Bayes classifier reduces to the minimum distance classifier
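Putting the Gaussian case together, a minimal numpy sketch that estimates m_j, C_j, and P(ω_j) from labeled samples and classifies with the log-form decision function above; the data are synthetic:

```python
import numpy as np

def fit_gaussian_bayes(X, y):
    """Estimate (m_j, C_j, P(omega_j)) for each class from labeled samples."""
    classes = np.unique(y)
    params = []
    for c in classes:
        Xc = X[y == c]
        params.append((Xc.mean(axis=0), np.cov(Xc.T), len(Xc) / len(X)))
    return classes, params

def log_discriminant(x, m, C, prior):
    """d_j(x) = ln P(omega_j) - 0.5 ln|C_j| - 0.5 (x-m)^T C^{-1} (x-m)."""
    diff = x - m
    return (np.log(prior) - 0.5 * np.log(np.linalg.det(C))
            - 0.5 * diff @ np.linalg.solve(C, diff))

def predict(x, classes, params):
    scores = [log_discriminant(x, m, C, p) for m, C, p in params]
    return classes[int(np.argmax(scores))]

# Synthetic 2-D Gaussian classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.repeat([0, 1], 100)
classes, params = fit_gaussian_bayes(X, y)
print(predict(np.array([3.5, 4.2]), classes, params))  # expected: 1
```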

Neural Networks
- Preliminaries
  - Training patterns and training sets. The process by which a training set is used to obtain decision functions is called learning or training
  - e.g., training is used to determine the parameters of the decision function (means, covariance matrices)
  - The statistical properties of pattern classes are often unknown (hard to estimate). Such problems are best handled by direct training (no need to make assumptions regarding the unknown pdfs)
  - Neurons: non-linear computing elements
  - Organized in networks
  - Learning machines called perceptrons

Perceptron
- The appropriate weights are applied to the inputs ($x_0 = 1, x_1, \dots, x_n$ with weights $w_0, w_1, \dots, w_n$), and the resulting weighted sum is passed to a function that produces the output y:
  $$u = \sum_i x_i w_i, \qquad y = f(u) = f(\mathbf{x}^T \mathbf{w})$$

Neural Networks
- Perceptron for two pattern classes (see figure above)
  - Weights modify the inputs before they reach the activation function
  - The decision boundary is a hyperplane in n-D space
  - The first n coefficients establish the orientation of the hyperplane, while $w_{n+1}$ determines its distance to the origin
  - Formally, the free weight $w_{n+1}$ can be included in the summation by assuming one more (constant) input
  - The key problem is to find the weight vector w using a given training set of patterns from each of the two classes

Neural Networks
- Training algorithms (sketched in code below)
  - Linearly separable classes: an iterative algorithm
    - w(1): initial weight vector (arbitrarily chosen)
    - At the k-th iterative step:
      - if $\mathbf{y}(k) \in \omega_1$ and $\mathbf{w}^T(k)\mathbf{y}(k) \le 0$, replace w(k) with $\mathbf{w}(k+1) = \mathbf{w}(k) + c\,\mathbf{y}(k)$, where c is a positive correction increment
      - if $\mathbf{y}(k) \in \omega_2$ and $\mathbf{w}^T(k)\mathbf{y}(k) \ge 0$, replace w(k) with $\mathbf{w}(k+1) = \mathbf{w}(k) - c\,\mathbf{y}(k)$
      - otherwise, leave w(k) unchanged: $\mathbf{w}(k+1) = \mathbf{w}(k)$
  - The algorithm converges if the two training sets are linearly separable. This is called the perceptron training theorem
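A minimal Python sketch of this iterative rule, assuming patterns augmented with a constant 1 (so the free weight is absorbed into w) and a fixed correction increment c; the toy patterns are hypothetical:

```python
import numpy as np

def train_perceptron(Y1, Y2, c=1.0, max_epochs=100):
    """Iterative perceptron training for two linearly separable classes."""
    aug = lambda X: np.hstack([X, np.ones((len(X), 1))])  # append constant 1
    Y1, Y2 = aug(Y1), aug(Y2)
    w = np.zeros(Y1.shape[1])  # arbitrary initial weight vector
    for _ in range(max_epochs):
        changed = False
        for yk in Y1:                      # want w^T y > 0 for omega_1
            if w @ yk <= 0:
                w += c * yk; changed = True
        for yk in Y2:                      # want w^T y < 0 for omega_2
            if w @ yk >= 0:
                w -= c * yk; changed = True
        if not changed:                    # no corrections: converged
            return w
    return w

w = train_perceptron(np.array([[2.0, 2.0]]), np.array([[-1.0, -1.0]]))
print(w)
```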

Neural Networks
- Training for nonseparable classes: the delta rule
  Criterion function:
  $$J(\mathbf{w}) = \frac{1}{2}(r - \mathbf{w}^T\mathbf{y})^2, \quad \text{where } r = +1 \text{ if } \mathbf{y} \in \omega_1 \text{ and } r = -1 \text{ if } \mathbf{y} \in \omega_2$$
  J(w) has a minimum when $r = \mathbf{w}^T\mathbf{y}$; hence adjust w in the direction of the negative gradient:
  $$\mathbf{w}(k+1) = \mathbf{w}(k) - \alpha \left.\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w} = \mathbf{w}(k)}, \qquad \frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = -(r - \mathbf{w}^T\mathbf{y})\,\mathbf{y}$$
  $$\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha\,\big(r(k) - \mathbf{w}^T(k)\mathbf{y}(k)\big)\,\mathbf{y}(k)$$
  $$\mathbf{w}(k+1) = \mathbf{w}(k) + \Delta\mathbf{w} = \mathbf{w}(k) + \alpha\, e(k)\,\mathbf{y}(k), \quad \text{where } e(k) = r(k) - \mathbf{w}^T(k)\mathbf{y}(k)$$
  is the error committed with w(k). If we change to w(k+1) but keep the same pattern, the new error is $r(k) - \mathbf{w}^T(k+1)\mathbf{y}(k)$, so the change of the error is
  $$\Delta e(k) = \big(r(k) - \mathbf{w}^T(k+1)\mathbf{y}(k)\big) - \big(r(k) - \mathbf{w}^T(k)\mathbf{y}(k)\big) = -\big(\mathbf{w}^T(k+1) - \mathbf{w}^T(k)\big)\,\mathbf{y}(k)$$
  $$\Delta e(k) = -\Delta\mathbf{w}^T\mathbf{y}(k) = -\alpha\, e(k)\,\mathbf{y}^T(k)\,\mathbf{y}(k) = -\alpha\, e(k)\,\|\mathbf{y}(k)\|^2$$

Neural Networks
- Delta rule comments (a minimal coded version follows below):
  - Changing the weights reduces the error by a factor determined by α and the energy of y
  - The choice of α controls the stability and speed of convergence. For stability, 0 < α < 1; the practical range is 0.1 < α < 1.0
  - The algorithm converges over the patterns of the training set. For separable classes, the solution may or may not produce a separating hyperplane
  - Can be generalized to more than two classes and to non-linear decision functions
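A minimal numpy sketch of the update w(k+1) = w(k) + α e(k) y(k); the patterns, targets, and α below are illustrative choices:

```python
import numpy as np

def delta_rule(Y, r, alpha=0.3, epochs=50):
    """Delta-rule training: w(k+1) = w(k) + alpha * e(k) * y(k),
    with e(k) = r(k) - w(k)^T y(k); r = +1 for omega_1, -1 for omega_2."""
    w = np.zeros(Y.shape[1])
    for _ in range(epochs):
        for yk, rk in zip(Y, r):
            e = rk - w @ yk          # error with the current weights
            w = w + alpha * e * yk   # gradient-descent step on J(w)
    return w

# Hypothetical augmented 2-D patterns (last component is the constant 1)
Y = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0],
              [-1.0, -1.5, 1.0], [-2.0, -0.5, 1.0]])
r = np.array([1.0, 1.0, -1.0, -1.0])
w = delta_rule(Y, r)
print(w, np.sign(Y @ w))  # signs should match r for this separable toy set
```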

Neural Networks
- Multilayer feedforward NN
  - One output layer (Q) and several intermediate layers; usually the first layer (A) has the dimension of the input vectors
  - Each neuron has a structure similar to the perceptron model, but the hard-limiting function is replaced by a soft-limiting 'sigmoid'. The smooth function is preferred because of its differentiability:
    $$h_j(I_j) = \frac{1}{1 + e^{-(I_j + \theta_j)/\theta_0}}; \quad \theta_j \text{ controls the offset, } \theta_0 \text{ controls the shape}$$
  - The parameter θ_j plays the same role as w_{n+1} in the perceptron model and can be added as a weight on an additional input
  - Let K be the layer before layer J. Then
    $$I_j = \sum_{k=1}^{N_k} w_{jk} O_k \quad \text{for } j = 1, 2, \dots, N_j; \qquad O_k = h_k(I_k) \quad \text{for } k = 1, 2, \dots, N_k$$
    $$h_j(I_j) = \frac{1}{1 + e^{-\left(\sum_{k=1}^{N_k} w_{jk} O_k + \theta_j\right)/\theta_0}}$$
  - A total of $N_j \times N_k$ coefficients is necessary to specify the weighting; another $N_j$ coefficients are needed to complete the nodes at layer J

Multilayer Feedforward NN
(figure: network with an input layer, a hidden layer, and an output layer)

FFNN Neuron Model
- The classical learning algorithm of FFNN is based on the gradient descent method. For this reason, the activation functions used in FFNN are continuous functions of the weights, differentiable everywhere.
- A typical activation function that can be viewed as a continuous approximation of the step (threshold) function is the sigmoid function. A sigmoid function for node j is:
  $$\varphi(v_j) = \frac{1}{1 + e^{-a v_j}} \quad \text{with } a > 0, \qquad v_j = \sum_i w_{ji} y_i$$
  where $w_{ji}$ is the weight of the link from node i to node j and $y_i$ is the output of node i
- (figure: sigmoid curves over v ∈ [-10, 10] for increasing a)
- Increasing a sharpens the transition; as a tends to infinity, φ tends to the step function
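A tiny numpy sketch showing, in place of the lost plot, how the parameter a sharpens the sigmoid toward a step function:

```python
import numpy as np

def sigmoid(v, a=1.0):
    """phi(v) = 1 / (1 + exp(-a*v)); larger a -> closer to a step function."""
    return 1.0 / (1.0 + np.exp(-a * v))

v = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for a in (1.0, 5.0, 50.0):
    print(f"a={a:5.1f}:", np.round(sigmoid(v, a), 3))
```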

Multilayer NN
- Training by backpropagation
  Total squared error for the output layer:
  $$E_Q = \frac{1}{2} \sum_{q=1}^{N_q} (r_q - O_q)^2$$
  Adjust the weights in proportion to the partial derivatives:
  $$\Delta w_{qp} = -\alpha \frac{\partial E_Q}{\partial w_{qp}}$$
  By applying the chain rule:
  $$\frac{\partial E_Q}{\partial w_{qp}} = \frac{\partial E_Q}{\partial I_q} \frac{\partial I_q}{\partial w_{qp}}, \qquad \frac{\partial I_q}{\partial w_{qp}} = \frac{\partial}{\partial w_{qp}} \sum_{p=1}^{N_p} w_{qp} O_p = O_p$$
  How do we compute $\partial E_Q / \partial I_q$? Define
  $$\delta_q = -\frac{\partial E_Q}{\partial I_q} = -\frac{\partial E_Q}{\partial O_q} \frac{\partial O_q}{\partial I_q}, \quad \text{with } \frac{\partial E_Q}{\partial O_q} = -(r_q - O_q) \text{ and } \frac{\partial O_q}{\partial I_q} = \frac{\partial h_q(I_q)}{\partial I_q} = h_q'(I_q)$$
  so that $\delta_q = (r_q - O_q)\, h_q'(I_q)$. Finally:
  $$\Delta w_{qp} = \alpha\, (r_q - O_q)\, h_q'(I_q)\, O_p = \alpha\, \delta_q O_p$$

Multilayer NN
- Training by backpropagation: what happens in layer P?
  $$\Delta w_{pj} = \alpha\, (r_p - O_p)\, h_p'(I_p)\, O_j = \alpha\, \delta_p O_j$$
  where the error term is $\delta_p = (r_p - O_p)\, h_p'(I_p)$.
  All terms (except $r_p$) are known or can be observed in the network. How can we restate $\delta_p$ in terms of quantities that are known (observable)?
  $$\delta_p = -\frac{\partial E_p}{\partial I_p} = -\frac{\partial E_p}{\partial O_p} \frac{\partial O_p}{\partial I_p}, \qquad \frac{\partial O_p}{\partial I_p} = \frac{\partial h_p(I_p)}{\partial I_p} = h_p'(I_p)$$
  $$-\frac{\partial E_p}{\partial O_p} = -\sum_{q=1}^{N_Q} \frac{\partial E_p}{\partial I_q} \frac{\partial I_q}{\partial O_p} = -\sum_{q=1}^{N_Q} \frac{\partial E_p}{\partial I_q} \frac{\partial}{\partial O_p} \sum_{p=1}^{N_p} w_{qp} O_p = \sum_{q=1}^{N_Q} \delta_q w_{qp}$$
  Hence
  $$\delta_p = h_p'(I_p) \sum_{q=1}^{N_Q} \delta_q w_{qp}$$

Multilayer NN
- Training by backpropagation: summarizing the procedure (a coded sketch follows below)
  - For any layers K and J (K precedes J), compute the weights $w_{jk}$, which modify the connections between these two layers, by using $\Delta w_{jk} = \alpha\, \delta_j O_k$
  - If J is the output layer: $\delta_j = (r_j - O_j)\, h_j'(I_j)$
  - If J is an internal layer and P is the next layer (to the right): $\delta_j = h_j'(I_j) \sum_{p=1}^{N_p} \delta_p w_{pj}$ for $j = 1, 2, \dots, N_j$
  - Using an activation function with $\theta_0 = 1$ yields $h_j'(I_j) = O_j(1 - O_j)$, so
    $\delta_j = (r_j - O_j)\, O_j (1 - O_j)$ for the output layer
    $\delta_j = O_j (1 - O_j) \sum_{p=1}^{N_p} \delta_p w_{pj}$ for the internal layers
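A compact numpy sketch of these update rules for one hidden layer, taking θ_0 = 1 so that h' = O(1 - O); the layer sizes, learning rate, and XOR data are illustrative choices, not from the slides:

```python
import numpy as np

# delta_out = (r - O) * O * (1 - O); delta_hidden = O*(1-O) * (W_out^T delta_out)
# Delta w = alpha * delta * O_prev  (full-batch updates for simplicity)
rng = np.random.default_rng(0)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
R = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

W1 = rng.normal(0, 1, (2, 3)); b1 = np.zeros(3)  # input -> hidden
W2 = rng.normal(0, 1, (3, 1)); b2 = np.zeros(1)  # hidden -> output
alpha = 0.5

for _ in range(20000):
    H = sigmoid(X @ W1 + b1)               # hidden outputs O_j
    O = sigmoid(H @ W2 + b2)               # network outputs O_q
    d_out = (R - O) * O * (1 - O)          # output-layer delta
    d_hid = H * (1 - H) * (d_out @ W2.T)   # internal-layer delta
    W2 += alpha * H.T @ d_out; b2 += alpha * d_out.sum(axis=0)
    W1 += alpha * X.T @ d_hid; b1 += alpha * d_hid.sum(axis=0)

print(np.round(O.ravel(), 2))  # typically approaches [0, 1, 1, 0]
```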

Recent Advances in Pattern Recognition
- New applications
- Standard databases for performance benchmarking of algorithms and systems
- Fusion methods to improve performance
  - Sensors, features, classifiers
- Robustness of algorithms and systems
- Utilizing contexts or expert information
- Many successful deployments
  - Speech recognition, handwritten character recognition, face detection, fingerprint recognition, automatic vehicle guidance, visual inspection, computer-aided diagnosis for mammograms

Summary
- Pattern recognition systems aim to recognize patterns based on their features
- Classification is an important step in pattern recognition systems
- Pattern recognition algorithms and systems have been widely used in many application domains
- Challenges remain to achieve human-like performance
- More in SGN-2556 Pattern Recognition, 5th term
