Machine Learning & Deep Learning

Transcription

Machine Learning & Deep Learning — 3D graphics solutions in biometric applications
Leonardo Tanzi, Ph.D. Student — leonardo.tanzi@polito.it

Table of Contents

Machine Learning: Definition, Types of Learning, Regression, Classification
Deep Learning: Definition, Neural Network, Convolutional Neural Network
Case Study: Coding a CNN

Machine Learning

Definition

Application of Artificial Intelligence that provides a system with the ability to learn automatically from experience without being explicitly programmed for the given task. (Arthur Samuel, 1959)

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. (Tom Mitchell, 1998)

[Diagram: Input Data → Analyze Data → Find Patterns → Predict → Learn from Feedback]

Types of Learning

Supervised Learning: the machine learns from labeled data
Unsupervised Learning: the machine learns without labeled data
Reinforcement Learning: the machine learns on its own

Supervised Learning

[Diagram: Data + Labels → Training → Model; New Data → Model → Prediction ("It's an Apple!")]

Unsupervised Learning

[Diagram: Unlabeled Data → Training → Model → Search for a pattern]

Reinforcement Learning

[Diagram: Unknown Data → Response ("It's a banana!") → Feedback ("Wrong! It's an Apple!") → Learning ("Noted") → New Data → Reinforced prediction ("It's an Apple!")]

Regression vs. Classification

Regression: used when the output is continuous (e.g. the price of a house). Algorithm: Linear Regression.
Classification: used when the output is categorical (e.g. Yes or No). Algorithms: Logistic Regression, KNN, Decision Trees, SVM.

Supervised Learning

Regression: Definition

A regression model maps input variables to a continuous output variable.

Size (m²)   Price (€)
100         120.000
53          80.000
220         340.000
78          110.000

New size: 131 m² → continuous output variable: 145.000 €

Regression: Linear Regression

House Price Prediction: estimate the price of a house given its size.

We want to fit a line that best approximates the behavior of the training data. We call this function the hypothesis h(x), and it is a function of $w_1$ (slope) and $w_0$ (y-intercept):

$h(x) = w_0 + w_1 x$

How can we fit this line? Find the values of $w_0$ and $w_1$ which minimize the difference between the predictions and the actual values.

Regression: Loss and Gradient Descent

Squared Error loss, which should be minimized:

$J(w_0, w_1) = \frac{1}{2}\sum (h_w(x) - y)^2$, where $h_w(x) = w_0 + w_1 x$

Changing $w_0$ moves the line up and down; changing $w_1$ changes the slope.

In the beginning, the line has random $w_0$ and $w_1$. Gradient descent moves the point (by updating the values of $w_0$ and $w_1$) in the direction where the loss function decreases. This process is repeated until a minimum is reached.
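To make this concrete, here is a minimal NumPy sketch of gradient descent on the squared-error loss, using toy data inspired by the house-price table above. The feature scaling step is an addition (not on the slide) that makes gradient descent converge quickly.

```python
import numpy as np

# Toy data inspired by the slide's table: sizes (m^2) vs prices
x = np.array([100.0, 53.0, 220.0, 78.0])
y = np.array([120_000.0, 80_000.0, 340_000.0, 110_000.0])

# Feature scaling (assumed addition; speeds up convergence on unscaled data)
x_mean, x_std = x.mean(), x.std()
xs = (x - x_mean) / x_std

w0, w1 = 0.0, 0.0   # start from an arbitrary line
alpha = 0.1         # learning rate

for _ in range(1000):
    h = w0 + w1 * xs                     # hypothesis h(x) = w0 + w1*x
    # Gradients of J(w0, w1) = 1/2 * sum((h(x) - y)^2)
    w0 -= alpha * (h - y).sum()
    w1 -= alpha * ((h - y) * xs).sum()

# Predicted price for a new 131 m^2 house
print(w0 + w1 * ((131 - x_mean) / x_std))
```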

Classification: Definition

[Diagram: Labeled input variables (Fruits, Animals) → Classification Model → Discrete output]

Classification: Important Terminologies

Classifier: an algorithm used to map the input data to a specific category.
Classification model: the model that predicts a class given an input data sample.
Feature: an individual measurable property of the observed data.
Train: the process where the model learns to distinguish between classes.
Predict: the process where the model makes a prediction on unseen data.
Binary classification: condition with two outcomes, which are either true or false.
Multi-class classification: condition with more than two classes, where each sample is assigned to one and only one label.
Multi-label classification: condition where each sample is assigned to a set of labels or targets.

Classification: Metrics

Training a classifier to predict whether a person is diabetic or healthy (binary output). Four possible outcomes:

True Positive (TP): predicted diabetic, correct label diabetic (good)
True Negative (TN): predicted healthy, correct label healthy (good)
False Positive (FP): predicted diabetic, correct label healthy (bad)
False Negative (FN): predicted healthy, correct label diabetic (bad, especially FN)

How many people were correctly labeled among all the people?
$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

How many of those labeled as diabetic are actually diabetic?
$Precision = \frac{TP}{TP + FP}$

Of all the diabetic people, how many were correctly identified?
$Sensitivity = \frac{TP}{TP + FN}$

Of all the healthy people, how many were correctly identified?
$Specificity = \frac{TN}{FP + TN}$
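These four metrics follow directly from the confusion-matrix counts. The sketch below computes them; the counts for the diabetic/healthy example are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Binary-classification metrics from the four confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)  # correct over all predictions
    precision   = tp / (tp + fp)                   # labeled diabetic and truly diabetic
    sensitivity = tp / (tp + fn)                   # diabetic people correctly found
    specificity = tn / (fp + tn)                   # healthy people correctly found
    return accuracy, precision, sensitivity, specificity

# Hypothetical counts for the diabetic/healthy example
print(classification_metrics(tp=80, tn=90, fp=10, fn=20))
```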

Classification: Logistic Regression

In linear regression, we computed a continuous value (the price of a house given its size). In logistic regression, we want to output a discrete value Ε·. For example, in a binary problem, given the size of a house, we want to build an algorithm that returns 0 if we can afford it and 1 if we can't.

We need an S-shaped function to fit this data: the Sigmoid,

$\hat{y} = \frac{1}{1 + e^{-x}}$

Classification: Logistic Regression

Starting from the linear regression function $h_w(x) = w_0 + w_1 x$, we add a Sigmoid function that squeezes the output between 0 and 1. In this way, the system will predict values between 0 and 1 (values β‰₯ 0.5 will be assigned to one class, values < 0.5 to the other):

$h_w(x) = \sigma(w_0 + w_1 x)$

The loss function has to be minimized:

$J(\hat{y}, y) = -(y \log \hat{y} + (1 - y)\log(1 - \hat{y}))$

NB: in our case, Ε· has a value between 0 and 1; for this reason, the minimum of $-\log \hat{y}$ is obtained for Ε· = 1.

If y = 0: $J(\hat{y}, y) = -\log(1 - \hat{y})$. When the label y is 0, we want Ε· as close to 0 as possible. The minimum of $-\log(1 - \hat{y})$ is obtained when Ε· = 0, because $-\log(1 - 0) = -\log(1) = 0$.

If y = 1: $J(\hat{y}, y) = -\log \hat{y}$. When the label y is 1, we want Ε· as close to 1 as possible. The minimum of $-\log \hat{y}$ is obtained when Ε· = 1, because $-\log(1) = 0$.
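A small Python sketch of the sigmoid hypothesis and this loss. The weights w0, w1 and the input size are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    # Squeezes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y, y_hat):
    # J(y_hat, y) = -(y*log(y_hat) + (1-y)*log(1-y_hat))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# h_w(x) = sigmoid(w0 + w1*x), with made-up weights
w0, w1 = -4.0, 0.05
x = 120.0                           # size of the house
y_hat = sigmoid(w0 + w1 * x)        # predicted probability of class 1
print(y_hat, bce_loss(1.0, y_hat))  # loss is small when y=1 and y_hat is near 1
```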

Classification: KNN Intuition

K Nearest Neighbors: we have a distribution; when a new sample arrives, we assign it to a class by looking at the classes of the k nearest samples.

K = 1: we assign the new sample to the banana class.
K = 3: we assign it to the apple class.
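A minimal sketch of this intuition in Python; the 2D points and class labels are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Assign x_new the majority class among its k nearest training samples."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical 2D samples: apples vs bananas
X = np.array([[1.0, 1.2], [1.1, 0.9], [3.0, 3.1], [2.9, 3.3]])
y = np.array(["apple", "apple", "banana", "banana"])
print(knn_predict(X, y, np.array([1.2, 1.0]), k=1))  # -> "apple"
```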

Classification: Decision Trees Intuition

If we have a distribution like this, we cannot fit an S-shaped function as we showed for logistic regression. Based on this data, we can build a Decision Tree that is able to return the prediction for a new person.

[Example tree: Age < 12.5? Yes → does not like cinemas; No → second split: Yes → does not like cinemas; No → does like cinemas]
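Since the tree above is just nested yes/no questions, it can be written as plain conditionals. The second threshold below is an assumption, as the slide's exact value is not legible in the transcription.

```python
def likes_cinema(age):
    # Root split from the slide
    if age < 12.5:
        return "Does not like cinemas"
    # Assumed second split; the slide's threshold is not legible
    if age > 60:
        return "Does not like cinemas"
    return "Does like cinemas"

print(likes_cinema(25))  # -> "Does like cinemas"
```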

Classification: SVM Intuition

We have just seen that we cannot fit an S-shaped function to this specific distribution.

Support Vector Machine: we represent data in the lowest dimension possible, and with SVM we can add a new axis to the data (e.g. AgeΒ²) and move the points in a way that makes it relatively easy to draw a straight line that correctly classifies people.

Support Vector Machines use something called Kernel Functions, which can be used to systematically find higher-dimensional axes.
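A tiny sketch of this idea, using a polynomial feature map that adds an AgeΒ² axis; the ages are made-up data.

```python
import numpy as np

# Hypothetical 1D ages; suppose the classes are not separable by one threshold
ages = np.array([5.0, 15.0, 30.0, 45.0, 70.0])

# Feature map: each point (Age) becomes (Age, Age^2);
# in this 2D space a straight line can separate the classes
lifted = np.column_stack([ages, ages ** 2])
print(lifted)
```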

Deep Learning

Introduction

Many features / big datasets → classic ML algorithms become inefficient → alternative: Deep Learning.

Deep Learning is a branch of Machine Learning which uses neural networks for computation.

Main idea: neural networks are algorithms inspired by the human brain. The brain uses just a single learning algorithm that can do anything; can we find an approximation of what the brain does and try to implement it? [Illustration: Auditory Cortex]

Neuron

A biological neuron in the brain receives inputs from the dendrites, performs some computation, and outputs the result through the axon.

An artificial neuron tries to mimic this behavior: it receives some inputs (x₁, xβ‚‚, x₃) and computes a linear function using weights (w₁, wβ‚‚, w₃):

$z = w_1 x_1 + w_2 x_2 + w_3 x_3$

and applies a non-linear activation function to formulate the hypothesis:

$h_w(x) = g(z)$, where g is the non-linear activation function.
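A one-function sketch of this artificial neuron. The inputs, weights, and the choice of tanh as activation are illustrative; the bias term is omitted, as on the slide.

```python
import numpy as np

def neuron(x, w, g=np.tanh):
    # Linear combination z = w1*x1 + w2*x2 + w3*x3 (no bias, as on the slide),
    # followed by a non-linear activation g
    z = np.dot(w, x)
    return g(z)

# Made-up inputs and weights
print(neuron(x=np.array([0.5, -1.0, 2.0]), w=np.array([0.1, 0.4, 0.2])))
```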

Neural Network

A neural network is a group of these neurons. Each neuron computes a linear combination of its inputs and weights and then applies a non-linear activation. For example, for the first layer:

$a_1^{(1)} = g(w_{11}^{(1)} x_1 + w_{21}^{(1)} x_2 + w_{31}^{(1)} x_3)$
$a_2^{(1)} = g(w_{12}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{32}^{(1)} x_3)$

Notation: in $a_n^{(k)}$, n indicates the neuron and k the layer; in $w_{ij}^{(k)}$, i is the index of the neuron of the input layer, j the index of the neuron of the hidden layer, and k is the layer.

Non-linear activation functions make neural networks flexible and able to fit just about any data.

[Diagram: Fully Connected Network (FCN) with input layer (x₁, xβ‚‚, x₃), hidden layers, and output $h_w(x)$]

Neural Network

For the second layer:

$a_1^{(2)} = g(w_{11}^{(2)} a_1^{(1)} + w_{21}^{(2)} a_2^{(1)} + w_{31}^{(2)} a_3^{(1)} + w_{41}^{(2)} a_4^{(1)})$

And finally:

$h_w(x) = g(w_{11}^{(3)} a_1^{(2)} + w_{21}^{(3)} a_2^{(2)} + w_{31}^{(3)} a_3^{(2)} + w_{41}^{(3)} a_4^{(2)})$

Each $w^{(k)}$ is a matrix of weights that controls the mapping from layer k to layer k+1.
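Stacking these layers gives the full forward pass. Here is a NumPy sketch for the 3-4-4-1 network drawn on the slides, with sigmoid as the (assumed) activation g and random weights standing in for trained ones.

```python
import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))  # sigmoid as the non-linear activation

x = np.array([0.2, 0.7, 0.1])           # inputs x1, x2, x3
W1 = np.random.randn(4, 3)              # maps the 3 inputs to 4 hidden neurons
W2 = np.random.randn(4, 4)              # maps layer 1 to layer 2
W3 = np.random.randn(1, 4)              # maps layer 2 to the single output

a1 = g(W1 @ x)    # a(1) = g(W(1) x)
a2 = g(W2 @ a1)   # a(2) = g(W(2) a(1))
h  = g(W3 @ a2)   # h_w(x)
print(h)
```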

Neural Network Classification

MNIST: 10 handwritten digits, 28Γ—28 images → 28Γ—28 = 784 input neurons.

[Diagram: Input layer → n hidden layers → Output with Softmax, e.g. digit "6"]

Softmax is a function used for multiclass classification that turns a vector of K real values into a vector of K real values that sum to 1:

[2.33, -1.46, 0.56] → Softmax → [0.83, 0.01, 0.14]
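A direct implementation of Softmax, applied to the slide's example vector. Subtracting the maximum is a standard numerical-stability trick, not on the slide.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract the max for numerical stability
    return e / e.sum()

# ~[0.84, 0.02, 0.14], close to the slide's rounded [0.83, 0.01, 0.14]
print(softmax(np.array([2.33, -1.46, 0.56])))
```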

Neural Network Training

[Diagram: 28Γ—28 input → Input layer → n hidden layers → Output → Softmax → Loss (compared with the Label)]

One of the most used functions for multi-class classification is the cross-entropy loss, defined as:

$CE = -\sum_i y_i \log(\hat{y}_i)$

$y_i$ is the label vector with one-hot encoding, meaning that the number "6" is coded as [0 0 0 0 0 0 1 0 0 0]. $\hat{y}_i$ is the (wrong) prediction after the Softmax, let's say [0.01, 0.02, 0.11, 0.70, 0.01, 0.08, 0.03, 0.01, 0.02, 0.01], which is actually a "3". Thus, CE will be equal to -log(0.03), as all the other values are multiplied by 0. Given the minus sign, in minimizing this function we are trying to maximize the value for class "6" in the output vector.
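The slide's cross-entropy computation can be checked in a couple of lines:

```python
import numpy as np

y     = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])   # one-hot label for "6"
y_hat = np.array([0.01, 0.02, 0.11, 0.70, 0.01,
                  0.08, 0.03, 0.01, 0.02, 0.01])   # wrong prediction (a "3")

ce = -np.sum(y * np.log(y_hat))  # every term vanishes except -log(y_hat[6])
print(ce, -np.log(0.03))         # both ~3.51
```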

Gradient Descent & Backpropagation

As we already said, the loss function represents how similar the prediction is to the actual values. For this reason, the more we minimize the loss function, the more accurate the prediction will be. To do this, we need two powerful tools: gradient descent and backpropagation.

Optimization: we want to find the value of w which minimizes J(w). To do this, we update w at each step by the quantity:

$w = w - \alpha \frac{dJ(w)}{dw}$

$\alpha$ is the learning rate, a parameter that determines how big the steps are in the direction of the minimum. $\frac{dJ(w)}{dw}$ is the derivative of J(w) with respect to w, i.e., the slope. Thus, if we are at point 1 (to the left of the minimum), the slope is negative, so $-\alpha \frac{dJ(w)}{dw}$ is a positive value, meaning that we increase w in the direction of the minimum. On the contrary, if we are at point 2 (to the right of the minimum), the slope is positive, so $-\alpha \frac{dJ(w)}{dw}$ is a negative value, meaning that we decrease w in the direction of the minimum.

Backpropagation is the technique that allows us to compute the derivative of the loss with respect to the parameters.
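A minimal PyTorch sketch of this update rule on a toy loss, where .backward() plays the role of backpropagation; the loss function and starting point are made up for illustration.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)  # arbitrary starting point
alpha = 0.1                                # learning rate

for _ in range(50):
    J = (w - 3.0) ** 2       # a toy loss with its minimum at w = 3
    J.backward()             # backpropagation: computes dJ/dw into w.grad
    with torch.no_grad():
        w -= alpha * w.grad  # gradient descent step: w = w - alpha * dJ/dw
        w.grad.zero_()

print(w.item())  # ~3.0
```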

Neural Network Overview

[Diagram: 28Γ—28 input → Input layer → n hidden layers → Output → Softmax → Loss (compared with the Label)]

In summary, during training, we take an image (or more than one, i.e., a batch), pass it through the network, and make a prediction. We use the prediction and the actual label to compute the loss. We then use backpropagation to compute the derivative of the loss with respect to all the weights. Finally, we apply gradient descent to adjust the value of the weights in the direction of the minimum. This process is repeated until the minimum (or a local minimum) of the loss is reached, meaning that the network has (hopefully) learned to distinguish the different classes.

Reminder: all this works with 28Γ—28Γ—1 images, which can be flattened to just 784 neurons. But if we use high-res images, such as 1920Γ—1080Γ—3, the input layer would need around 6 million neurons, becoming computationally infeasible.

Important Terminologies

Batch: the set of samples used in one iteration (that is, one gradient update) of training.
Epoch: a full training pass over the entire dataset, such that each example has been seen once.
Learning rate: the size of the steps of the gradient descent algorithm.
Overfitting: the model matches the training data so closely that it fails to make correct predictions on new data.
Underfitting: the model has poor predictive ability because it hasn't captured the complexity of the training data.
Train/Validation/Test: the three subsets into which the main dataset is usually divided.
Regularization: techniques which penalize model complexity and prevent overfitting.
Transfer Learning: transferring information from a model trained on one task to another task.
Hyper-parameters: non-learnable parameters, such as the learning rate, number of epochs, etc.

Convolutional Neural Network

Using neural networks with images has two main disadvantages:
1. Too much computation
2. They are sensitive to the location of an object in an image (if we move the object, the NN may not be able to recognize it)

How does the human brain recognize images? By detecting simple patterns, e.g. a loopy circle plus a vertical line → "It's a 9!". This approach is replicated with filters in Convolutional Neural Networks.

[Illustration: 3Γ—3 filters of 1s and -1s that recognize loops and vertical lines]

Convolutional Neural Network

A CNN is an algorithm that can take in an input image and apply filters that, during training, learn to distinguish various aspects/objects in the image.

Compared to training a network with flattened images as before, with CNNs it is possible to capture the spatial dependencies in an image. The architecture better fits the image dataset thanks to the reduction in the number of parameters involved and the reuse of shared weights.

[Illustration: convolution extracts features; max/average pooling reduces size]
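A short PyTorch sketch of one convolution followed by max pooling. The 3Γ—3 filter here is hand-made for illustration; in a real CNN the filter values would be learned.

```python
import torch
import torch.nn.functional as F

img = torch.rand(1, 1, 28, 28)               # batch of one 28x28 grayscale image

kernel = torch.tensor([[[[ 1.,  1., 1.],
                         [ 1., -1., 1.],
                         [ 1.,  1., 1.]]]])  # one hand-made 3x3 filter

features = F.conv2d(img, kernel, padding=1)  # convolution: 28x28 preserved by padding
reduced  = F.max_pool2d(features, 2)         # max pooling: halves H and W -> 14x14
print(features.shape, reduced.shape)
```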

Convolutional Neural Network

Filters are location-invariant feature detectors (the loopy circle is detected wherever it is in the image). In the illustrations below, the highlighted part of the output corresponds to the neurons activated by a loopy circle in the input image.

In a CNN, these filters are not hardcoded as in this example but learned during training.

Convolutional Neural Network

[Architecture: Input 28Γ—28Γ—1 → Convolution, $n_1$ kernels of size 5Γ—5 → 28Γ—28Γ—$n_1$ → Pooling 2Γ—2 → 14Γ—14Γ—$n_1$ → Convolution, $n_2$ kernels of size 5Γ—5Γ—$n_1$ → 14Γ—14Γ—$n_2$ → Pooling 2Γ—2 → 7Γ—7Γ—$n_2$ → FCN with $n_3$ neurons → Classification, 10 neurons]

In the first convolution layer, you slide each of the $n_1$ filters (with a step size named stride) over the 28Γ—28Γ—1 input and compute the convolution operation. This results in an output of size 28Γ—28Γ—$n_1$. The height and width are the same because we used padding (dashed external square), and $n_1$ is the depth dimension, as we are using $n_1$ filters and each filter produces a 28Γ—28 output. The pooling operation simply halves both the height and the width.

The role of the Convolution and Pooling layers is to reduce the images and extract features without losing information which is critical for getting a good prediction. The final FCN learns to handle the variety (in position, shape, etc.) of the high-level features extracted by the first part of the network.
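Here is a minimal PyTorch sketch of this architecture. The values for n1, n2, n3 and the ReLU activations are assumptions; the slide leaves the sizes as parameters.

```python
import torch.nn as nn

class CNN(nn.Module):
    """Sketch of the slide's architecture, assuming n1=6, n2=16, n3=120."""
    def __init__(self, n1=6, n2=16, n3=120):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, n1, kernel_size=5, padding=2),   # 28x28x1  -> 28x28xn1
            nn.ReLU(),
            nn.MaxPool2d(2),                              #          -> 14x14xn1
            nn.Conv2d(n1, n2, kernel_size=5, padding=2),  #          -> 14x14xn2
            nn.ReLU(),
            nn.MaxPool2d(2),                              #          -> 7x7xn2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * n2, n3),  # the final FCN with n3 neurons
            nn.ReLU(),
            nn.Linear(n3, 10),          # 10 output neurons (digits 0-9)
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```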

Case Study: Coding a CNN

Coding a CNN

We are now going to implement and train the network below. A Google Colab file is available here to reproduce the results.

Five main steps (see the sketch after this list):
1. Load the dataset
2. Define the CNN
3. Create the CNN and the optimizer
4. Train the model
5. Test the model

NB: teaching Python and PyTorch is outside the scope of this lesson. For this reason, the code will be discussed in the form of pseudocode, to understand the general behavior without focusing too much on the implementation.
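In that spirit, here is a hedged PyTorch sketch of the five steps, reusing the CNN class sketched two slides above. The batch size, learning rate, and number of epochs are assumptions; the actual Colab notebook may differ.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# 1. Load the dataset (MNIST via torchvision; path and batch size are assumptions)
train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=64, shuffle=True)

# 2-3. Define/create the CNN (the class sketched earlier) and the optimizer
model = CNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()  # combines Softmax and cross-entropy in one call

# 4. Train the model
for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)  # forward pass + loss
        loss.backward()                        # backpropagation
        optimizer.step()                       # gradient descent step

# 5. Test the model on the held-out test split
test_set = datasets.MNIST("data", train=False, download=True,
                          transform=transforms.ToTensor())
test_loader = DataLoader(test_set, batch_size=256)
with torch.no_grad():
    correct = sum((model(x).argmax(1) == y).sum().item()
                  for x, y in test_loader)
print("Test accuracy:", correct / len(test_set))
```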
