Regularization For Deep Learning

Transcription

Regularization for Deep Learning
Lecture slides for Chapter 7 of Deep Learning (www.deeplearningbook.org)
Ian Goodfellow, 2016-09-27
Adapted by m.n. for CMPS 392

Definition
- "Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error."
- Developing more effective regularization strategies has been one of the major research efforts in the field.
- Deep learning take:
  - the best-fitting model (in the sense of minimizing generalization error) is a large model that has been regularized appropriately!
(Goodfellow 2016)

Regularization strategies
- Constraints: adding restrictions on the parameter values.
- Soft constraints: adding extra terms to the objective function:
  - encode prior knowledge,
  - express a generic preference for a simpler model.
- Ensemble methods:
  - combine multiple hypotheses to explain the training data.
(Goodfellow 2016)

Parameter norm penalties
- $\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha\, \Omega(\theta)$
- $\theta$: all learnable parameters (weights and biases)
- $w$: the parameters affected by the norm penalty
  - we take the weights and exclude the biases
- $\alpha \in [0, \infty)$; $\alpha = 0$ means no regularization
- $\Omega$: norm function
  - $L^1$
  - $L^2$
(Goodfellow 2016)

$L^2$ parameter regularization
- aka ridge regression (weight decay)
- $\Omega(\theta) = \frac{1}{2}\|w\|_2^2 = \frac{1}{2} w^T w$
- Update step: $w \leftarrow w - \epsilon\,(\alpha w + \nabla_w J(w)) = (1 - \epsilon\alpha)\, w - \epsilon\, \nabla_w J(w)$
  - the weights are shrunk by a multiplicative factor before the usual gradient step
- Let $w^* = \arg\min_w J(w)$ and approximate $J$ in the neighborhood of $w^*$:
  - $\hat{J}(w) = J(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$
  - no first-order term since $w^*$ is a minimum ($\nabla_w J(w^*) = 0$)
(Goodfellow 2016)
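To make the update rule concrete, here is a minimal NumPy sketch of one gradient step with $L^2$ weight decay; the toy quadratic loss and the values of `alpha` and `epsilon` are illustrative assumptions, not from the slides.

```python
import numpy as np

def l2_regularized_step(w, grad_J, alpha, epsilon):
    """One gradient step on J(w) + (alpha/2) * ||w||^2.

    Equivalent form: shrink w by (1 - epsilon*alpha), then take the usual
    gradient step on the unregularized loss J.
    """
    return (1.0 - epsilon * alpha) * w - epsilon * grad_J(w)

# Toy quadratic loss J(w) = 0.5 * (w - w_star)^T A (w - w_star)  (illustrative)
A = np.diag([10.0, 0.1])            # very different curvatures per direction
w_star = np.array([1.0, 1.0])
grad_J = lambda w: A @ (w - w_star)

w = np.zeros(2)
for _ in range(1000):
    w = l2_regularized_step(w, grad_J, alpha=1.0, epsilon=0.05)

print(w)  # the low-curvature component is shrunk much more than the high-curvature one
```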

" (regularized solution) compares toHow 𝒘unregularized solution 𝒘*? 2?What is the gradient of /𝐽 𝒘 at 𝒘2 𝑯 𝒘2 𝒘 /𝐽 𝒘 2 𝑯 𝒘2 𝒘 𝛼 𝒘2 𝟎 8𝐽 𝒘 q2 𝑯𝒘 𝑯 𝛼𝑰 𝒘2 𝑯 𝛼𝑰 "𝟏 𝑯𝒘 𝒘𝑯 is real and symmetricq q 𝑯 𝑸𝚲𝐐𝐓𝐓! 𝑸𝚲𝐐 𝛼𝑰𝒘! 𝑸 𝚲 𝛼𝑰𝒘"𝟏𝐓 𝐓𝑻 "𝟏𝑸𝚲𝐐 𝐰 𝑸𝚲𝐐 𝑸𝛼𝑰𝑸"𝟏𝑻𝑸𝑸𝚲𝐐𝐓 𝐰 𝑸 𝚲 𝛼𝑰𝒘! 𝑸 𝚲 𝛼𝑰𝑸𝚲𝐐𝐓 𝐰 "𝟏 𝑸𝑻 𝑸𝚲𝐐𝐓 𝐰 "𝟏 𝚲𝐐𝐓 𝐰 (Goodfellow 2016)

Interpretation
- $\tilde{w} = Q\, (\Lambda + \alpha I)^{-1} \Lambda\, Q^T w^*$: the projections of $w^*$ onto the eigenvectors of $H$ are rescaled
- component $i$ is multiplied by $\frac{\lambda_i}{\lambda_i + \alpha}$
  - $\lambda_i \gg \alpha$: the effect of regularization is small
  - $\lambda_i \ll \alpha$: the corresponding component is shrunk to nearly zero (by a factor of roughly $\lambda_i/\alpha$)
(Goodfellow 2016)
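A small NumPy check of the eigenvector view, under an assumed random symmetric $H$ and $w^*$: the closed-form regularized solution and the eigenvalue-rescaled version coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 4, 0.5
M = rng.normal(size=(n, n))
H = M @ M.T                      # random symmetric positive semi-definite Hessian (assumption)
w_star = rng.normal(size=n)

# Direct regularized solution: (H + alpha*I)^{-1} H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(n), H @ w_star)

# Eigenvector view: rescale each component of Q^T w* by lambda_i / (lambda_i + alpha)
lam, Q = np.linalg.eigh(H)
w_eig = Q @ (lam / (lam + alpha) * (Q.T @ w_star))

print(np.allclose(w_tilde, w_eig))  # True: both give the same regularized solution
```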

Weight decay
Figure 7.1: along a direction where the eigenvalue of $H$ is small, the unregularized objective changes little when moving away from $w^*$, so the regularized solution is pulled strongly toward the origin (the regularization effect is large); along a direction with a large eigenvalue, the regularized solution stays close to the unregularized solution $w^*$ (the regularization effect is small).
(Goodfellow 2016)

Special case: linear regression
- Cost function: $(Xw - y)^T(Xw - y) + \alpha\, w^T w$
- Normal equations (setting the gradient to zero):
  - $X^T X w - X^T y + \alpha w = 0$
  - $(X^T X + \alpha I)\, w = X^T y$
  - $w = (X^T X + \alpha I)^{-1} X^T y$
- $X^T X$ is proportional to the covariance matrix of the features; $X^T y$ captures the feature-output covariance
- Basically, we are adding $\alpha$ to the diagonal of $X^T X$
  - the diagonal elements correspond to the variance of each feature
- We perceive the data as having higher variance
  - a feature that has low covariance with the output gets shrunk even more due to this added variance
(Goodfellow 2016)
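As a quick check of the closed form, a NumPy sketch comparing the ridge solution $(X^TX + \alpha I)^{-1}X^Ty$ with ordinary least squares on synthetic data; the data-generating model and the value of $\alpha$ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features, alpha = 50, 5, 10.0
X = rng.normal(size=(n_samples, n_features))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

# Ordinary least squares vs. ridge: ridge adds alpha to the diagonal of X^T X
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

print("OLS:  ", np.round(w_ols, 3))
print("Ridge:", np.round(w_ridge, 3))   # coefficients are shrunk toward zero
```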

$L^1$ regularization
- $\Omega(\theta) = \|w\|_1 = \sum_i |w_i|$
- $\tilde{J}(w; X, y) = \alpha \|w\|_1 + J(w; X, y)$
  - $\nabla_w \tilde{J}(w; X, y) = \alpha\, \mathrm{sign}(w) + \nabla_w J(w; X, y)$
- Quadratic approximation around $w^*$: $\hat{J}(w) = J(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$, so $\nabla_w \hat{J}(w) = H(w - w^*)$
- Assume $H = \mathrm{diag}(H_{1,1}, \dots, H_{n,n})$ with $H_{i,i} > 0$
  - e.g., linear regression after PCA
- Then $\tilde{J}(w) \approx J(w^*) + \sum_i \left[\frac{1}{2} H_{i,i}(w_i - w_i^*)^2 + \alpha |w_i|\right]$
- Solution: $w_i = \mathrm{sign}(w_i^*)\, \max\!\left(|w_i^*| - \frac{\alpha}{H_{i,i}},\, 0\right)$
(Goodfellow 2016)

Interpretation
- $w_i = \mathrm{sign}(w_i^*)\, \max\!\left(|w_i^*| - \frac{\alpha}{H_{i,i}},\, 0\right)$
- If $w_i^* > 0$:
  - $w_i^* \leq \frac{\alpha}{H_{i,i}}$: $w_i = 0$
  - $w_i^* > \frac{\alpha}{H_{i,i}}$: $w_i = w_i^* - \frac{\alpha}{H_{i,i}}$, i.e., $w_i$ is shifted toward 0 by $\frac{\alpha}{H_{i,i}}$
- If $w_i^* < 0$: symmetrically, $w_i$ is either set to 0 or shifted toward 0 by $\frac{\alpha}{H_{i,i}}$
(Goodfellow 2016)
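A minimal NumPy sketch of this soft-thresholding solution, using an assumed diagonal Hessian and an assumed unregularized optimum purely for illustration.

```python
import numpy as np

def l1_solution(w_star, H_diag, alpha):
    """Soft thresholding: sign(w*) * max(|w*| - alpha / H_ii, 0)."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([ 3.0, -0.2, 0.5, -4.0])   # unregularized optimum (assumed)
H_diag = np.array([ 1.0,  1.0, 2.0,  0.5])   # diagonal of the Hessian (assumed)

print(l1_solution(w_star, H_diag, alpha=1.0))
# [ 2. -0.  0. -2.]  -- small components are zeroed out (sparsity), large ones shifted toward 0
```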

$L^1$ regularization and sparsity
- The sparsity property induced by $L^1$ regularization can be used as a feature selection mechanism
  - LASSO regression (least absolute shrinkage and selection operator)
- Equivalent to MAP Bayesian estimation with a Laplace prior
  - the prior is an isotropic Laplace distribution over $w \in \mathbb{R}^n$:
    - $\mathrm{Laplace}(w_i; 0, \tfrac{1}{\alpha}) = \frac{\alpha}{2} \exp(-\alpha |w_i|)$
    - $\log \mathrm{Laplace}(w_i; 0, \tfrac{1}{\alpha}) = \log \alpha - \log 2 - \alpha |w_i|$
  - $\log \mathrm{posterior} = \log \mathrm{likelihood} + \log \mathrm{prior} + \text{const}$
  - $\max\, \log \mathrm{posterior} \;\Leftrightarrow\; \min\, (\text{negative log likelihood} - \log \mathrm{prior})$
(Goodfellow 2016)

Norm penalties
- MAP: Maximum A Posteriori
- $L^1$:
  - encourages sparsity,
  - equivalent to MAP Bayesian estimation with a Laplace prior
- Squared $L^2$:
  - encourages small weights,
  - equivalent to MAP Bayesian estimation with a Gaussian prior
(Goodfellow 2016)

Explicit constraints
- We want to constrain $\Omega(\theta)$ to be less than some constant $k$
  - construct a generalized Lagrange function
- We can fix $\alpha$, but then we do not know the value of $k$ it corresponds to
- The regularized training problem $\tilde{J}$ is equivalent to the explicit-constraint problem for some unknown $k$!
(Goodfellow 2016)

Projection
- Sometimes we may wish to use explicit constraints rather than penalties.
  - we can modify algorithms such as stochastic gradient descent to take a step downhill on $J(\theta)$ and then project $\theta$ back to the nearest point that satisfies $\Omega(\theta) < k$.
- How to project?
  - Project onto the $L^2$ ball: rescale $\theta$ whenever its norm exceeds $k$ (see the sketch below)
  - Project onto the $L^1$ ball:
    - no closed-form solution
    - numerical solution
(Goodfellow 2016)
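A minimal NumPy sketch of projected gradient descent onto an $L^2$ ball of radius $k$; the toy loss, learning rate, and radius are assumptions for illustration.

```python
import numpy as np

def project_l2_ball(theta, k):
    """Project theta onto the L2 ball of radius k: rescale only if the norm exceeds k."""
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

# Toy quadratic loss whose unconstrained minimum lies outside the ball (illustrative)
target = np.array([3.0, 4.0])
grad = lambda theta: theta - target

theta, epsilon, k = np.zeros(2), 0.1, 1.0
for _ in range(200):
    theta = theta - epsilon * grad(theta)   # gradient step on J
    theta = project_l2_ball(theta, k)       # then project back onto {theta : ||theta||_2 <= k}

print(theta, np.linalg.norm(theta))  # ends up on the ball boundary, pointing toward the target
```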

Dataset augmentation
- The best way to regularize is to train with more data
  - create fake data and add it to the training set.
  - We can generate new $(x, y)$ pairs easily just by transforming the $x$ inputs in our training set.
  - Particularly effective for object recognition:
    - translating the training images a few pixels in each direction
    - rotating or scaling the image
- Some transformations are inappropriate:
  - horizontal flips: 'b' and 'd',
  - 180° rotations: '6' and '9'.
(Goodfellow 2016)
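A small NumPy sketch of two such transformations (pixel translation and horizontal flip) applied to an image array; the toy image and shift range are illustrative assumptions, and flipping would of course be skipped for label-sensitive data such as 'b'/'d'.

```python
import numpy as np

def random_translate(image, max_shift=2, rng=np.random.default_rng()):
    """Shift the image by a few pixels in each direction, padding with zeros."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.zeros_like(image)
    h, w = image.shape
    ys, yd = (slice(dy, h), slice(0, h - dy)) if dy >= 0 else (slice(0, h + dy), slice(-dy, h))
    xs, xd = (slice(dx, w), slice(0, w - dx)) if dx >= 0 else (slice(0, w + dx), slice(-dx, w))
    shifted[ys, xs] = image[yd, xd]
    return shifted

def horizontal_flip(image):
    return image[:, ::-1]

image = np.arange(16.0).reshape(4, 4)      # stand-in for a training image (assumed)
print(random_translate(image))
print(horizontal_flip(image))
```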

Dataset augmentation examples (figure): horizontal flip, random translation, hue shift.
(Goodfellow 2016)

Noise robustness
- Noise with infinitesimal variance can be added:
  - at the input
  - at the hidden layers
  - at the weights
(Goodfellow 2016)

Injecting noise at the weights
- Consider the regression cost $J = \mathbb{E}_{p(x,y)}\big[(\hat{y}(x) - y)^2\big]$, and perturb the weights with noise $\epsilon_W \sim \mathcal{N}(0, \eta I)$, giving the perturbed prediction $\hat{y}_{\epsilon_W}(x)$.
- For $\eta$ small:
  - $\hat{y}_{\epsilon_W}(x) \approx \hat{y}(x) + \epsilon_W^T \nabla_W \hat{y}(x)$
  - $\mathbb{E}_{p(x,y,\epsilon_W)}\big[\hat{y}_{\epsilon_W}(x)\big] \approx \mathbb{E}_{p(x,y)}\big[\hat{y}(x)\big]$ since $\mathbb{E}[\epsilon_W] = 0$
  - $\mathbb{E}_{p(x,y,\epsilon_W)}\big[\hat{y}_{\epsilon_W}(x)^2\big] \approx \mathbb{E}_{p(x,y)}\big[\hat{y}(x)^2\big] + \eta\, \mathbb{E}_{p(x,y)}\big[\|\nabla_W \hat{y}(x)\|^2\big]$
  - $\tilde{J}_W = J + \eta\, \mathbb{E}_{p(x,y)}\big[\|\nabla_W \hat{y}(x)\|^2\big]$
- Equivalent to adding a regularization term $\eta\, \mathbb{E}_{p(x,y)}\big[\|\nabla_W \hat{y}(x)\|^2\big]$
- Pushes the model into regions where the model is relatively insensitive to small variations in the weights
(Goodfellow 2016)

Special case: linear regression
- With $\hat{y} = w^T x + b$, we have $\nabla_w \hat{y}(x) = x$, so
  - $\tilde{J}_W = J + \eta\, \mathbb{E}_{p(x,y)}\big[\|x\|^2\big]$
- which is not a function of the parameters and therefore does not contribute to the gradient of the cost function w.r.t. $w$:
  - no regularization effect!
(Goodfellow 2016)

Injecting noise at the output targets
- Most datasets have some amount of mistakes in the $y$ labels.
- It can be harmful to maximize $\log p(y \mid x)$ when $y$ is a mistake.
- One way to prevent this is to explicitly model the noise on the labels.
  - For example, we can assume that for some small constant $\epsilon$, the training set label $y$ is correct with probability $1 - \epsilon$,
  - and otherwise any of the other possible labels might be correct.
- This assumption is easy to incorporate into the cost function analytically,
  - rather than by explicitly drawing noise samples.
  - For example, label smoothing regularizes a model based on a softmax with $k$ output values
    - by replacing the hard 0 targets with $\frac{\epsilon}{k-1}$ and the hard 1 targets with $1 - \epsilon$ (see the sketch below).
- Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification.
(Goodfellow 2016)
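A minimal NumPy sketch of label smoothing for a $k$-class softmax: the one-hot target is replaced by the smoothed distribution described above and the cross-entropy is computed against it. The value of $\epsilon$ and the toy logits are assumptions.

```python
import numpy as np

def smooth_labels(y_index, k, eps):
    """Replace the hard 0/1 targets with eps/(k-1) and 1-eps."""
    target = np.full(k, eps / (k - 1))
    target[y_index] = 1.0 - eps
    return target

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

k, eps = 5, 0.1
logits = np.array([2.0, 0.5, -1.0, 0.0, 0.3])   # toy model outputs (assumed)
target = smooth_labels(y_index=0, k=k, eps=eps)

p = softmax(logits)
cross_entropy = -np.sum(target * np.log(p))
print(target)         # [0.9   0.025 0.025 0.025 0.025]
print(cross_entropy)  # minimized by matching the smoothed target, not a hard one-hot
```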

Multi-task learning
- Figure 7.2: task-specific parameters in the upper layers; shared parameters, learned in a shared (unsupervised learning) context, in the lower layers.
- Among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.
(Goodfellow 2016)

Learning curves
- Early stopping: stop training at the point where validation-set performance is best (before validation error begins to rise).
Figure 7.3
(Goodfellow 2016)

Early stopping
- Probably the most commonly used form of regularization in deep learning.
  - the number of training steps (or training time) is just another hyperparameter.
- The cost is running the validation-set evaluation periodically during training:
  - reduce the size of the validation set
  - evaluate the validation loss less frequently
- Periodically save the trained model (so the parameters with the best validation error can be restored).
(Goodfellow 2016)

Early stopping algorithm (meta-algorithm: train for a fixed number of steps between evaluations, keep the parameters with the lowest validation error seen so far, and stop once the validation error has not improved for a chosen number of evaluations, the "patience").
(Goodfellow 2016)
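A minimal Python sketch of that early-stopping loop, assuming generic `train_n_steps` and `validation_error` callables; the names, the evaluation interval, and the patience value are illustrative, not from the slides.

```python
import copy

def early_stopping_train(model, train_n_steps, validation_error, n=100, patience=5):
    """Keep the parameters with the lowest validation error seen so far.

    train_n_steps(model, n): trains the model in place for n steps (assumed to exist).
    validation_error(model): returns the current validation-set error (assumed to exist).
    """
    best_model = copy.deepcopy(model)
    best_error = validation_error(model)
    best_step, step, fails = 0, 0, 0

    while fails < patience:
        train_n_steps(model, n)            # train for n more steps
        step += n
        err = validation_error(model)      # periodic validation evaluation (the cost of early stopping)
        if err < best_error:
            best_error, best_step = err, step
            best_model = copy.deepcopy(model)   # periodically save the best model
            fails = 0
        else:
            fails += 1
    return best_model, best_step, best_error
```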

Re-use the validation set
- After early stopping has determined the best number of training steps, the validation data can be folded back into training:
  - retrain from scratch on all of the data for that many steps, or
  - continue training the stopped parameters on all of the data; this second strategy is less well behaved (there is no clear criterion for when to stop the extra training).
(Goodfellow 2016)

Early stopping as a regularizer
- $\epsilon$ (learning rate) and $\tau$ (number of training steps) limit the volume of parameter space reachable from $\theta_0$ (the initial parameters).
- Early stopping is equivalent to $L^2$ regularization in the case of:
  - a simple linear model
  - with a quadratic error function
  - and simple gradient descent
- $\hat{J}(w) = J(w^*) + \frac{1}{2}(w - w^*)^T H (w - w^*)$
  - $\nabla_w \hat{J}(w) = H(w - w^*)$
- Gradient descent: $w^{(\tau)} = w^{(\tau-1)} - \epsilon\, \nabla_w \hat{J}(w^{(\tau-1)}) = w^{(\tau-1)} - \epsilon\, H (w^{(\tau-1)} - w^*)$
  - $w^{(\tau)} - w^* = (I - \epsilon H)(w^{(\tau-1)} - w^*)$
(Goodfellow 2016)

The number of steps $\tau$ corresponds to some value of the weight decay coefficient $\alpha$
- $w^{(\tau)} - w^* = (I - \epsilon H)(w^{(\tau-1)} - w^*)$, with $H = Q \Lambda Q^T$:
  - $Q^T(w^{(\tau)} - w^*) = (I - \epsilon\Lambda)\, Q^T(w^{(\tau-1)} - w^*)$
- Assume we start with $w^{(0)} = 0$:
  - $Q^T w^{(\tau)} = \big[I - (I - \epsilon\Lambda)^\tau\big]\, Q^T w^*$
  - if $\epsilon$ is small enough that $|1 - \epsilon\lambda_i| < 1$, every step brings $w$ closer to $w^*$
- $L^2$ regularization:
  - $Q^T \tilde{w} = (\Lambda + \alpha I)^{-1}\Lambda\, Q^T w^* = \big[I - (\Lambda + \alpha I)^{-1}\alpha\big]\, Q^T w^*$
- Compare:
  - early stopping and weight decay are equivalent if $(I - \epsilon\Lambda)^\tau = (\Lambda + \alpha I)^{-1}\alpha$,
  - i.e., componentwise, $(1 - \epsilon\lambda_i)^\tau = \frac{\alpha}{\lambda_i + \alpha}$
(Goodfellow 2016)
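To see the correspondence numerically, a small NumPy sketch compares early-stopped gradient descent on a quadratic objective with the $L^2$-regularized closed form, using $\alpha \approx 1/(\tau\epsilon)$ from the next slide; the Hessian eigenvalues, learning rate, and step count are illustrative assumptions chosen so the approximation holds.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))    # random orthogonal basis
lam = np.array([0.05, 0.1, 0.2, 0.3])           # small eigenvalues (assumed)
H = Q @ np.diag(lam) @ Q.T
w_star = rng.normal(size=n)

epsilon, tau = 0.01, 100                        # learning rate and early-stopped step count (assumed)
alpha = 1.0 / (tau * epsilon)                   # corresponding weight-decay coefficient (next slide)

# Early-stopped gradient descent from w = 0 on the quadratic approximation
w = np.zeros(n)
for _ in range(tau):
    w = w - epsilon * H @ (w - w_star)

# Closed-form L2-regularized solution
w_l2 = np.linalg.solve(H + alpha * np.eye(n), H @ w_star)

print(np.round(w, 3))
print(np.round(w_l2, 3))   # approximately equal when eps*lambda_i << 1 and lambda_i << alpha
```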

Early stopping advantage
- Taking logarithms of $(1 - \epsilon\lambda_i)^\tau = \frac{\alpha}{\lambda_i + \alpha}$: $\tau \log(1 - \epsilon\lambda_i) = -\log\big(1 + \frac{\lambda_i}{\alpha}\big)$
- Assume $\log(1 + x) \approx x$ for small enough $x$: if $\epsilon\lambda_i \ll 1$ and $\lambda_i/\alpha \ll 1$, then $\tau\epsilon\lambda_i \approx \frac{\lambda_i}{\alpha}$, i.e., $\alpha \approx \frac{1}{\tau\epsilon}$
- The number of training iterations $\tau$ plays a role inversely proportional to the $L^2$ regularization parameter,
  - and the inverse of $\tau\epsilon$ plays the role of the weight-decay coefficient.
- Early stopping advantage over weight decay:
  - early stopping automatically determines the correct amount of regularization,
  - while weight decay requires many training experiments with different values of its hyperparameter.
(Goodfellow 2016)

Early stopping and weight decay
Figure 7.4
(Goodfellow 2016)

Parameter tying
- Formally, we have model $A$ with parameters $w^{(A)}$ and model $B$ with parameters $w^{(B)}$.
- The two models map the input to two different, but related, outputs:
  - $\hat{y}^{(A)} = f(w^{(A)}, x)$
  - $\hat{y}^{(B)} = g(w^{(B)}, x)$
- If the tasks are related, we expect that for all $i$, $w_i^{(A)}$ should be close to $w_i^{(B)}$.
- Regularization: $\Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2$
(Goodfellow 2016)

Parameter sharing (e.g., CNNs)
- Force sets of parameters to be equal.
- Advantage:
  - only a subset of the parameters (the unique set) needs to be stored in memory.
- Natural images have many statistical properties that are invariant to translation:
  - a photo of a cat remains a photo of a cat if it is translated one pixel to the right
  - parameter sharing has allowed CNNs to dramatically lower the number of unique model parameters
(Goodfellow 2016)

Sparse representations
(Goodfellow 2016)

Bagging
- Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models:
  - train several different models separately,
  - then have the models vote on the output for test examples.
- Bagging is an example of model averaging.
  - The general term is ensemble methods.
- The reason that model averaging works is that different models will usually not make all the same errors on the test set.
(Goodfellow 2016)

Bagging example
- Consider a set of $k$ regression models.
- Suppose that each model makes an error $\epsilon_i$ on each example, with the errors drawn from a zero-mean multivariate normal distribution
  - with variances $\mathbb{E}[\epsilon_i^2] = v$ and covariances $\mathbb{E}[\epsilon_i \epsilon_j] = c$.
- The error made by the average prediction of all the ensemble models is $\frac{1}{k}\sum_i \epsilon_i$.
- The expected squared error of the ensemble predictor is:
  - $\mathbb{E}\Big[\big(\tfrac{1}{k}\sum_i \epsilon_i\big)^2\Big] = \frac{v}{k} + \frac{k-1}{k}\, c$
  - $c = v$: no gain, the expected error remains $v$
  - $c = 0$: maximum gain, the expected error is $v/k$
(Goodfellow 2016)
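A small NumPy Monte Carlo check of that formula, under assumed values of $k$, $v$, and $c$; the correlated errors are sampled from the stated multivariate normal.

```python
import numpy as np

rng = np.random.default_rng(4)
k, v, c = 5, 1.0, 0.3                              # ensemble size, variance, covariance (assumed)
cov = np.full((k, k), c) + (v - c) * np.eye(k)     # E[eps_i^2] = v, E[eps_i eps_j] = c

errors = rng.multivariate_normal(np.zeros(k), cov, size=200_000)  # one row per example
ensemble_error = errors.mean(axis=1)               # error of the averaged prediction

print(np.mean(ensemble_error ** 2))                # empirical expected squared error
print(v / k + (k - 1) / k * c)                     # theoretical value: v/k + (k-1)c/k
```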

Ensemble methods vs. bagging
- Different ensemble methods construct the ensemble of models in different ways.
- Bagging is a method that allows the same kind of model, training algorithm, and objective function to be reused several times.
- Bagging involves constructing $k$ different datasets.
  - Each dataset has the same number of examples as the original dataset,
  - but each dataset is constructed by sampling with replacement from the original dataset.
    - with high probability, each dataset is missing some of the examples from the original dataset and also contains several duplicate examples
    - on average around 2/3 of the examples from the original dataset are found in the resulting training set, if it has the same size as the original
(Goodfellow 2016)

Bagging (figure): trained on one resampled dataset, the detector learns that a loop on top of the digit corresponds to an 8; trained on another resampled dataset, the detector learns that a loop on the bottom of the digit corresponds to an 8.
(Goodfellow 2016)

Why 2/3?
- $N$: number of items; $k$: number of unique items drawn; $A$: number of drawn items.
- $P(k) = \dfrac{\frac{N!}{(N-k)!}\, {A \brace k}}{N^A}$
  - $\frac{N!}{(N-k)!}$: all permutations of $k$ among the $N$ items
  - ${A \brace k}$: all ways to distribute the $A$ drawn items among $k$ subsets such that no subset is left empty; ${A \brace k}$ is a Stirling number of the second kind
  - $N^A$: all possible ways to draw $A$ items
(Goodfellow 2016)

Expected number of duplicates
- The indicator $d_i$ corresponds to original item $i$, taking the value one if $i$ is present and zero if not.
- $P(d_i = 0) = \big(1 - \frac{1}{N}\big)^A$
- $\mathbb{E}[d_i] = 1 - \big(1 - \frac{1}{N}\big)^A$
- $\mathbb{E}[k] = \mathbb{E}\big[\sum_i d_i\big] = N\, \mathbb{E}[d_i] = N\Big(1 - \big(1 - \tfrac{1}{N}\big)^A\Big)$
- For $A = N$: $\mathbb{E}[k] = N\Big(1 - \big(1 - \tfrac{1}{N}\big)^N\Big) \approx N(1 - e^{-1}) \approx 0.632\, N$
(Goodfellow 2016)
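A quick NumPy simulation of bootstrap sampling confirming the ≈0.632 fraction; $N$ and the number of trials are assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
N, trials = 1000, 200
unique_fractions = []
for _ in range(trials):
    sample = rng.integers(0, N, size=N)          # draw N items with replacement
    unique_fractions.append(np.unique(sample).size / N)

print(np.mean(unique_fractions))                 # ~0.632
print(1 - (1 - 1 / N) ** N)                      # analytic E[k]/N, approaches 1 - 1/e
```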

More about bagging
- Neural networks reach a wide enough variety of solution points that they can often benefit from model averaging.
- Model averaging is an extremely powerful and reliable method for reducing generalization error.
  - Its use is usually discouraged when benchmarking algorithms for scientific papers.
- Machine learning contests are usually won by methods using model averaging over dozens of models.
(Goodfellow 2016)

Dropout
- Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks.
  - it removes non-output units from an underlying base network
    - by multiplying their output values by zero.
- Each time we load an example into a minibatch, we randomly sample a different binary mask to apply to all of the input and hidden units in the network.
  - The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter fixed before training begins.
    - Typically, an input unit is included with probability 0.8 and a hidden unit is included with probability 0.5.
(Goodfellow 2016)
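A minimal NumPy sketch of a dropout forward pass for a small MLP: a fresh binary mask over input and hidden units is sampled per call, with the inclusion probabilities quoted above; the layer sizes and random weights are assumptions. The inference branch previews the weight-scaling rule discussed later.

```python
import numpy as np

rng = np.random.default_rng(6)
n_in, n_hidden, n_out = 8, 16, 3
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))

def dropout_forward(x, p_in=0.8, p_hidden=0.5, train=True):
    """Forward pass with fresh binary masks on input and hidden units at training time."""
    if train:
        x = x * rng.binomial(1, p_in, size=x.shape)        # drop input units (0/1 mask)
        h = np.maximum(0.0, x @ W1)                        # ReLU hidden layer
        h = h * rng.binomial(1, p_hidden, size=h.shape)    # drop hidden units
    else:
        # weight-scaling inference rule: multiply the weights going out of each unit
        # by its inclusion probability (done here by scaling the unit activations)
        h = np.maximum(0.0, (x * p_in) @ W1)
        h = h * p_hidden
    return h @ W2

x = rng.normal(size=(4, n_in))            # a minibatch of 4 examples
print(dropout_forward(x, train=True))     # stochastic: different masks each call
print(dropout_forward(x, train=False))    # deterministic weight-scaled prediction
```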

Dropout
Figure 7.6
In networks with wider layers, the probability of dropping all possible paths from inputs to outputs becomes smaller.
(Goodfellow 2016)

Dropout vs. bagging
- More formally, suppose that a mask vector $\mu$ specifies which units to include, and $J(\theta, \mu)$ defines the cost of the model defined by parameters $\theta$ and mask $\mu$.
  - Then dropout training consists in minimizing $\mathbb{E}_\mu\big[J(\theta, \mu)\big]$.
  - The expectation contains exponentially many terms ($2^d$ for $d$ maskable units).
- Dropout training is not quite the same as bagging training:
  - In bagging, the models are all independent.
  - In dropout, the models share parameters.
  - In bagging, each model is trained to convergence on its respective training set.
  - In dropout, a tiny fraction of the possible sub-networks are each trained for a single step.
  - In both, the training set encountered by each sub-network is a subset of the original training set sampled with replacement.
(Goodfellow 2016)

Computational graph of dropout
- The entries of $\mu$ are binary and are sampled independently from each other,
  - and $\mu$ is not a function of the current value of the model parameters or the input example.
(Goodfellow 2016)

Inference
- To make a prediction, a bagged ensemble must accumulate votes from all of its members.
  - We refer to this process as inference.
- In bagging, the prediction of the ensemble is $\frac{1}{k}\sum_{i=1}^{k} p_i(y \mid x)$.
- In dropout, the arithmetic mean is $\sum_\mu p(\mu)\, p(y \mid x, \mu)$.
- The geometric mean is $\tilde{p}_{\text{ensemble}}(y \mid x) = \sqrt[2^d]{\prod_\mu p(y \mid x, \mu)}$ (over all $2^d$ masks).
- To guarantee that the result is a probability distribution,
  - we impose that none of the sub-models assigns probability 0 to any event,
  - and we renormalize the resulting distribution.
(Goodfellow 2016)

Weight scaling inference rule
- Evaluate with the trained model with all units included,
  - but with the weights going out of unit $i$ multiplied by the probability of including unit $i$ (e.g., 1/2).
  - For softmax regression this corresponds exactly to predicting the (renormalized) geometric mean of the ensemble!
- Consider a softmax regression classifier with $n$ input variables represented by the vector $v$:
  - $P(y = y' \mid v) = \mathrm{softmax}(W^T v + b)_{y'}$
- Index into the family of sub-models with a binary mask $d$:
  - $P(y = y' \mid v; d) = \mathrm{softmax}\big(W^T (d \odot v) + b\big)_{y'}$
- $\tilde{p}_{\text{ensemble}}(y = y' \mid v) = \sqrt[2^n]{\prod_{d \in \{0,1\}^n} \mathrm{softmax}\big(W^T(d \odot v) + b\big)_{y'}}$
- Ignoring the softmax normalization, this is proportional to
  - $\exp\Big(\frac{1}{2^n}\sum_{d \in \{0,1\}^n} \big(W_{:,y'}^T(d \odot v) + b_{y'}\big)\Big) = \exp\Big(\frac{1}{2} W_{:,y'}^T v + b_{y'}\Big)$,
  - i.e., a softmax classifier with the weights divided by 2.
(Goodfellow 2016)
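A NumPy sketch verifying the rule on a tiny softmax regression: enumerating all $2^n$ input masks and taking the renormalized geometric mean gives the same prediction as simply halving the weights. The weights, bias, and input are random assumptions.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)
n_in, n_classes = 6, 3
W = rng.normal(size=(n_in, n_classes))
b = rng.normal(size=n_classes)
v = rng.normal(size=n_in)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Renormalized geometric mean over all 2^n dropout masks of the input units
log_probs = []
for d in product([0, 1], repeat=n_in):
    log_probs.append(np.log(softmax(W.T @ (np.array(d) * v) + b)))
geo = np.exp(np.mean(log_probs, axis=0))
geo = geo / geo.sum()                       # renormalize

# Weight-scaling inference rule: keep all units, divide the weights by 2
scaled = softmax(0.5 * (W.T @ v) + b)

print(np.round(geo, 4))
print(np.round(scaled, 4))                  # identical for softmax regression
```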

Another perspective on dropout
- (1) Dropout is bagging with parameter sharing.
- (2) Information erasing: each hidden unit must be able to perform well regardless of which other hidden units are in the model.
  - Dropout thus regularizes each hidden unit to be not merely a good feature but a feature that is good in many contexts.
  - For example, if the model learns a hidden unit $h_i$ that detects a face by finding the nose,
  - then dropping $h_i$ corresponds to erasing the information that there is a nose in the image.
  - The model must learn another $h_i$,
    - either one that redundantly encodes the presence of a nose,
    - or one that detects the face by another feature, such as the mouth.
(Goodfellow 2016)

Adversarial examples
- Search for an input $x'$ near a data point $x$ such that the model output is very different at $x'$.
- In many cases, $x'$ can be so similar to $x$ that a human observer cannot tell the difference between the original example and the adversarial example,
  - but the network can make highly different predictions.
- Adversarial training:
  - training on adversarially perturbed examples from the training set.
- Adversarial examples are interesting in the context of regularization
  - because one can reduce the error rate on the original i.i.d. test set via adversarial training.
(Goodfellow 2016)

Adversarial examples
Figure 7.8
Training on adversarial examples is mostly intended to improve security, but can sometimes provide generic regularization.
(Goodfellow 2016)

Adversarial training
- The value of a linear function can change very rapidly if it has numerous inputs.
  - If we change each input by $\epsilon$, then a linear function with weights $w$ can change by as much as $\epsilon \|w\|_1$, which can be a very large amount if $w$ is high-dimensional.
- Adversarial training discourages this highly sensitive locally linear behavior by encouraging the network to be locally constant in the neighborhood of the training data.
- This can be seen as a way of explicitly introducing a local constancy prior into supervised neural nets.
  - The classifier may then be trained to assign the same label to $x$ and $x'$.
  - The assumption motivating this approach is that different classes usually lie on disconnected manifolds, and a small perturbation should not be able to jump from one class manifold to another class manifold.
(Goodfellow 2016)
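A hedged NumPy sketch of the worst-case linear perturbation described above: for a linear model, moving each input by $\epsilon$ in the direction of $\mathrm{sign}(w)$ (a sign-of-gradient step, in the spirit of the fast gradient sign method) changes the output by exactly $\epsilon \|w\|_1$. The model, $\epsilon$, and dimensionality are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, eps = 1000, 0.01
w = rng.normal(size=n)
b = 0.0
x = rng.normal(size=n)

f = lambda x: w @ x + b                 # a high-dimensional linear model (assumed)

# Worst-case perturbation with each input changed by at most eps:
# move every coordinate by eps in the direction of sign(w)
x_adv = x + eps * np.sign(w)

print(f(x_adv) - f(x))                  # change in the model output
print(eps * np.abs(w).sum())            # equals eps * ||w||_1, large when n is large
```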

Conclusion
- This chapter has described most of the general strategies used to regularize neural networks.
- Regularization is a central theme of machine learning.
- Our next topic is: optimization.
(Goodfellow 2016)
