Fei-Fei Li & Justin Johnson & Serena Yeung - Stanford University

Transcription

Lecture 9: CNN Architectures (May 1, 2018)

Administrative
- A2 due Wed May 2
- Midterm: in-class Tue May 8. Covers material through Lecture 10 (Thu May 3).
- Sample midterm released on Piazza.
- Midterm review session: Fri May 4 discussion section

Last time: Deep learning frameworks
- PaddlePaddle (Baidu)
- Chainer
- Caffe (UC Berkeley), Caffe2 (Facebook)
- MXNet: developed by U Washington, CMU, MIT, Hong Kong U, etc., but the main framework of choice at AWS
- CNTK (Microsoft)
- Deeplearning4j
- Theano (U Montreal), TensorFlow (Google)
- And others.

Last time: Deep learning frameworks
(1) Easily build big computational graphs
(2) Easily compute gradients in computational graphs
(3) Run it all efficiently on GPU (wrap cuDNN, cuBLAS, etc.)

Today: CNN Architectures
Case Studies
- AlexNet
- VGG
- GoogLeNet
- ResNet
Also:
- NiN (Network in Network)
- Wide ResNet
- ResNeXt
- Stochastic Depth
- Squeeze-and-Excitation Network
- DenseNet
- FractalNet
- SqueezeNet
- NASNet

Review: LeNet-5 [LeCun et al., 1998]
Conv filters were 5x5, applied at stride 1. Subsampling (pooling) layers were 2x2, applied at stride 2. I.e., the architecture is [CONV-POOL-CONV-POOL-FC-FC].
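To make the layer arithmetic concrete, here is a minimal PyTorch sketch of that [CONV-POOL-CONV-POOL-FC-FC] layout; the filter counts (6 and 16), hidden size (120), tanh activations, and 32x32 grayscale input are taken from the original paper, not from the slide.

    import torch
    import torch.nn as nn

    # Sketch of the slide's simplified LeNet-5: [CONV-POOL-CONV-POOL-FC-FC].
    lenet5 = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5, stride=1), nn.Tanh(),   # 32x32x1 -> 28x28x6
        nn.AvgPool2d(kernel_size=2, stride=2),                 # -> 14x14x6
        nn.Conv2d(6, 16, kernel_size=5, stride=1), nn.Tanh(),  # -> 10x10x16
        nn.AvgPool2d(kernel_size=2, stride=2),                 # -> 5x5x16
        nn.Flatten(),
        nn.Linear(16 * 5 * 5, 120), nn.Tanh(),                 # FC
        nn.Linear(120, 10),                                    # FC -> 10 classes
    )
    print(lenet5(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])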

Case Study: AlexNet [Krizhevsky et al. 2012]
Architecture: CONV1 - MAX POOL1 - NORM1 - CONV2 - MAX POOL2 - NORM2 - CONV3 - CONV4 - CONV5 - MAX POOL3 - FC6 - FC7 - FC8

Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
Q: What is the output volume size? Hint: (227-11)/4 + 1 = 55

Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
Output volume: [55x55x96]
Q: What is the total number of parameters in this layer?

Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
Output volume: [55x55x96]
Parameters: (11*11*3)*96 = 35K
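A quick sanity check of both numbers, as a small Python helper (hypothetical, not from the slides):

    def conv_output_size(n, f, stride, pad=0):
        """Output spatial size of a conv layer: (N - F + 2*pad) / stride + 1."""
        return (n - f + 2 * pad) // stride + 1

    def conv_weight_count(f, c_in, n_filters):
        """Learnable weights: each filter is FxFxC_in, and there are n_filters."""
        return f * f * c_in * n_filters

    print(conv_output_size(227, 11, stride=4))  # 55 -> output volume 55x55x96
    print(conv_weight_count(11, 3, 96))         # 34848 ~= 35K (plus 96 biases)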

Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Q: What is the output volume size? Hint: (55-3)/2 + 1 = 27

Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Output volume: 27x27x96
Q: What is the number of parameters in this layer?

Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Output volume: 27x27x96
Parameters: 0!
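Pooling follows the same sliding-window arithmetic but has nothing to learn; a one-line check in the same style:

    def pool_output_size(n, f, stride):
        """Pooling output size: (N - F) / stride + 1. Pooling has no parameters."""
        return (n - f) // stride + 1

    print(pool_output_size(55, 3, stride=2))  # 27 -> output volume 27x27x96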

Case Study: AlexNet [Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
After POOL1: 27x27x96
...

Case Study: AlexNet [Krizhevsky et al. 2012]
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
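The layer list above translates almost directly into a framework definition. A minimal single-GPU PyTorch sketch (the original split feature maps across 2 GPUs, as noted below; the LRN size of 5 follows the paper, and dropout on FC6/FC7 follows the "dropout 0.5" detail in the retrospectives below):

    import torch
    import torch.nn as nn

    alexnet = nn.Sequential(
        nn.Conv2d(3, 96, kernel_size=11, stride=4),               # CONV1 -> 55x55x96
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),                    # POOL1 -> 27x27x96
        nn.LocalResponseNorm(5),                                  # NORM1
        nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),   # CONV2 -> 27x27x256
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),                    # POOL2 -> 13x13x256
        nn.LocalResponseNorm(5),                                  # NORM2
        nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),  # CONV3
        nn.ReLU(),
        nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1),  # CONV4
        nn.ReLU(),
        nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),  # CONV5
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),                    # POOL3 -> 6x6x256
        nn.Flatten(),
        nn.Linear(6 * 6 * 256, 4096), nn.ReLU(), nn.Dropout(0.5), # FC6
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),        # FC7
        nn.Linear(4096, 1000),                                    # FC8 (class scores)
    )
    print(alexnet(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])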

Case Study: AlexNet [Krizhevsky et al. 2012]
(Full AlexNet architecture as above.)
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD momentum 0.9
- learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%

Case Study: AlexNet [Krizhevsky et al. 2012]
(Full AlexNet architecture as above; CONV1 output split as [55x55x48] x 2, one half per GPU.)
Historical note: trained on a GTX 580 GPU with only 3 GB of memory. The network was spread across 2 GPUs, with half the neurons (feature maps) on each GPU.

Case Study: AlexNet [Krizhevsky et al. 2012]
CONV1, CONV2, CONV4, CONV5: connections only with feature maps on the same GPU.

Case Study: AlexNet [Krizhevsky et al. 2012]
CONV3, FC6, FC7, FC8: connections with all feature maps in the preceding layer; communication across GPUs.

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
[Bar chart of winning networks by year: shallow models before 2012, 8 layers in 2012 and 2013, 19 and 22 layers in 2014, 152 layers from 2015 on.]

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
First CNN-based winner: AlexNet (8 layers, 2012).

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
ZFNet: improved hyperparameters over AlexNet.

ZFNet [Zeiler and Fergus, 2013]
AlexNet but:
- CONV1: change from (11x11 stride 4) to (7x7 stride 2)
- CONV3, 4, 5: instead of 384, 384, 256 filters, use 512, 1024, 512
ImageNet top-5 error: 16.4% -> 11.7%

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Deeper networks.

Case Study: VGGNet [Simonyan and Zisserman, 2014]
Small filters, deeper networks:
- 8 layers (AlexNet) -> 16-19 layers (VGGNet)
- Only 3x3 CONV stride 1, pad 1, and 2x2 MAX POOL stride 2
- 11.7% top-5 error in ILSVRC'13 (ZFNet) -> 7.3% top-5 error in ILSVRC'14
[Diagrams: AlexNet vs. VGG16 vs. VGG19.]

Case Study: VGGNet [Simonyan and Zisserman, 2014]
Q: Why use smaller filters? (3x3 conv)

Case Study: VGGNet [Simonyan and Zisserman, 2014]
A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer.
Q: What is the effective receptive field of three 3x3 conv (stride 1) layers?

Case Study: VGGNet [Simonyan and Zisserman, 2014]
A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer: [7x7].

Case Study: VGGNet [Simonyan and Zisserman, 2014]
A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer, but is deeper, with more non-linearities, and has fewer parameters: 3 * (3^2 C^2) vs. 7^2 C^2 for C channels per layer.
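Plugging in a concrete channel count makes the comparison tangible (C = 256 here is just an example value):

    # Weights for C input and C output channels per layer, biases ignored:
    # three stacked 3x3 convs vs. a single 7x7 conv.
    C = 256
    three_3x3 = 3 * (3 * 3 * C * C)   # 3 * 3^2 * C^2 = 27 C^2
    one_7x7 = 7 * 7 * C * C           # 7^2 * C^2 = 49 C^2
    print(three_3x3, one_7x7)         # 1769472 vs 3211264: ~45% fewer weights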

VGG16 (not counting biases):
INPUT: [224x224x3]        memory: 224*224*3 = 150K     params: 0
CONV3-64: [224x224x64]    memory: 224*224*64 = 3.2M    params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64]    memory: 224*224*64 = 3.2M    params: (3*3*64)*64 = 36,864
POOL2: [112x112x64]       memory: 112*112*64 = 800K    params: 0
CONV3-128: [112x112x128]  memory: 112*112*128 = 1.6M   params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128]  memory: 112*112*128 = 1.6M   params: (3*3*128)*128 = 147,456
POOL2: [56x56x128]        memory: 56*56*128 = 400K     params: 0
CONV3-256: [56x56x256]    memory: 56*56*256 = 800K     params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256]    memory: 56*56*256 = 800K     params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256]    memory: 56*56*256 = 800K     params: (3*3*256)*256 = 589,824
POOL2: [28x28x256]        memory: 28*28*256 = 200K     params: 0
CONV3-512: [28x28x512]    memory: 28*28*512 = 400K     params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512]    memory: 28*28*512 = 400K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512]    memory: 28*28*512 = 400K     params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512]        memory: 14*14*512 = 100K     params: 0
CONV3-512: [14x14x512]    memory: 14*14*512 = 100K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]    memory: 14*14*512 = 100K     params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512]    memory: 14*14*512 = 100K     params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512]          memory: 7*7*512 = 25K        params: 0
FC: [1x1x4096]            memory: 4096                 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096]            memory: 4096                 params: 4096*4096 = 16,777,216
FC: [1x1x1000]            memory: 1000                 params: 4096*1000 = 4,096,000

(Same VGG16 table as above.)
TOTAL memory: 24M * 4 bytes ≈ 96MB / image (for a forward pass)
TOTAL params: 138M parameters

(Same VGG16 table as above.)
Note: most memory is in the early CONV layers; most params are in the late FC layers.
TOTAL memory: 24M * 4 bytes ≈ 96MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
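The per-layer numbers can be regenerated from a compact spec. The sketch below (a hypothetical helper, biases ignored as in the table) confirms the ~138M parameter total and the note above: FC layers dominate the parameter count, while the early conv layers dominate activation memory.

    # VGG16 as a compact spec: numbers are output depths of 3x3/stride 1/pad 1
    # convs; 'M' is a 2x2 max pool with stride 2.
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
           512, 512, 512, 'M', 512, 512, 512, 'M']

    h, c_in = 224, 3
    conv_acts, conv_weights = [], []
    for v in cfg:
        if v == 'M':
            h //= 2                                 # pooling halves H and W
        else:
            conv_weights.append(3 * 3 * c_in * v)   # 3x3 filter weights, no biases
            c_in = v
            conv_acts.append(h * h * v)             # activations stored per image

    fc_weights = [7 * 7 * 512 * 4096, 4096 * 4096, 4096 * 1000]
    total = sum(conv_weights) + sum(fc_weights)
    print(f"total params: {total / 1e6:.0f}M")                    # ~138M
    print(f"FC share of params: {sum(fc_weights) / total:.0%}")   # ~89%
    print(f"first two conv layers: {sum(conv_acts[:2]) / 1e6:.1f}M activations")  # 6.4M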

(Same VGG16 table as above, annotated with the common names for each conv/pool stage.)

Case Study: VGGNet [Simonyan and Zisserman, 2014]
Details:
- ILSVRC'14 2nd in classification, 1st in localization
- Similar training procedure to Krizhevsky 2012
- No Local Response Normalisation (LRN)
- Use VGG16 or VGG19 (VGG19 is only slightly better, uses more memory)
- Use ensembles for best results
- FC7 features generalize well to other tasks

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
Deeper networks.

Case Study: GoogLeNet [Szegedy et al., 2014]
Deeper networks, with computational efficiency:
- 22 layers
- Efficient "Inception" module
- No FC layers
- Only 5 million parameters! 12x fewer than AlexNet
- ILSVRC'14 classification winner (6.7% top-5 error)

Case Study: GoogLeNet [Szegedy et al., 2014]
"Inception module": design a good local network topology (network within a network) and then stack these modules on top of each other.

Case Study: GoogLeNet [Szegedy et al., 2014]
Naive Inception module: apply parallel filter operations on the input from the previous layer:
- Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- Pooling operation (3x3)
Concatenate all filter outputs together depth-wise.

Case Study: GoogLeNet [Szegedy et al., 2014]
(Naive Inception module as above.)
Q: What is the problem with this? [Hint: computational complexity]

Case Study: GoogLeNet [Szegedy et al., 2014]
Q: What is the problem with this? [Hint: computational complexity]
Example: module input is 28x28x256.

Case Study: GoogLeNet [Szegedy et al., 2014]
Example (module input 28x28x256):
Q1: What is the output size of the 1x1 conv with 128 filters?

Case Study: GoogLeNet [Szegedy et al., 2014]
Q1: What is the output size of the 1x1 conv with 128 filters?
28x28x128

Case Study: GoogLeNet [Szegedy et al., 2014]
Q2: What are the output sizes of all the different filter operations?

Case Study: GoogLeNet [Szegedy et al., 2014]
Q2: What are the output sizes of all the different filter operations?
28x28x128 (1x1 conv), 28x28x192 (3x3 conv), 28x28x96 (5x5 conv), 28x28x256 (3x3 pool)

Case Study: GoogLeNet [Szegedy et al., 2014]
Q3: What is the output size after filter concatenation?

Case Study: GoogLeNet [Szegedy et al., 2014]
Q3: What is the output size after filter concatenation?
28x28x(128+192+96+256) = 28x28x672

Case Study: GoogLeNet [Szegedy et al., 2014]
Output size after filter concatenation: 28x28x(128+192+96+256) = 28x28x672
Conv ops:
[1x1 conv, 128]: 28x28x128x1x1x256
[3x3 conv, 192]: 28x28x192x3x3x256
[5x5 conv, 96]: 28x28x96x5x5x256
Total: 854M ops
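The 854M figure can be verified by counting one multiply per output value per filter tap, a quick hypothetical check matching the slide's accounting:

    # Ops for the naive Inception module on a 28x28x256 input:
    # each output value of a conv needs f*f*C_in multiplies.
    H = W = 28
    C_in = 256
    branches = [(1, 128), (3, 192), (5, 96)]   # (filter size, number of filters)
    ops = sum(H * W * n * f * f * C_in for f, n in branches)
    print(f"{ops / 1e6:.0f}M ops")             # 854M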

Case Study: GoogLeNet [Szegedy et al., 2014]
Total: 854M ops. Very expensive compute.
The pooling layer also preserves feature depth, which means the total depth after concatenation can only grow at every layer!

Case Study: GoogLeNet [Szegedy et al., 2014]
Output size after filter concatenation: 28x28x(128+192+96+256) = 28x28x672
Solution: "bottleneck" layers that use 1x1 convolutions to reduce feature depth.

Reminder: 1x1 convolutions
A 1x1 CONV with 32 filters applied to a 56x56x64 input: each filter has size 1x1x64 and performs a 64-dimensional dot product, giving a 56x56x32 output.

Reminder: 1x1 convolutions
A 1x1 CONV with 32 filters preserves the spatial dimensions (56x56) but reduces the depth from 64 to 32: it projects the depth to a lower dimension (a combination of feature maps).
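Since each 1x1 filter is just a 64-dimensional dot product at every spatial position, the whole operation is a pointwise matrix multiply; a small numpy illustration:

    import numpy as np

    # A 1x1 conv over a 56x56x64 input with 32 filters is a 64->32
    # matrix multiply applied independently at every spatial position.
    x = np.random.randn(56, 56, 64)
    w = np.random.randn(64, 32)   # each column is one 1x1x64 filter
    y = x @ w                     # broadcasts over the 56x56 positions
    print(y.shape)                # (56, 56, 32): same spatial size, lower depth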

Case Study: GoogLeNet [Szegedy et al., 2014]
[Diagrams: the naive Inception module vs. the Inception module with dimension reduction.]

Case Study: GoogLeNet [Szegedy et al., 2014]
Inception module with dimension reduction: 1x1 conv "bottleneck" layers are inserted into the module.

Case Study: GoogLeNet [Szegedy et al., 2014]
Inception module with dimension reduction: using the same parallel layers as the naive example, and adding "1x1 conv, 64 filter" bottlenecks (module input: 28x28x256).
Conv ops:
[1x1 conv, 64]: 28x28x64x1x1x256
[1x1 conv, 64]: 28x28x64x1x1x256
[1x1 conv, 128]: 28x28x128x1x1x256
[3x3 conv, 192]: 28x28x192x3x3x64
[5x5 conv, 96]: 28x28x96x5x5x64
[1x1 conv, 64]: 28x28x64x1x1x256
Total: 358M ops, compared to 854M ops for the naive version.
The bottleneck can also reduce depth after the pooling layer.

Case Study: GoogLeNet [Szegedy et al., 2014]
Stack Inception modules with dimension reduction on top of each other.

Case Study: GoogLeNet [Szegedy et al., 2014]
Full GoogLeNet architecture. Stem network: Conv - Pool - 2x Conv - Pool.

Case Study: GoogLeNet [Szegedy et al., 2014]
Full GoogLeNet architecture: stacked Inception modules.

Case Study: GoogLeNet [Szegedy et al., 2014]
Full GoogLeNet architecture: classifier output.

Case Study: GoogLeNet [Szegedy et al., 2014]
Full GoogLeNet architecture: classifier output (removed the expensive FC layers!).

Case Study: GoogLeNet [Szegedy et al., 2014]
Full GoogLeNet architecture: auxiliary classification outputs to inject additional gradient at lower layers (AvgPool - 1x1 Conv - FC - FC - Softmax).
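A sketch of one such auxiliary head in PyTorch; the slide gives only the AvgPool-1x1Conv-FC-FC-Softmax ordering, so the specific sizes here (5x5 average pooling at stride 3, 128 filters, 1024 hidden units, a 14x14x512 input feature map) are taken from the GoogLeNet paper, not from the slide:

    import torch
    import torch.nn as nn

    aux_head = nn.Sequential(
        nn.AvgPool2d(kernel_size=5, stride=3),   # 14x14 -> 4x4
        nn.Conv2d(512, 128, kernel_size=1),      # 1x1 conv, 128 filters
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(4 * 4 * 128, 1024), nn.ReLU(), # FC
        nn.Linear(1024, 1000),                   # FC -> class scores (softmax in the loss)
    )
    print(aux_head(torch.randn(1, 512, 14, 14)).shape)  # torch.Size([1, 1000])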

Case Study: GoogLeNet [Szegedy et al., 2014]
Full GoogLeNet architecture: 22 total layers with weights (parallel layers count as 1 layer => 2 layers per Inception module; don't count the auxiliary output layers).

Case Study: GoogLeNet [Szegedy et al., 2014]
Deeper networks, with computational efficiency:
- 22 layers
- Efficient "Inception" module
- No FC layers
- 12x fewer parameters than AlexNet
- ILSVRC'14 classification winner (6.7% top-5 error)

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners
"Revolution of Depth": 152-layer networks win from 2015 on, vs. 8-22 layers before.

Case Study: ResNet [He et al., 2015]
Very deep networks using residual connections:
- 152-layer model for ImageNet
- ILSVRC'15 classification winner (3.57% top-5 error)
- Swept all classification and detection competitions in ILSVRC'15 and COCO'15!
[Diagram: residual block computing F(x) + x, with an identity shortcut from the input X and relu activations.]

Case Study: ResNet [He et al., 2015]
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?

Case Study: ResNet [He et al., 2015]
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?
Q: What's strange about these training and test curves? [Hint: look at the order of the curves]

Case Study: ResNet [He et al., 2015]
What happens when we continue stacking deeper layers on a "plain" convolutional neural network?
The 56-layer model performs worse on both training and test error. The deeper model performs worse, but this is not caused by overfitting!

Case Study: ResNet [He et al., 2015]
Hypothesis: the problem is an optimization problem; deeper models are harder to optimize.

Case Study: ResNet [He et al., 2015]
Hypothesis: the problem is an optimization problem; deeper models are harder to optimize.
The deeper model should be able to perform at least as well as the shallower model: a solution by construction is copying the learned layers from the shallower model and setting the additional layers to identity mappings.

Case Study: ResNet [He et al., 2015]
Solution: use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping.
[Diagrams: "plain" layers fitting H(x) directly vs. a residual block computing F(x) + x with an identity shortcut from X.]

Case Study: ResNet [He et al., 2015]
Solution: use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping:
H(x) = F(x) + x
Use the layers to fit the residual F(x) = H(x) - x instead of fitting H(x) directly.

Case Study: ResNet [He et al., 2015]
Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers

Case Study: ResNet [He et al., 2015]
Full ResNet architecture (continued):
- Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each dimension), e.g. from "3x3 conv, 64 filters" to "3x3 conv, 128 filters, /2 spatially with stride 2"

Case Study: ResNet [He et al., 2015]
Full ResNet architecture (continued):
- Additional conv layer at the beginning

Case Study: ResNet [He et al., 2015]
Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each dimension)
- Additional conv layer at the beginning
- No FC layers besides the final FC 1000 to output class scores
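As a minimal sketch, the residual block described above might look as follows in PyTorch; batch norm placement follows the original paper, and the stride-2 projection shortcut is one common way to realize the periodic downsampling (an assumption beyond the slide text):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        """Two 3x3 convs computing F(x); the block outputs F(x) + x."""
        def __init__(self, c_in, c_out, stride=1):
            super().__init__()
            self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(c_out)
            self.conv2 = nn.Conv2d(c_out, c_out, 3, stride=1, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(c_out)
            # Identity shortcut normally; a 1x1 projection when the shape changes.
            if stride == 1 and c_in == c_out:
                self.shortcut = nn.Sequential()
            else:
                self.shortcut = nn.Sequential(
                    nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                    nn.BatchNorm2d(c_out))

        def forward(self, x):
            out = F.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))        # F(x)
            return F.relu(out + self.shortcut(x))  # F(x) + x, then relu

    block = ResidualBlock(64, 128, stride=2)        # double filters, /2 spatially
    print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])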
