Learning Spatiotemporal Features With 3D Convolutional Networks

Transcription

Learning Spatiotemporal Features with 3D Convolutional Networks
Written by: Du Tran, Lubomir Bourdev, Rob Fergus and Lorenzo Torresani
Presented by: Leonardo Ferrer
ECS 289G - October 27, 2016

Motivation
Image credits: George Seffers (2014); Garry Nichols (N/A)

The need for a video descriptor
– Generic
– Compact
– Efficient
– Simple

Contributions: The 3D ConvNet
Image credit: Du Tran (2014) http://blog.dutran.org/?p=252

From 2D to 3D Convolution
Image obtained from the paper.
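The difference between the two can be sketched as a shape calculation (a minimal illustration, not the paper's code; the helper name and kernel sizes are assumptions): a 2D convolution applied to a multi-frame clip collapses the temporal axis after a single layer, whereas a 3D convolution slides along time and preserves it.

```python
def conv_output_shape(in_shape, kernel):
    """Valid-convolution output size per dimension: out = in - k + 1."""
    return tuple(i - k + 1 for i, k in zip(in_shape, kernel))

# 3D convolution: a 3x3x3 kernel slides over (frames, height, width),
# so the output still has a temporal axis.
out3d = conv_output_shape((16, 112, 112), (3, 3, 3))   # (14, 110, 110)

# 2D convolution on a clip: the 16 frames fold into input channels, so
# after the first layer only (height, width) remain -- time is gone.
out2d = conv_output_shape((112, 112), (3, 3))          # (110, 110)
```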

Architecture For 3D ConvNets (On UCF101)
Common network settings:
– All video frames are resized to 128x171 pixels.
– Videos are split into non-overlapping 16-frame clips.
– Random jittering is used to crop the frames to 112x112.
– Input: a 3x16x112x112 volume.
– The network architecture is similar to AlexNet, but all convolutional layers have attached pooling layers.
– Softmax loss layer to predict action labels.
Original slide from: Çağdaş Bak (2016) http://web.cs.hacettepe.edu.tr/ pdf

Architecture For 3D ConvNets (On UCF101)
Common network settings (continued):
– All max pooling kernels are 2x2x2, except for the first one, which does not collapse the signal temporally.
– The two fully connected layers have 2048 outputs.
– Trained with mini-batch SGD with batches of 30 clips.
– The initial learning rate was 0.003.
– The learning rate is divided by 10 every 4 epochs.
– Training lasted 16 epochs.
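The decay schedule described above can be written out directly (a small sketch; the function name and signature are mine, not from the paper's code):

```python
def learning_rate(epoch, base_lr=0.003, decay_every=4, factor=10.0):
    """SGD learning rate at a given 0-indexed epoch: base_lr / 10^(epoch // 4)."""
    return base_lr / (factor ** (epoch // decay_every))

# Training lasts 16 epochs, so the rate steps through four plateaus:
# 0.003 (epochs 0-3), 3e-4 (4-7), 3e-5 (8-11), 3e-6 (12-15).
schedule = [learning_rate(e) for e in range(16)]
```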

Architecture For 3D ConvNets (On UCF101)
Varying network architecture:
– The convolution kernels operate on d frames at a time.
– Homogeneous temporal depth: depth-d for d = 1, 3, 5, 7.
– Varying temporal depth: Increasing: 3-3-5-5-7; Decreasing: 7-5-5-3-3.
Original slide from: Çağdaş Bak (2016) http://web.cs.hacettepe.edu.tr/ pdf

3D Convolution Kernel Temporal Depth Search
Original slide from: Çağdaş Bak (2016) http://web.cs.hacettepe.edu.tr/ pdf

The C3D Architecture
– 3D convolution filters are 3x3x3 with stride 1x1x1.
– 3D pooling layers are 2x2x2 with stride 2x2x2 (except for pool1, which has kernel size 1x2x2 and stride 1x2x2).
– Each fully connected layer has 4096 output units.
– This architecture was the largest that could fit in GPU memory.
Image obtained from the paper.
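Those pooling settings can be traced through the 16x112x112 input volume (a sketch under the slide's numbers, not the released code; note that with valid, unpadded pooling this lands on 1x3x3 at pool5, whereas the released C3D pads pool5 spatially to reach 1x4x4 — that padding detail comes from the paper's architecture description, not this slide):

```python
def pooled_shape(shape, kernel, stride):
    """(T, H, W) after a valid (unpadded) max pool: out = (in - k) // s + 1."""
    return tuple((i - k) // s + 1 for i, k, s in zip(shape, kernel, stride))

shape = (16, 112, 112)                              # frames x height x width
shape = pooled_shape(shape, (1, 2, 2), (1, 2, 2))   # pool1: time axis untouched
for _ in range(4):                                  # pool2..pool5 halve all dims
    shape = pooled_shape(shape, (2, 2, 2), (2, 2, 2))
# The 3x3x3 convolutions use stride 1 (with padding), so they preserve shape;
# only the pooling layers shrink the volume.
```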

Video Classification: Sports-1M Results
– DeepVideo and C3D have comparable inputs (16 frames).
– Convolution Pooling uses a much longer clip length, hence its better results.
– Since Convolution Pooling already uses long videos, its clip hit accuracy (on 2-second clips) is not comparable.
Table obtained from the paper.

Action Recognition: UCF101 Results
– For iDT (improved Dense Trajectories), they used a bag-of-words representation with a codebook size of 5000 for each feature channel of iDT: trajectories, HOG, HOF, MBHx, and MBHy.
– C3D achieves the best performance among methods using only RGB features.
Table obtained from the paper.

Action Similarity Labeling: ASLAN Results
Images obtained from the paper.

Scene and Object Recognition: YUPENN and Maryland Results
– The model was trained on Sports-1M, whereas both of these datasets are made of egocentric videos.
– There was no fine-tuning, and C3D uses half of the available video resolution.
Table obtained from the paper.

Performance Analysis on UCF101
– Brox's algorithm for optical flow is a necessary input for a baseline comparison.
– iDT is not available on GPU.
Table obtained from the paper.

What does C3D learn?
Conv2 learns edges, orientations, and color changes.
Image obtained from the paper.

What does C3D learn?
Conv3 learns moving corners, textures, objects, and body parts.
Image obtained from the paper.

What does C3D learn?
Conv5 learns motion. In this example, moving circular objects.
Image obtained from the paper.

What does C3D learn?
Conv5 learns motion. In this example, biking-like motion.
Image obtained from the paper.

What does C3D learn?
Conv5 learns motion. In this example, applying make-up or brushing teeth.
Image obtained from the paper.

What does C3D learn?
Conv5 learns motion. In this example, balance-beam-like motions.
Image obtained from the paper.

What does C3D learn?
A visualization of conv5 features: from appearance to motion.
Image obtained from the paper.

What does C3D learn?
C3D focuses on salient motions, while optical flow encodes all movement.
Image obtained from the paper.

How does C3D describe videos?
– C3D video descriptor: a 4096-dimensional descriptor obtained from averaged fc6 activations.
Images obtained from the paper.
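A sketch of that descriptor computation (the function name and list-of-lists input format are assumptions; the final L2-normalization step is described in the paper, not on this slide):

```python
import math

def c3d_descriptor(fc6_activations):
    """Average per-clip fc6 activations (num_clips lists of 4096 floats)
    into a single 4096-dim video descriptor, then L2-normalize it."""
    n = len(fc6_activations)
    dim = len(fc6_activations[0])
    avg = [sum(clip[i] for clip in fc6_activations) / n for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in avg))
    return [x / norm for x in avg]
```

For a video split into N non-overlapping 16-frame clips, each clip's fc6 layer yields one 4096-dim vector; the video descriptor is their normalized mean.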

How does resolution affect C3D?
– net-128 provides a good compromise between accuracy and performance.
Images obtained from the paper.

Strengths and Weaknesses
– High-performance method that allows for real-time applications.
– Features are generalizable and allow for transfer learning approaches.
– Simple approach with state-of-the-art results.
– Input clips are too short for some applications.
– The resolution is too low.

Applications and Research Questions
– Automatic video summarization
– Video highlights for events
– Adding iDT improved performance. What happens with other types of descriptors or derived information?
– How do the fc layers alter performance?

Conclusions
– 3D ConvNets are superior to 2D ConvNets for video tasks because they model appearance and motion simultaneously.
– The ideal temporal kernel depth is 3, which matches the spatial kernel depth.
– C3D features paired with an SVM perform very well in several video classification tasks.

Links
– Source code:
  http://vlg.cs.dartmouth.edu/c3d
  https://github.com/facebook/C3D
– More information:
  http://web.cs.hacettepe.edu.tr/ pdf
  http://blog.dutran.org/?p=252
– A paper that improves this one: https://arxiv.org/pdf/1604.06573v2.pdf
