Deep Learning: Project Final Report - Stanford University


Deep Learning: Project Final Report

Casey Long
Department of Computer Science
Stanford University
clong80@stanford.edu

Abstract

Deep learning neural networks aid in side channel cryptanalysis against AES-128 implementations running on an 8-bit RISC (AVR) CPU architecture.

1 Introduction

Power Analysis side-channel attacks correlate the power consumption of cryptographic operations to estimate both a key value and its timestamp in a power trace. A trace refers to a set of power consumption measurements taken across a cryptographic operation. Often, this is depicted as an X-Y plot, with current or voltage on the Y-axis and time on the X-axis.

A Simple Power Analysis (SPA) involves directly interpreting the visual trace. Because block cipher encryption algorithms like AES are deterministic and public, correlating the power consumption to certain cryptographic operations can reveal execution and data path points of the algorithm. The role of the cryptographic engineer is to prevent leakage of cryptographic operations in traces to an adversary. Often, this involves reducing power to minimize signal strength, or introducing noise to degrade measurement quality.

More advanced power analysis side-channel attacks take advantage of large datasets of traces to measure small variations in power consumption. These variations are not intuitively obvious, and differences are expressed in terms of covariances.

2 Background

A more thorough investigation of the AES-128 algorithm is deferred to other papers. The salient points of the AES-128 algorithm important to this paper are:

- There are four main functions: add round key, substitute bytes, shift rows, and mix columns. They are permutations of each other; i.e., they are chained together.
- The add round key and substitute bytes functions are of particular interest to side-channel analysis, and involve a bitwise XOR and a constant-time table lookup, respectively. This table is known as the Rijndael S-Box.
- substitute bytes represents a non-linear mapping that breaks the 128-bit key produced by add round key into 16 bytes, which serve as the indices for the S-Box lookup. Because of its non-linearity, it is difficult to alter or safeguard this function while preserving this mapping. In addition to the small 8-bit index, this makes substitute bytes a source of weakness.

Stanford CS230 student. CS230: Deep Learning, Spring 2020, Stanford University, CA. (LaTeX template borrowed from NIPS 2017.)
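Since the S-Box is public, its non-linear structure can be reproduced directly from its definition: a multiplicative inverse in GF(2^8) followed by an affine map. The sketch below is a minimal illustration (not the paper's code) that rebuilds the table and shows the intermediate value sbox(p XOR k) that side-channel attacks target; the plaintext and key bytes at the end are arbitrary example values.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p

def gf_inv(a):
    """Multiplicative inverse via a^254; 0 maps to 0 by convention."""
    if a == 0:
        return 0
    result, base, e = 1, a, 254
    while e:
        if e & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        e >>= 1
    return result

def affine(x):
    """The Rijndael affine transformation over GF(2), with constant 0x63."""
    out = 0
    for i in range(8):
        bit = ((x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
               ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out

SBOX = [affine(gf_inv(x)) for x in range(256)]

# The attack target: substitute bytes applied to the add round key output
p, k = 0x42, 0x5A            # arbitrary example plaintext and key bytes
intermediate = SBOX[p ^ k]
print(hex(SBOX[0x53]))       # FIPS-197 gives SubBytes(0x53) = 0xED
```

The non-linearity referred to above is exactly the GF(2^8) inversion step; the affine map alone would be linear and easy to protect.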

During a DPA attack, an attacker targets a single 8-bit block. This is what gives DPA attacks their strength; brute force complexity is reduced from O(2^128) to O(16 · 2^8), i.e., O(2^8) per key byte. The key schedule for the AES-128 algorithm is provided.

Here is a general procedure for a differential attack:

- We assume a priori knowledge of the plaintext or ciphertext (or both) values for a fixed key value and the algorithm used. Intuitively, this means we have access to the device under attack (DUT), and are able to monitor the encryption or decryption process. This is possible with things such as encrypted bootloaders, where we can continually reset the device, or commercial-off-the-shelf equipment that is not unique.
- We want to calculate the maximum likelihood estimate (MLE) for the key byte. Because it is only 8 bits, we can brute force it. Thus, we have 2^8 = 256 possible "classes" that we can bin each estimate into.
- For each key estimate, we correlate it to the power trace. This correlation uses some leakage function that maps the key byte to a power intensity. A common choice for this leakage function is the Hamming Weight. The intuition here is that an "on" or "1" bit is related to power consumption.

3 Related work

Many power analysis algorithms can provide an MLE. The ones discussed in this paper include a Correlation Power Analysis (CPA), Linear Regression Analysis (LRA), a Multilayer Perceptron (MLP), and a Convolutional Neural Network (CNN).

- [29] is one of the most important works in the field of side-channel analysis. Published in 1999, this seminal work introduces one of the first feasible concepts of statistical analysis of trace datasets to attack a microcontroller, called a Differential Power Analysis. Many other power analysis algorithms such as CPA and LRA are directly derived from this work.
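A Difference of Means DPA of the sort introduced by [29] can be sketched on synthetic data. The sketch below is a hedged simulation, not the paper's implementation: the S-box is a stand-in random permutation (a real attack would use the Rijndael S-Box), and the key byte, trace count, and noise level are illustrative choices. Each single-sample "trace" leaks the Hamming Weight of sbox(p XOR k) plus Gaussian noise, and each of the 256 key-byte classes is scored by a difference of means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the Rijndael S-box: any fixed, public 8-bit permutation
# illustrates the attack mechanics.
SBOX = rng.permutation(256)
HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming weights

TRUE_KEY = 0x5A                       # hypothetical fixed key byte
N = 5000
plaintexts = rng.integers(0, 256, size=N)
# Simulated traces: leakage = HW(sbox(p XOR k)) + measurement noise
traces = HW[SBOX[plaintexts ^ TRUE_KEY]] + rng.normal(0.0, 0.5, size=N)

def dpa_difference_of_means(plaintexts, traces, bit=0):
    """Score each of the 256 key-byte classes by a difference of means."""
    scores = np.empty(256)
    for guess in range(256):
        # Predict one output bit of substitute bytes under this key guess
        predicted = (SBOX[plaintexts ^ guess] >> bit) & 1
        # A correct guess splits the traces into two populations whose
        # mean power consumption differs; a wrong guess does not.
        scores[guess] = abs(traces[predicted == 1].mean()
                            - traces[predicted == 0].mean())
    return scores

scores = dpa_difference_of_means(plaintexts, traces)
print(hex(int(scores.argmax())))      # MLE key byte
```

The argmax over the 256 scores is the MLE described in the procedure above; only the correct guess partitions the traces consistently with the actual leakage.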
While DPA in this work refers to a specific Difference of Means algorithm to measure differences, the term "differential power analysis" is often used interchangeably for any power analysis attack that uses some statistical difference measurement to gain inference about the MLE key byte.
- [26] introduces a Correlation Power Analysis.
- [28] introduces a Linear Regression Analysis.
- Additionally, a literature review indicates success using Support Vector Machines (SVM) [14][15][16][17], Random Forests [17][18], Multilayer Perceptrons (MLP) [19][20][21], and Convolutional Neural Networks (CNN) [22][23]. Due to the relevance of the last two to deep learning, more attention will be emphasized there.

4 Dataset and Features

4.1 Description

The dataset consists of 60,000 AES-128 power traces extracted from an ATMega8515 (AVR architecture) microcontroller, partitioned into 10,000 test and 50,000 train cases. It is a time series dataset. Each data point consists of three groups of information:

- traces: contains an index number, with a timestamp and raw power measurement
- labels: the AES substitution box (i.e., a Rijndael S-box) values. We denote the substitution box as S(p ⊕ k), where p is our plaintext value and k is our key value.
- metadata: associated with every timestamp are the truth values for the plaintext, ciphertext, key, and mask used during that timestamp. A mask is an obfuscation technique to protect AES implementations by randomizing the intermediate results, thus adding noise to power traces. Not all traces are masked.

This 5 GB dataset is freely available from the National Cybersecurity Agency of France (Agence nationale de la sécurité des systèmes d'information, ANSSI). The ANSSI Side Channel Attack

Figure 1: snr4

Database (ASCAD) is in HDF5 format, which can be parsed with Python's h5py package. ANSSI developed this database with the intention of it becoming an MNIST-like library for side-channel attacks. Support for Keras, Tensorflow, and GPU acceleration is provided.

4.2 Dimensionality Reduction

According to the Nyquist-Shannon sampling theorem, the measurement frequency must be higher than the frequency of the measured device under attack (i.e., the clock rate). Often, the measurement frequency may be in the GHz range for microcontrollers clocked in the MHz range. This is due to the resolution required for a certain attack and the low power emitted from such devices, which make useful measurements sparse and most measurements noisy. [] lists various dimensionality reduction techniques to hone in on Points of Interest. These include Difference of Means based methods (DOM), Sum of Squared Differences (SOSD), Correlation Power Analysis based methods (CPA), Sum of Squared pairwise T-differences (SOST), Signal-to-Noise ratio (SNR), Variance based methods (VAR), Mutual Information Analysis (MIA), Kolmogorov-Smirnov Analysis (KSA), and Principal Component Analysis (PCA).

The SNR method was chosen because the ASCAD authors also used this method. The SNR is generally defined as

    SNR = \frac{Signal}{Noise}    (1)

In this paper, it is specifically defined as

    SNR = \frac{\sigma^2_{\bar{x}}}{\mu_{\sigma^2}}    (2)

Intuitively, the numerator is the variance (across classes) of the per-class mean traces, while the denominator is the mean (across classes) of the per-class variances. The graph shown is the result, and is a replication of what ASCAD similarly produced. The range of [45400, 46100] was chosen because the leakage model functions snr4 and snr5 were too simple; their representations are easy to spot.
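The SNR of Equation (2) can be sketched as follows. This is a hedged illustration on synthetic traces, not the ASCAD tooling: with the real database, the trace matrix and 256-class labels would come from the HDF5 file (e.g., via h5py), while here the leak position, trace count, and noise level are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for labeled traces: the S-box output byte (the class
# label) leaks as its Hamming Weight at a single sample index.
n_traces, n_samples, leak_at = 5000, 100, 42
labels = rng.integers(0, 256, size=n_traces)
traces = rng.normal(0.0, 1.0, size=(n_traces, n_samples))
hw = np.array([bin(v).count("1") for v in range(256)])
traces[:, leak_at] += hw[labels]

def snr(traces, labels, n_classes=256):
    """Per-sample SNR: variance of class means over mean of class variances."""
    means = np.array([traces[labels == c].mean(axis=0)
                      for c in range(n_classes)])
    variances = np.array([traces[labels == c].var(axis=0)
                          for c in range(n_classes)])
    return means.var(axis=0) / variances.mean(axis=0)

s = snr(traces, labels)
print(int(s.argmax()))  # the leaking sample index stands out
```

Samples where power depends on the class show a large ratio; pure-noise samples hover near zero, which is how the Points of Interest window is selected.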
The function snr1 could not be seen, as this is considered a cryptographically secure implementation with no first-order leakage (i.e., simple linear attacks will not be sufficient for predicting key byte values).

5 Methods

A Correlation Power Analysis and a Linear Regression Analysis were used to baseline the model. This was to help compare the MLP and CNN models to older and more traditional models, to give perspective on their efficacy.

- CPA: The CPA is a corollary of the Pearson correlation coefficient:

    \rho_{X,Y} = \frac{cov(X, Y)}{\sigma_X \sigma_Y}    (3)

    \rho = \frac{N \sum Tr_i H_i - \sum Tr_i \sum H_i}{\sqrt{N \sum Tr_i^2 - (\sum Tr_i)^2} \sqrt{N \sum H_i^2 - (\sum H_i)^2}}    (4)

  where Tr_i is our trace at sample index i, s.t. i \in \{0, \ldots, N_{traces}\}, and H_i is shorthand for HammingWeight(sbox(P_i \oplus K_i)). Intuitively, we are trying to map the raw power intensity to the Hamming Weight of the bits produced by the add round key and substitute bytes functions.

- LRA: An LRA is conceptually similar to any other linear regression. We are trying to fit our key-byte classes to a spline, instead of a fixed dimensional polynomial. In this case, we create basis functions in the matrix M. The mathematics behind these functions delves more into cryptography and finite fields, so we won't go down that rabbit hole. The big takeaway is that we create a coefficient matrix that we can apply to our trace data. These

splines represent a fit approximation to each of the 256 classes that the 8-bit target key byte can represent.

    M = \begin{pmatrix} sbox(P_1 \oplus K_1)^{b_1} & \cdots & sbox(P_1 \oplus K_1)^{b_t} \\ \vdots & \ddots & \vdots \\ sbox(P_N \oplus K_N)^{b_1} & \cdots & sbox(P_N \oplus K_N)^{b_t} \end{pmatrix}    (5)

    \beta = (M^T M)^{-1} M^T Tr    (6)

The goodness-of-fit measure is a scalar in [0, 1] representing how close a measurement value from Tr is to our model M \cdot \beta:

    R^2 = 1 - \frac{\sum (Tr - M\beta)^2}{\sum (Tr_i - \overline{Tr})^2}    (7)

- MLP: The MLP used has a total of 6 layers. The first input layer consists of 700 nodes; this is due to the Point of Interest interval of [45400, 46100]. This input layer represents the power intensities for that given time range. The next four layers are hidden layers of 200 nodes each. The final output layer is a 256-node layer, representing the MLE prediction of the byte (i.e., one class per possible key-byte value). A categorical cross entropy loss function was used, because it weights the true "class" (our byte) with a value of 1.0, and all other classes with 0.0. This is useful because cryptography has to have a deterministic and exact key-byte value; we would like to weight non-true answers as low as possible. The activation function for the four hidden layers was ReLU, with a softmax activation for the final layer. The optimizer was RMSProp; no specific reason why this was chosen. The number of nodes and hidden layers seems to have a more discernible impact on performance for these side-channel attacks, as discussed in []. A great deal of time was instead devoted to comparing the MLP to other older/traditional methods of inference. The results are shown.

6 Experiments/Results/Discussion

First-order side channel attacks are those that exploit differences in means. A Linear Regression Analysis (LRA) and a Correlation Power Analysis (CPA) are examples of first-order attacks. An attack implementation of the leakage functions snr4, snr2, and snr1 is provided.
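A minimal sketch of such a first-order CPA attack, following Equation (4): for every key guess, correlate the traces against the Hamming Weight hypothesis and take the guess with the largest magnitude of correlation. This is a hedged simulation, not the paper's implementation — the S-box is a stand-in random permutation, and the key byte, trace count, and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in S-box (any fixed public permutation); a real attack uses Rijndael's.
SBOX = rng.permutation(256)
HW = np.array([bin(v).count("1") for v in range(256)])

TRUE_KEY = 0xC3                      # hypothetical key byte
n = 5000
pts = rng.integers(0, 256, size=n)
# Simulated single-sample traces leaking HW(sbox(p XOR k)) plus noise
traces = HW[SBOX[pts ^ TRUE_KEY]] + rng.normal(0.0, 1.0, size=n)

def cpa(pts, traces):
    """Pearson correlation between traces and HW(sbox(p XOR k)), per guess."""
    rho = np.empty(256)
    for guess in range(256):
        hypothesis = HW[SBOX[pts ^ guess]]
        rho[guess] = np.corrcoef(traces, hypothesis)[0, 1]
    return rho

rho = cpa(pts, traces)
print(hex(int(np.abs(rho).argmax())))  # MLE key byte
```

The 256 correlation values correspond to the overlay of key-byte guesses in the graphs; the correct key produces the single dominant peak.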
While no graphs were provided in the ASCAD paper to reference, the qualitative descriptions coincide with the results I achieved; mainly, that in terms of cryptographic security, snr1 is stronger than snr2, which is stronger than snr4. In these graphs, the grey represents an overlay of the estimated attack traces (the 256 key byte guesses). The green, purple, or blue represents the MLE key byte. The red represents the correct key byte if the MLE key was guessed incorrectly. It should not be surprising that the attacks on snr1 do not converge; it was deemed a cryptographically secure implementation on the 8-bit microcontroller with no first-order leakage. You can easily see that there are no big spikes like there are in snr2 or snr4, and the blue stays consistently within the gray attack traces, indicating that its behavior for correct and incorrect key values is consistent.

The MLP result is more telling. On snr4 it quickly converges towards the solution in one iteration. Even more surprising, the previously stronger snr1 leakage function was able to be broken and the key correctly guessed.
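For reference, the MLP architecture from Methods (700-node input window, four 200-node ReLU hidden layers, 256-way softmax over key-byte classes) can be sketched in NumPy. The paper trained with Keras and RMSProp; this forward pass with an untrained random initialization is only an architectural illustration, and the batch of traces is synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)

# 700 input power samples -> four hidden layers of 200 ReLU units ->
# 256-way softmax over key-byte classes, as described in Methods.
sizes = [700, 200, 200, 200, 200, 256]
params = [(rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)),
           np.zeros(fan_out))
          for fan_in, fan_out in zip(sizes, sizes[1:])]

def forward(x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)                 # ReLU on the hidden layers
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def categorical_cross_entropy(probs, true_bytes):
    # One-hot target: weight 1.0 on the true key-byte class, 0.0 elsewhere
    return -np.log(probs[np.arange(len(probs)), true_bytes] + 1e-12)

batch = rng.normal(size=(8, 700))   # 8 synthetic traces from the PoI window
probs = forward(batch)
loss = categorical_cross_entropy(probs, rng.integers(0, 256, size=8))
print(probs.shape, float(loss.mean()))
```

The argmax over the 256 output probabilities plays the same role as the MLE key-byte estimate in the CPA and LRA attacks, which is what makes the methods directly comparable.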

Figure 2: snr4

Figure 3: snr4

Figure 4: snr4

7 Conclusion/Future Work

Unfortunately, I ran out of time to do more work on the CNN. This project was quite a bit of work, delving into the details of both cryptography and machine learning! It was a good learning experience, but the scope of this project is far too large for a single course.

References

[1] Szefer, J. (2018) Principles of Secure Processor Architecture Design. In Martonosi, M. & Hill, M.D. (eds.) Synthesis Lectures on Computer Architecture. pp. 10.

[2] Lee, R.B., Kwan, P., McGregor, J.P., Dwoskin, J. & Wang, Z. (2013) Security Basics for Computer Architects. Synthesis Lectures on Computer Architecture. 8(4) pp. 1-111.

[3] Lipp, M., Schwarz, M., Gruss, D., Prescher, T., Haas, W., Mangard, S., Kocher, P., Genkin, D., Yarom, Y. & Hamburg, M. (2018) Meltdown: Reading Kernel Memory from User Space. 27th USENIX Security Symposium, Baltimore, MD, USA, August 15-17, 2018.

[4] Kocher, P., Horn, J., Fogh, A., Genkin, D., Gruss, D., Haas, W., Hamburg, M., Lipp, M., Mangard, S., Prescher, T., Schwarz, M. & Yarom, Y. (2019) Spectre Attacks: Exploiting Speculative Execution. 40th IEEE Symposium on Security and Privacy (S&P'19).

[5] Szefer, J. (2018) Principles of Secure Processor Architecture Design. In Martonosi, M. & Hill, M.D. (eds.) Synthesis Lectures on Computer Architecture. pp. 27-29.

[6] Department of Homeland Security, National Cyber Awareness System. (2018) Alert (TA18-004A) Meltdown and Spectre Side-Channel Vulnerability Guidance. Original Release: January 4, 2018. Revised May 01, 2018. Retrieved April 25, 2020.

[7] Graz University of Technology. (2018) Meltdown and Spectre: Vulnerabilities in modern computers leak passwords and sensitive data. https://meltdownattack.com/faq-advisory Retrieved April 25, 2020.

[8] Intel. (2018) Advancing Security at the Silicon Level. curity-silicon-level/#gs.4hdzye Original Release: March 15, 2018. Retrieved April 25, 2020.

[9] Microsoft. (2018) ADV180002 Guidance to mitigate speculative execution side-channel vulnerabilities.
uidance/advisory/ADV180002 Original Release: January 3, 2018. Revised June 14, 2019. Retrieved April 25, 2020.

[10] The Linux Kernel. (2020) Hardware vulnerabilities, Spectre Side Channels. /hw-vuln/spectre.html Retrieved April 25, 2020.

[11] Bursztein, E. & Picod, J-M. (2019) A Hacker Guide to Deep ide-channel-attacks/ Retrieved April 25, 2020.

[12] Benadjila, R., Prouff, E., Strullu, R., Cagli, E. & Dumas, C. (2019) Deep learning for side-channel analysis and introduction to ASCAD database. Journal of Cryptographic Engineering.

[13] Cagli, E. (2018) Feature Extraction for Side-Channel Attacks. Cryptography and Security [cs.CR]. pp. 37-39. Sorbonne Université.

[14] Bartkewitz, T. & Lemke-Rust, K. (2013) Efficient Template Attacks Based on Probabilistic Multi-class Support Vector Machines. In Mangard, S. (eds.), Smart Card Research and Advanced Applications - CARDIS, volume 7771 of Lecture Notes in Computer Science. pp. 263-276. Springer Berlin Heidelberg.

[15] Heuser, A. & Zohner, M. (2012) Intelligent machine homicide - breaking cryptographic devices using support vector machines. In Schindler, W. & Huss, S. A. (eds.) Constructive Side-Channel Analysis and Secure Design - Third International Workshop, COSADE 2012, Darmstadt, Germany, May 3-4, 2012. Proceedings, volume 7275 of Lecture Notes in Computer Science. pp. 249-264. Springer.

[16] Hospodar, G., Gierlichs, B., De Mulder, E., Verbauwhede, I. & Vandewalle, J. (2011) Machine learning in side-channel analysis: a first study. J. Cryptographic Engineering. 1(4) pp. 293-302.

[17] Lerman, L., Bontempi, G. & Markowitch, O. (2014) Power analysis attack: an approach based on machine learning. International Journal of Advanced Computer Technology. 32 pp. 97-115.

[18] Lerman, L., Poussier, R., Bontempi, G., Markowitch, O. & Standaert, F-X. (2015) Template attacks vs. machine learning revisited (and the curse of dimensionality in side-channel analysis). In Mangard, S. (eds.),

Constructive Side-Channel Analysis and Secure Design - 6th International Workshop, COSADE 2015, Berlin, Germany, April 13-14, 2015. Revised Selected Papers, volume 9064 of Lecture Notes in Computer Science. pp. 20-33. Springer.

[19] Martinasek, Z., Dzurenda, P. & Malina, L. (2016) Profiling power analysis attack based on MLP in DPA contest V4.2. In 39th International Conference on Telecommunications and Signal Processing, TSP 2016, Vienna, Austria, June 27-29, 2016. pp. 223-226. IEEE, 2016.

[20] Martinasek, Z., Hajny, J. & Malina, L. (2015) Optimization of power analysis using neural network. In Francillon and Rohatgi, pp. 94-107.

[21] Martinasek, Z., Malina, L. & Trasy, K. (2015) Profiling Power Analysis Attack Based on Multi-layer Perceptron Network. Computational Problems in Science and Engineering, 343.

[22] Cagli, E., Dumas, C. & Prouff, E. (2017) Convolutional neural networks with data augmentation against jitter-based countermeasures - profiling attacks without pre-processing. In Fischer, W. & Homma, N. (eds.) Cryptographic Hardware and Embedded Systems - CHES 2017 - 19th International Conference, Taipei, Taiwan, September 25-28, 2017, Proceedings, volume 10529 of Lecture Notes in Computer Science. pp. 45-68. Springer.

[23] Maghrebi, H., Portigliatti, T. & Prouff, E. (2016) Breaking cryptographic implementations using deep learning techniques. In Carlet, C. M., Hasan, A. & Saraswat, V. (eds.) Security, Privacy, and Applied Cryptography Engineering - 6th International Conference, SPACE 2016, Hyderabad, India, December 14-18, 2016, Proceedings, volume 10076 of Lecture Notes in Computer Science. pp. 3-26. Springer.

[24] Standaert, F-X., Koeune, F. & Schindler, W. (2009) How to Compare Profiled Side-Channel Attacks? In Abdalla, M., Pointcheval, D., Fouque, P-A. & Vergnaud, D. (eds.) Applied Cryptography and Network Security. ACNS 2009. Lecture Notes in Computer Science, vol 5536. pp. 485-498. Springer, Berlin, Heidelberg.

[25] Oswald, E., Mangard, S., Pramstaller, N. & Rijmen, V.
(2005) A Side-Channel Analysis Resistant Description of the AES S-Box. In: Gilbert, H. & Handschuh, H. (eds) Fast Software Encryption. FSE 2005. Lecture Notes in Computer Science, vol 3557. Springer, Berlin, Heidelberg.

[26] Brier, E., Clavier, C. & Olivier, F. (2004) Correlation Power Analysis with a Leakage Model. In: Joye, M. & Quisquater, J-J. (eds) Cryptographic Hardware and Embedded Systems - CHES 2004. CHES 2004. Lecture Notes in Computer Science, vol 3156. Springer, Berlin, Heidelberg.

[27] Fan, G., Zhou, Y., Zhang, H. & Feng, D. (2015) How to Choose Interesting Points for Template Attacks More Effectively? In: Yung, M., Zhu, L. & Yang, Y. (eds) Trusted Systems. INTRUST 2014. Lecture Notes in Computer Science, vol 9473. Springer, Cham.

[28] Schindler, W., Lemke, K. & Paar, C. (2005) A Stochastic Model for Differential Side Channel Cryptanalysis. In: Rao, J.R. & Sunar, B. (eds) Cryptographic Hardware and Embedded Systems - CHES 2005. CHES 2005. Lecture Notes in Computer Science, vol 3659. Springer, Berlin, Heidelberg.

[29] Kocher, P., Jaffe, J. & Jun, B. (1999) Differential Power Analysis. In: Wiener, M. (eds) Advances in Cryptology - CRYPTO '99. CRYPTO 1999. Lecture Notes in Computer Science, vol 1666. Springer, Berlin, Heidelberg.
