Applying NLP Techniques To Malware Detection In A .

Transcription

International Journal of Information REGULAR CONTRIBUTIONApplying NLP techniques to malware detection in a practicalenvironmentMamoru Mimura1· Ryo Ito2 The Author(s) 2021AbstractExecutable files still remain popular to compromise the endpoint computers. These executable files are often obfuscatedto avoid anti-virus programs. To examine all suspicious files from the Internet, dynamic analysis requires too much time.Therefore, a fast filtering method is required. With the recent development of natural language processing (NLP) techniques,printable strings became more effective to detect malware. The combination of the printable strings and NLP techniques canbe used as a filtering method. In this paper, we apply NLP techniques to malware detection. This paper reveals that printablestrings with NLP techniques are effective for detecting malware in a practical environment. Our dataset consists of morethan 500,000 samples obtained from multiple sources. Our experimental results demonstrate that our method is effective tonot only subspecies of the existing malware, but also new malware. Our method is effective against packed malware andanti-debugging techniques.Keywords Malware · Machine learning · Natural language processing1 IntroductionTargeted attacks are one of the serious threats through theInternet. The standard payload such as a traditional executable file has still been remained popular. According toa report, executable file is the second of the top maliciousemail attachment types [49]. Attackers often use obfuscatedmalware to evade anti-virus programs. Pattern matchingbased detection methods such as anti-virus programs barelydetect new malware [28]. To detect new malware, automateddynamic analysis systems and sandboxes are effective. Theidea is to force any suspicious binary to execute in sandboxes,and if their behaviors are malicious, then the file is classifiedas malware. While dynamic analysis is a powerful method,it requires too much time to examine all suspicious filesfrom the Internet. Furthermore, it requires high-performanceservers and their license including commercial OS and applications. Therefore, a fast filtering method is required.BIn order to achieve this, static detection methods withmachine learning techniques can be applicable. These methods extract features from the malware binary and PortableExecutable (PE) header. While the printable strings are oftenanalyzed, they were not a decisive element for detection.With the recent development of natural language processing(NLP) techniques, the printable strings became more effective to detect malware [8]. Therefore, the combination of theprintable strings and NLP techniques can be used as a filtering method.In this paper, we apply NLP techniques to malware detection. This paper reveals that printable strings with NLPtechniques are effective for detecting malware in a practical environment. Our time series dataset consists of morethan 500,000 samples obtained from multiple sources. Ourexperimental result demonstrates that our method can detectnew malware. Furthermore, our method is effective againstpacked malware and anti-debugging techniques. This paperproduces the following contributions.Mamoru Mimuramim@nda.ac.jp1National Defense Academy 1-10-20 Hashirimizu, Yokosuka,Kanagawa, Japan2Japan Ground Self-Defense Force 5-1 Honmura-cho,Ichigaya, Shinjuku-ku, Tokyo, Japan– Printable strings with NLP techniques are effective fordetecting malware in a practical environment.– Our method is effective to not only subspecies of theexisting malware, but also new malware.123

M. Mimura, R. Ito– Our method is effective against packed malware and antidebugging techniques.The structure of this paper is as follows. Section 2describes related studies. Section 3 provides the natural language processing techniques related to this study. Section 4describes our NLP-based detection model. Section 5 evaluates our model with the dataset. Section 6 discusses theperformance and research ethics. Finally, Section 7 concludes this study.2 Related work2.1 Static analysisMalware detection methods are categorized into dynamicanalysis and static analysis. Our detection model is categorized into static analysis. Hence, this section focuses on thefeatures used in static analysis.One of the most common features is byte n-gram featuresextracted from malware binary [47]. Abou-Assaleh et al. usedthe frequent n-grams to generate signatures from maliciousand benign samples [1]. Kolter et al. used information gainto extract 500 n-grams features [12,13]. Zhang et al. alsoused information gain to select important n-gram features[55]. Henchiri et al. conducted an exhaustive feature searchon a set of malware samples and strived to obviate overfitting [6]. Jacob et al. used bigram distributions to detectsimilar malware [9]. Raff et al. applied neural networks toraw bytes without explicit feature construction [41]. Similarapproaches are extracting features from the opcode n-grams[3,10,35,57] or program disassembly [7,14,18,44,50].While many studies focus on the body of malware, severalstudies focus on the PE headers. Shafiq et al. proposed aframework that automatically extracts 189 features from PEheaders [48]. Perdisci et al. used PE headers to distinguishbetween packed and non-packed executables [38]. Eloviciet al. used byte n-grams and PE headers to detect maliciouscode [4]. Webster et al. demonstrated how the contents ofthe PE files can help to detect different versions of malware[53]. Saxe et al. used a histogram of byte entropy values,DLL imports, and numerical PE fields with neural networks[45]. Li et al. extracted top of features from PE headers andsections with a recurrent neural network (RNN) model [17].Raff et al. used raw byte sequences obtained from PE headerswith a Long Short-Term Memory (LSTM) network [40].Other features are also used to detect malware. Schultz etal. used n-grams, printable strings, and DLL imports withmachine learning techniques for malware detection [46].Masud et al. used byte n-grams, assembly instructions, andDLL function calls [20]. Ye et al. used interpretable stringssuch as API execution calls and important semantic strings123[54]. Lee et al. focused on the similarity between two filesto identify and classify malware [16]. The similarity is calculated from the extracted printable strings. Mastjik et al.analyzed string matching methods to identify the same malware family [19]. Their method used 3 pattern matchingalgorithms, Jaro, Lowest Common Subsequence (LCS), andn-grams. Kolosnjaji et al. proposed a method to classifymalware with neural network that consists of convolutionaland feedforward neural constructs [11]. Their model extractsfeature from the n-grams of instructions, and the headersof executable files. Aghakhani et al. studied how machinelearning-based on static analysis features operates on packedsamples [2]. They used a balanced dataset with 37,269 benignsamples and 44,602 malicious samples.Thus, several studies used the printable strings as features. However, the printable strings are not used as themain method for detection. The combination of the printablestrings and NLP techniques are not evaluated in a practicalenvironment. This paper pursues the possibility of the printable strings as a filtering method.2.2 NLP-based detectionOur detection model uses some NLP techniques. This sectionfocuses on the NLP-based detection methods.Moskovitch et al. used some NLP techniques such as TermFrequency-Inverse Document Frequency (TF-IDF) to represent byte n-gram features [35]. Nagano et al. proposed amethod to detect malware with Paragraph Vector [36]. Theirmethod extracts the features from the DLL Import name,assembly code, and hex dump. A similar approach is classifying malware from API sequences with TF-IDF and ParagraphVector [51]. This method requires dynamic analysis to extractAPI sequences. Thus, the printable strings are not used as themain method for detection.NLP-based detection was applied to detect malicious traffic and other contents. Paragraph Vector was applied toextract the features of proxy logs [30,32]. This method wasextended to analyze network packets [22,31]. To mitigateclass imbalance problems, the lexical features are adjustedby extracting frequent words [23,33]. Our method uses thistechnique mitigate class imbalance problems. Some methodsuse a Doc2vec model to detect malicious JavaScript code[29,37,39]. Other methods use NLP techniques to detectmemory corruptions [52] or malicious VBA macros [24–27,34].3 NLP techniquesThis section describes some NLP techniques related to thisstudy. The main topic of NLP techniques is to enable computers to process and analyze large amounts of natural

Applying NLP techniques to malware.language data. The documents written in natural languageare separated into words to apply NLP techniques such asBag-of-Words. The corpus of words is converted into vectors which can be processed by computers.3.1 Bag-of-wordsBag-of-Words (BoW) [43] is a fundamental method of document classification where the frequency of each word is usedas a feature for training a classifier. This model converts eachword in a document into vectors based on the frequency ofeach word. Let d, w, and n be expressed as a document, word(wi 1,2,3,. ), and a frequency of w, respectively. The document d can be defined by Eq. (1). For this Eq. (1), next (2)locks the position of n on d, and omitted the word w. This dˆj(d̂ j 1,2,3,. ) is shown as a vector (document-word matrix).In Eq. (3), let construct the other documents to record theterm frequencies of all the distinct words (other documentsordered as in Eq. (2)).d [(w1 , n w1 ), (w2 , n w2 ), (w3 , n w3 ), ., (wi , n wi )]d̂j (n w1 , n w2 , n w3 , ., n wi ) n w1,1 . . .n w1,i . D .u i,1 . . .u i,r v1,1 . . .v1, j0. . . . . . . . . 0 . . .σr ,rvr ,1 . . .vr , j(3)3.3 Paragraph vector3.2 Latent semantic indexingLatent Semantic Indexing (LSI) analyzes the relevancebetween a document group and words included in a document. In the LSI model, the vectors with BoW are reducedby singular value decomposition. Each component of the vectors is weighted. The decomposed matrix shows the relevancebetween the document group and words included in the document. In weighting each component of the vector, TermFrequency-Inverse Document Frequency (TF-IDF) is usually used. D is the total number of documents, {d : d ti }is the total document including word i, f r equencyi, j is theappearance frequency of word i in document j. TF-IDF isdefined by Eq. (4).xi,1 . . .xi, j σ1,1 . . .u 1,1 . . .u 1,r . . . . . . . . . . . . (2)Thus, this model enables to convert documents into vectors. Apparently, this model does not preserve the order ofthe words in the original documents. The dimension of converted vectors attains the number of distinct words in theoriginal documents. To analyze large-scale data, the dimension should be reduced so that can be analyzed in a practicaltime.t f i, j id f i f r equencyi, j x1,1 . . .x1, j X . . . . . U V T In this model, the number of dimension can be determinedarbitrarily. Thus, this model enables to reduce the dimensionso that can be analyzed in a practical time.(1)n w j,1 . . .n w j,i D log{d : d ti }TF-IDF weights the vector to perform singular valuedecomposition. The components x( i, j) of the matrix X showthe TF-IDF value in the document j of the word i. Let X bedecomposed into orthogonal matrices U and V and diagonalmatrix , from the theory of linear algebra. In this singular value decomposition, U is a column orthogonal matrixand linearly independent with respect to the column vector.Therefore, U is the basis of the document vector space. Thematrix X product giving the correlation between words anddocuments is expressed by the following determinant. Generally, this matrix U represents a latent meaning.(4)To represent word meaning or context, Word2vec was created[21]. Word2vec is shallow neural networks which are trainedto reconstruct linguistic contexts of words. Word2vec takesas its input a large corpus of documents and produces a vector space of several hundred dimensions, with each uniqueword in the corpus being assigned a corresponding vector inthe space. Word vectors are positioned in the vector spacesuch that words which share common contexts in the corpus are located in close proximity to one another in thespace. queen king man woman is an example ofoperation using each word vector generated by Word2vec.Word2vec is a method to represent a word with meaning orcontext. Paragraph Vector is the extension of Word2vec torepresent a document [15]. Doc2vec is an implementation ofthe Paragraph Vector. The only change is replacing a wordinto a document ID. Words could have different meanings indifferent contexts. Hence, vectors of two documents whichcontain the same word in two distinct senses need to accountfor this distinction. Thus, this model represents a documentwith word meaning or context.4 NLP-based detection modelThis section produces our detection model based on the previous study [8]. The previous study used an SVM model toclassify samples. The main revised point is adding several123

M. Mimura, R. ItoAlgorithm 1 trainingFig. 1 Structure of the NLP-based detection model 4classifiers. The structure of our detection model is shown inFig. 1.Our detection model consists of language models andclassifiers. One model is selected from these models, respectively. In training phase, the selected language model isconstructed with extracted words from malicious and benignsamples. The constructed language model extracts the lexicalfeatures. The selected classifier is trained with the lexical features and labels. In testing phase, the constructed languagemodel and trained classifier classify unknown executablefiles into malicious or benign ones.4.1 TrainingThe training procedure is shown in Algorithm 1. Our methodextracts all printable (ASCII) strings from malicious andbenign samples and splits the strings into words, respectively. The frequent words are selected from the words,respectively. Thenceforth, the selected language model isconstructed from the selected words. Our method uses aDoc2vec or LSI model to represent the lexical features. TheDoc2vec model is constructed by the corpus of the words.The LSI model is constructed from the TF-IDF scores ofthe words. These words are converted into lexical featureswith the selected language model. Thus, the labeled feature vectors are derived. Thereafter, the selected classifieris trained with the both labeled feature vectors. The classifiers are Support Vector Machine (SVM), Random Forests(RF), XGBoost (XGB), Multi-Layer Perceptron (MLP), andConvolutional Neural Networks (CNN). These classifiers arepopular in the various fields, and have each characteristic.4.2 TestThe test procedure is shown in Algorithm 2. Our methodextracts printable strings from unknown samples, and splitsthe strings into words. The extracted words are converted into1231: /* Extract printable strings */2: for all malware samples m do3:mw extract printable strings from m4: end for5: for all benign samples b do6:bw extract printable strings from b7: end for8: imw select frequent words from m9: ibw select frequent words from b10: if Doc2vec then11:/* Construct a Doc2vec model */12:construct a Doc2vec model from imw, ibw13: else14:/* Construct a LSI model */15:construct a TF-IDF model from imw, ibw16:construct a LSI model from the TF-IDF17: end if18: /* Convert printable strings into vectors */19: for all malware samples mw do20:if Doc2vec then21:mv Doc2vec(mw)22:else23:mv LSI(mw)24:end if25: end for26: for all benign samples bw do27:if Doc2vec then28:bv Doc2vec(bw)29:else30:bv LSI(bw)31:end if32: end for33: /* Train classifiers with the labeled vectors */34: if SVM then35:Train SVM(mv, bv)36: else if RF then37:Train RF(mv, bv)38: else if XGB then39:Train XGB(mv, bv)40: else if MLP then41:Train MLP(mv, bv)42: else43:Train CNN(mv, bv)44: end if45: returnlexical features with the selected language model, which wasconstructed in training phase. The trained classifier examinesthe feature vectors and predicts the labels.4.3 ImplementationOur detection model is implemented in Python 2.7. Gensim[42] provides the LSI and Doc2vec models. Scikit-learn 1provides the SVM and RF classifiers. The XGB is providedas a Python package 2 . The MLP and CNN are oost.readthedocs.io/.

Applying NLP techniques to malware.Algorithm 2 test1: /* Extract printable strings */2: for all unknown samples u do3:uw extract printable strings from u4: end for5: /* Convert printable strings into vectors */6: for all unknown samples uw do7:if Doc2vec then8:uv Doc2vec(uw)9:else10:uv LSI(uw)11:end if12: end for13: /* Predict labels with the trained classifiers */14: for all unknown vectors uv1 do15:if SVM then16:label SVM(uv)17:else if RF then18:label RF(uv)19:else if XGB then20:label XGB(uv)21:else if MLP then22:label MLP(uv)23:else24:label CNN(uv)25:end if26: end for27: returnwith chainer 3 . The parameters will be optimized in the nextsection.5 Evaluation5.1 DatasetTo evaluate our detection model, hundred thousands of PEsamples were obtained from multiple sources. One is theFFRI dataset, which is a part of MWS datasets [5]. Thisdataset contains logs collected from the dynamic malwareanalysis system Cuckoo sandbox 4 and a static analyzer. Thisdataset is written in JSON format, and categorized into 2013to 2019 based on the obtained year. Each data except 2018contains printable strings extracted from malware samples.These data can be used as malicious samples (FFRI 2013–2017, 2019). Note that this dataset does not contain malwaresamples themselves. Printable strings extracted from benignsamples are contained in 2019’s as Cleanware. These benigndata do not contain the time stamps. Hence, we randomly categorized these data into 3 groups (Clean A, B, and C). Othersamples are obtained from Hybrid Analysis (HA) 5 , which isa popular malware distribution site. Almost ten thousand ofTable 1 The number of each data, unique words, and family nameClassDatasetClean A10,00039,390,887–benignClean B40,000122,217,004–(Cleanware)Clean C200,000269,112,345–FFRI 20132,6371,317,912621FFRI 20143,0005,193,423671FFRI 20153,0003,224,583474FFRI 20168,24311,443,274629FFRI 20176,2511,534,580394FFRI 2019250,000320,177,7233237HA 20165,78713,352,4231,063HA 20172,8348,944,685181HA 20185,62315,584,544243MaliciousFileUnique wordsFamilysamples are obtained with our web crawler. These samplesare posted into the VirusTotal 6 , and are identified by multiple anti-virus programs. Thereafter, the identified samples arecategorized into 2016 to 2018 based on the first year definedby Microsoft defender. The printable strings are extractedfrom these malware samples (HA 2016–2018). To extractprintable strings from these samples, we use the strings command on Linux. This command provides each string on oneline. Our method uses these strings as word. Our extractionmethod is identical to the FFRI dataset 7 . Thus, our datasetis constructed with the multiple sources.Table 1 shows the number of each data, unique words, andfamily name. Tables 2 and 3 show the top family names.The unique words indicate the number of distinct wordsextracted from the whole dataset. In the NLP-based detectionmethod, the number of unique words is important. Becausethe classification difficulty and complexity mainly dependon the number. The family indicates the number of distinctmalware family defined by Microsoft defender. In benignsamples, each dataset contains huge number of unique words.This means these benign samples are well distributed and notbiased. In malicious samples, each dataset contains enoughnumber of unique words and malware families. This suggeststhey contain not only subspecies of the existing malware, butalso new malware. Hence, these samples are well distributedand not biased.5.2 MetricsSeveral metrics are used to evaluate our detection model.These metrics are based on the confusion matrix shown inTable t-scripts.123

M. Mimura, R. ItoTable 2 Top 5 family names in the FFRI datasetTable 3 Top 10 family names in the HA datasetDatasetDatasetFFRI 2013FFRI 2014FFRI 2015FFRI 2016FFRI 2017FFRI 2019No.Family nameNo.Family Ogimant2HA tHA 2017HA bitIn this experiment, Accuracy (A), Precision (P), Recall(R), and F1 score (F) are mainly used. These metrics aredefined as Eqs. (5)–(8).Accuracy Pr ecision Recall F1 TP TNT P FP FN T NTPT P FPTPT P FN2Recall Pr ecision.Recall Pr rue Positive (TP)False Positive (FP)NegativeFalse Negative (FN)True Negative (TN)(6)(7)(8)In this experiment, TP indicates predicting malicious samples correctly. Since our detection model performs binaryclassification, Receiver Operating Characteristics (ROC)curve and Area under the ROC Curve (AUC) are used. An123Table 4 Confusion matrixROC curve is a graph showing the performance of a classification model at all classification thresholds. AUC measuresthe entire two-dimensional area underneath the entire ROCcurve.5.3 Parameter optimizationTo optimize the parameters of our detection model, the CleanA–B and FFRI 2013–2016 are used. The Clean A and FFRI

Applying NLP techniques to malware.Table 6 The optimized parameters in each classifierClassifierParameterOptimum valuekernelrbfrbfC100100gamma0.010.1n estimators393442n jobs1432n estimators998911max depth1135Doc2vecSVMRFXGBLSImin child weight11subsample0.50.8colsample zerAdamAdamepoch4040Fig. 2 The F1 score for each modelTable 5 The optimized parameters in each language modelLanguage modelDoc2vecLSIParameterOptimum valueDimension400alpha0.075min count2window1epoch20Dimension800Fig. 3 The result of tenfold cross-validation2013–2015 are used as the training samples. The remainderClean B and FFRI 2016 are used as the test samples.First, the number of unique words is optimized. To construct a language model, our detection model selects frequentwords from each class. The same number of frequent wordsfrom each class are selected. This process adjusts the lexicalfeatures and enables to mitigate class imbalance problems[23]. The F1 score for each model is shown in Fig. 2.In this figure, the vertical axis represents the F1 score,and the horizontal axis represents the total number of uniquewords. In the Doc2vec model, the most optimum numberof the unique words is 500. In the LSI model, the F1 scoregradually rises and achieves the maximum value at 9000.Thereafter, the other parameters are optimized by gridsearch. Grid search is a search technique that has been widelyused in many machine learning researches. The optimizedparameters are shown in Tables 5 and 6.Thus, our detection model uses these parameters in theremaining experiments.5.4 Cross-validationTo evaluate the generalization performance, tenfold crossvalidation is performed on the Clean A and FFRI 2013–2015.Figure 3 shows the result.The vertical axis represents the Accuracy (A), Precision(P), Recall (R), or F1 score (F). Overall, each metric performed good accuracy. The LSI model was more effectivethan the Doc2vec model. Thus, the generalization performance of our detection model was almost perfect.5.5 Time series analysisTo evaluate the detection rate (recall) for new malware,the time series is important. The purpose of our method isdetecting unknown malware. In practical use, the test samples should not contain the earlier samples. To address thisproblem, we consider the time series of samples. In this123

M. Mimura, R. ItoFig. 4 The result of Doc2vec in the time series analysisFig. 6 The ROC curve of Doc2vec in the time series analysisFig. 5 The result of LSI in the time series analysisexperiment, the Clean A and FFRI 2013–2015 are used asthe training samples. The Clean B, FFRI 2016–2017, andHA 2016–2018 are used as the test samples. As describedin Table 1, the training samples are selected from the earlierones. Moreover, the benign samples account for the majorityof the test samples. This means the test samples are imbalanced, which represent more practical environment. Thus,the combination of each dataset is more challenging thancross-validation. The results of the time series analysis areshown in Figs. 4 and 5.The vertical axis represents the recall. Overall, the recallgradually decreases as time proceeds. The detection rates inthe FFRI dataset are better than the ones in the HA dataset.This seems to be because the training samples were obtainedfrom the same source. Nonetheless, the recall in HA maintains almost 0.9. Note that these samples were identified byVirusTotal and categorized based on the first defined year.The LSI model was more effective than the Doc2vec model.In regard to classifiers, the SVM and MLP performed goodaccuracy. Thus, our detection model is effective against newmalware. Furthermore, the same technique [23] mitigatesclass imbalance problems in executable files.To visualize the relationship between sensitivity and specificity, the ROC curves of each model with FFRI 2016 aredepicted in Figs. 6 and 7.123Fig. 7 The ROC curve of LSI in the time series analysisThe vertical axis represents the true positive rate (recall),and the horizontal axis represents the false positive rate. Ourdetection model maintains the practical performances with alow false positive rate. As we expected, the LSI model wasmore effective than the Doc2vec model. The best AUC scoreachieves 0.992 with the LSI and SVM model.The required time for training and test of FFRI 2016 isshown in Table 7.This experiment was conducted on the computer withWindows 10, Core i7-5820K 3.3GHz CPU, 32GB DDR4memory, and Serial ATA 3 HDD. In regard to training time,it seems to depend on the classifier. Complicated classifierssuch as CNN require more time for training. The test timemaintains flat regardless of the classifier. The time to classifya single file is almost 0.1s. This speed seems to be enough toexamine all suspicious files from the Internet.

Applying NLP techniques to malware.Table 7 The required time for training and test of FFRI NN27:59RF7:02XGBTable 8 Detection rate (recall) of the known and unknown malware(LSI and SVM)23:40test15:1815:1115:1915:1115:11Knowntest 15:2015:20test (s/file)0.1120.1120.1140.1120.112FFRI 2016FFRI 2017HA 921,9500.9893,3360.9642,9440.9676 Discussion6.1 Detecting new malwareFig. 8 The average result of the practical experimentIn the time series analysis, our detection model is effective to new malware on the imbalanced dataset. The newmalware samples are categorized into known malware andunknown malware. In this study, we assume that knownmalware samples are ones previously defined by Microsoftdefender. These samples are new but just subspecies of theexisting malware. We also assume that unknown malwaresamples are ones not defined by Microsoft defender at thattime. In this experiment, our detection model was trained bythe samples before 2015. Hence, the newly defined samplesafter 2016 are assumed as new malware. The detection rateof the known and unknown malware is shown in Table 8.The detection rate of unknown malware is on the samelevel with known malware. Thus, our detection model iseffective to not only subspecies of the existing malware, butalso new malware.6.2 Defeating packed malware and anti-debuggingtechniques5.6 Practical performanceIn practical environment, actual samples are more distributed. Hence, the experimental samples might not represent the population appropriately. To mitigate this problem,a more large-scale dataset has to be used. Moreover, thetraining samples should be smaller. To represent actual sample distribution, the FFRI 2019 and Clean A–C are used.They contain 500,000 samples. These samples are randomlydivided into 10 groups. One of the groups is used as the training samples. The rest 9 groups are used as the test samples.The training

printable strings became more effective to detect malware. The combination of the printable strings and NLP techniques can be used as a filtering method. In this paper, we apply NLP techniques to malware detection. This paper reveals that printable strings with NLP techniques are effective for detecting malw