A Similarity Based Technique for Detecting Malicious Executable Files


A Similarity based Technique for Detecting Malicious Executable Files for Computer Forensics

Jun-Hyung Park (1), Minsoo Kim (2), Bong-Nam Noh (3), James B. D. Joshi (1)

(1) School of Information Science, University of Pittsburgh, USA ({jpark, jjoshi}@mail.sis.pitt.edu)
(2) Division of Information Engineering, Mokpo National University, Korea (phoenix@mokpo.ac.kr)
(3) Division of Electronics, Computer & Information Engineering, Chonnam National University, Korea (bbong@jnu.ac.kr)

Abstract

With the rapidly increasing complexity of computer systems and the sophistication of hacking tools and techniques, there is a crucial need for computer forensic analysis techniques. Very few techniques exist to support forensic analysis of unknown executable files. The existing techniques primarily inspect executable files to detect known signatures or are based on metadata information. A key goal of such forensic investigation is to identify malicious executable files that hackers might have installed in a targeted system. Finding such malware in a compromised system is difficult because it is hard to identify the purpose of the fragments of executable files. In this paper, we present a similarity based technique that analyzes targeted executable files to identify a malware present in a compromised system. The technique involves assigning a similarity value to the fragments of executable files present in a compromised hard disk against a set of source files. We present some results based on the comparison of assembly instruction sequences of well-known hacking tools with those of various executable files, and suggest various ways to reduce the false positives.
Keywords: assembly instruction code, malicious program, computer forensics, similarity

1. Introduction

With the rapidly increasing complexity and interconnectedness of emerging information systems, the number of cyber crimes is increasing sharply. While there are significant advancements in security technologies, there is also a similar proliferation of sophisticated hacking tools and techniques. Therefore, the protection of systems, as well as the establishment of evidence-based accountability for malicious actions, poses significant challenges [1]. In addition to protective mechanisms, we also need tools and techniques to analyze and identify cyber crimes and the culprits committing them so that proper evidence-based actions can be taken against the malefactors. To facilitate such evidence-based actions, computer forensics is emerging as a significant tool. While computer forensics emphasizes techniques to identify and trace malicious activities in a system, the knowledge of the existence of such tools itself can act as a deterrent to potential hackers.

In general, forensic investigators use logs and metadata for the reconstruction of cyber crimes [2], for instance, by attempting to recreate the malicious activities through analysis and construction of a time line of activities. However, as it is still in its infancy, the computer forensics area lacks adequate tools and techniques to support sophisticated investigations. In general, we are only able to detect the use of well-known malicious hacking tools through a signature-based technique, which also allows us to check whether normal system programs have been modified [3]. In particular, using existing forensic techniques, we cannot determine which tools the malicious users have used without analyzing the file system metadata. Typically, we cannot identify a malicious program without executing it, even though we may find that suspicious executable files have been installed, particularly if it is not a known tool with a recognizable signature. Skilled hackers typically take extra measures to hide the evidence of their crimes so that forensics analysts cannot detect them. For instance, there are tools which can read and write parts of a malicious executable file to the slack space of a disk.

Existing approaches to forensic investigation primarily focus on using metadata and text information [4, 5, 6, 7]. For example, forensic investigators would draw a time line showing file usage with an i-node table after making a back-up image of the hard drive. The investigator would then typically search for particular words, hidden files, and suspicious file names throughout the entire file system. In general, the investigators also attempt to verify the checksums of a system's instruction files with a CRC algorithm to see if they are exactly the same as the originals, to ensure that a potential hacker has not changed the executable files.
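As a minimal sketch of such an integrity check (ours, not a tool from the paper), the code below recomputes CRC-32 checksums of system binaries and compares them with a previously recorded baseline; the baseline file name and its JSON format are hypothetical, and in practice a cryptographic hash would be preferable to a plain CRC.

import json
import zlib
from pathlib import Path

def crc32_of(path: Path) -> int:
    """Compute the CRC-32 checksum of a file's contents."""
    crc = 0
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

def changed_binaries(baseline_file: str) -> list[str]:
    """Return the paths whose current checksum differs from the recorded baseline.
    The baseline is a hypothetical JSON map such as {"/bin/ls": 123456789, ...}."""
    baseline = json.loads(Path(baseline_file).read_text())
    return [p for p, expected in baseline.items()
            if not Path(p).exists() or crc32_of(Path(p)) != expected]

In a forensic setting the baseline would be built from trusted installation media and the comparison run against the imaged drive rather than the live system.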

In this paper, we present a technique for forensic analysis of compromised systems by analyzing their hard drives. Our technique extracts executable content from the hard drive to identify whether or not it is part of a malicious program. We compare assembly instruction sequences of well-known hacking programs with those of fragments of executable files scattered in the disk, and calculate their similarity values. Using executable content to identify malware has, to the best of our knowledge, not yet been attempted. In particular, the contributions of this paper are as follows:

- We suggest a profiling technique using the assembly instruction sequences in executable files.
- We introduce a notion of similarity between two executable files by aggregating the similarities between sequential blocks of instructions separated by instructions for conditional transfer of execution flow.
- We propose various techniques to compute similarity measures that reduce false positives in the malware identification process.

The paper is organized as follows. In Section 2, we present background on computer forensics. In Section 3, we propose a technique for profiling executable files and calculating the similarity between two files. In Section 4, we present techniques to reduce false positives and present experimental results. Section 5 concludes and discusses future work.

2. Background

The techniques for hacking have been changing in response to the development of security systems. CERT gives an overview of recent trends in attack tools after observing intrusion activities since 1988 [11]. We believe that the most dangerous aspect of the field is the ease of propagation of malicious programs. It is critical to address this, as systems are continually facing types of attacks which have not been seen before, especially with the increasing level of automation of attack tools. Further, with the increasing modularization of attack tools, newer, more powerful attack tools can be easily created.
Unlike early attack tools that implement one type of attack, such tools can now be changed quickly by upgrading or replacing their components. This causes rapidly evolving attacks and, at the extreme, results in polymorphic tools that self-evolve, changing with each active instance of the attack. Anti-forensic techniques pose another serious problem: they can hide the nature of the attack tools, making it more difficult and time consuming for security experts to analyze the tools and to understand responses to rapidly developing threats.

When we conduct forensic analysis, we follow the general process of a computer forensics investigation [8]. Even though the process for forensic analysis consists of many steps and techniques (from recovering deleted files to searching for pattern strings and file fragments), it is clear that investigators will use common techniques to search for signatures of known malicious programs or strings of suspicious fragments on the hard drives [9]. Such techniques, however, can only be used when we know what we want to find. In general, we cannot find malicious programs when they are new, have no known signatures, and only some fragments exist [5].

If forensic investigators cannot trace the activities of malicious users with these common techniques, they have to analyze the data more closely and precisely [10]. There are static and dynamic approaches to detecting malicious programs. Static analysis involves various forms of examination without executing or running the executable files. By executing an executable file during dynamic analysis, using specialized monitoring utilities such as debuggers, we can trace or alter the program. Static analysis can eventually allow us to "know all" about the tool, whereas dynamic analysis may be limited simply by virtue of how the programmer allows the user to interact with the application. However, in some cases, a full static analysis cannot be accomplished without also performing dynamic analysis.

If the forensic investigators are sophisticated enough, and there is not an overwhelming amount of data on the hard drives to investigate using reverse engineering techniques, they might be able to find malicious programs as evidence. However, there is a limit on the human and time resources available for forensics investigations.

3. Similarity Based Detection

In this section, we present the proposed similarity based technique for comparing executable files to detect a malware present in a compromised hard disk. In particular, the technique focuses on analyzing the instruction sequences that can be found in the hard disk to identify the malware. We compare assembly instruction sequences of well-known hacking programs with those of the targeted executable files, and calculate the similarity between them. We note that the experiments presented in this paper have been carried out for IA-32 assembly language on an Intel-based Linux platform.
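The paper does not name the tooling used to obtain these instruction sequences; as one plausible sketch (an assumption, not the authors' setup), an ELF executable can be disassembled with GNU objdump and reduced to its ordered list of mnemonics:

import subprocess

def mnemonic_sequence(path: str) -> list[str]:
    """Disassemble an ELF file with `objdump -d` and return its instruction
    mnemonics in order. Disassembly lines have tab-separated fields:
    address, opcode bytes, instruction text."""
    out = subprocess.run(["objdump", "-d", path],
                         capture_output=True, text=True, check=True).stdout
    mnemonics = []
    for line in out.splitlines():
        parts = line.split("\t")
        if len(parts) >= 3 and parts[2].strip():
            mnemonics.append(parts[2].split()[0])
    return mnemonics

On an IA-32 Linux system, mnemonic_sequence("/bin/ls") would yield a sequence beginning with mnemonics such as 'push' and 'mov'.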

3.1 Executable File Profile

A key aspect of a program execution is the sequential execution of assembly instructions and the flow control. The machine executes one assembly instruction at a time, sequentially, until a transfer of execution flow to another portion of the executable file occurs. The assembly instruction cmp is one comparison condition that is associated with such transfer of execution points; the cmp instruction involves checking for jump conditions. Furthermore, a conditional statement also has an important characteristic similar to the block rule of a programming language, i.e., a block has a start and an end, as indicated by the opening and the closing curly brackets in a C program block. Thus, in general, the set of assembly instructions between two cmp statements is executed sequentially. Table 1 shows an example of a cmp block associated with some high-level program statements.

Table 1. An example of cmp instructions

High-level language          Assembly language
IF (AX = 0 and BX = 0)            CMP  AX, 0
  THEN CX := 10                   JNZ  L1
  ELSE CX := 20                   CMP  BX, 0
                                  JNZ  L1
                                  MOV  CX, 10
                                  JMP  L2
                             L1:  MOV  CX, 20
                             L2:  ...

We use this feature associated with the cmp commands to identify good instruction patterns. In particular, we use the cmp assembly instruction as a basis for dividing the instruction sequence into cmp blocks. Each cmp block (CB) is a sequence of assembly instructions between a pair of consecutive cmp commands. Within each CB we compute the frequency of each instruction. Each CB can thus be expressed as a frequency vector (fv) that contains the frequencies of each instruction; we represent by fv_i the frequency vector of cmp_i. Figure 1 shows n CBs and their associated fvs; for instance, the frequency of the second instruction in block cmp_0 is 3 and that of cmp_2 is 1. The total number of bytes of the instructions in each CB is shown in the last column. A frequency matrix (fm) can thus be constructed to capture the instruction frequencies for a given set of assembly instructions.

[Figure 1. Similarity between profiles of CBs of the source and the target executable files.]

[Table 2. Sample cmp blocks and frequency vectors: per-block frequency of each assembly instruction and length in bytes.]

It is important to note, however, that there exist short CBs which consist of very few instructions; this may happen when conditional statements occur in the program continually. In such a case, the length of the CB is very short and the frequency of assembly instructions is very low.

3.2 Calculating Similarity Between cmp Blocks

After profiling a sequence of assembly instructions of an executable file, we have to find out whether there exist CBs in the target program which are similar to any CBs of the source program (Figure 1). We use the following cosine distance similarity measure to calculate the similarity of two CBs:

Similarity(s, t) = (s · t) / (|s| × |t|)

where s and t are the frequency vectors of CBs of the source and target programs.
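As a minimal sketch of the profiling and block comparison just described (assuming the instructions are already available as a list of mnemonics, and leaving out the per-block byte lengths that the paper's profile also records), the CBs, their frequency vectors, and the cosine measure could be computed as follows:

import math
from collections import Counter

def cmp_blocks(mnemonics: list[str]) -> list[Counter]:
    """Split a mnemonic sequence at cmp instructions and return one frequency
    vector (a Counter of mnemonics) per cmp block (CB). startswith() also
    catches AT&T-syntax forms such as cmpl or cmpb."""
    blocks, current = [], []
    for m in mnemonics:
        if m.startswith("cmp") and current:
            blocks.append(Counter(current))
            current = []
        current.append(m)
    if current:
        blocks.append(Counter(current))
    return blocks

def cosine(fv1: Counter, fv2: Counter) -> float:
    """Cosine similarity between two instruction-frequency vectors."""
    dot = sum(fv1[k] * fv2[k] for k in fv1.keys() & fv2.keys())
    norm = math.sqrt(sum(v * v for v in fv1.values())) * \
           math.sqrt(sum(v * v for v in fv2.values()))
    return dot / norm if norm else 0.0

def similarity_matrix(source: list[Counter], target: list[Counter]) -> list[list[float]]:
    """n x m matrix of CB-to-CB similarities (source rows, target columns)."""
    return [[cosine(s, t) for t in target] for s in source]

Here the Counter objects play the role of the frequency vectors fv_i, and the returned matrix corresponds to the similarity matrix used in the next subsection.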
3.3 Detection of Similar Sections

After we have found all similar blocks, we create a similarity matrix indicating the similarity values for the CBs of the source and the target programs. Next, we have to judge which parts of the target file are similar to those of the source program. Typically, only some blocks can be expected to appear similar. If blocks from the source are detected as similar to blocks in the target but are not actually part of the target program, they should be flagged as false positives and removed.

Note that we detect similar blocks by calculating frequencies of assembly instructions. In addition, truly similar blocks should have similar instruction sequences as well. To address this issue, we later investigate combining a number of contiguous blocks during similarity computation to reduce false positives. In particular, in the experiments we combine three similarity values associated with contiguous CBs in the matrix and check the result against a critical value to compute the similarity measure. Figure 2 shows the algorithm for detecting series of similar CBs from the similarity matrix.

    n : the number of CBs of the source program
    m : the number of CBs of the target program
    SM: an n x m similarity matrix
    FOR row = 1 to n-2
      FOR column = 1 to m-2
        IF SM[row, column] + SM[row+1, column+1] + SM[row+2, column+2] >= Critical value
          THEN STORE SM[row, column], SM[row+1, column+1], SM[row+2, column+2]
        ENDIF
      ENDFOR
    ENDFOR

Figure 2. Algorithm for detecting similar blocks
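A direct Python rendering of the Figure 2 scan, operating on the similarity matrix from the previous sketch (indices are 0-based here, and the critical value is left as a parameter):

def detect_similar_runs(sm: list[list[float]],
                        critical_value: float) -> list[list[tuple[int, int, float]]]:
    """Return runs of three diagonally contiguous CB pairs whose similarity
    values sum to at least the critical value, as in Figure 2."""
    n = len(sm)
    m = len(sm[0]) if sm else 0
    runs = []
    for row in range(n - 2):
        for col in range(m - 2):
            if sm[row][col] + sm[row + 1][col + 1] + sm[row + 2][col + 2] >= critical_value:
                runs.append([(row + i, col + i, sm[row + i][col + i]) for i in range(3)])
    return runs

Requiring three diagonally contiguous blocks to clear the critical value together is the same idea that Section 4.2 later relies on to suppress false positives.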

3.4 Distribution of Similar Fragments

There are four types of possible distributions of similar fragments that we can inspect in the target executable file with respect to the source program:

a) Fully Contained: The source program is contained in the target file. Here, the similar blocks are distributed contiguously along a diagonal in the similarity matrix.

b) Partial Existence: Some parts of the source program exist in the target file. Here, the line of distribution covers the X-axis fully but not the Y-axis.

c) Disconnected, But Contained: The source program is contained in the target file, but it has been modified. If the line of similar fragments in the similarity matrix is not distributed contiguously, we can expect that the source program has been modified for some purpose. Even though the similar fragments are not distributed contiguously, the target may still contain the source program with high probability.

d) Irregularly Ordered, But Contained: There are fragments in the target program that are similar to those in the malicious program, but the sections in the target program are not ordered as the corresponding sections in the malicious program are. In this case, we cannot be sure if the target program contains some parts of the source.

3.5 Similarity Algorithm

We make a unique profile of each assembly instruction sequence so that the sequences are distinguishable from each other. For this work, the key issue is describing these sequences exactly in the frequency vectors. Before calculating the similarity of detected fragments, we assume two conditions for two sequences that match exactly: (i) the frequencies of the elements in the two sequences must be the same, and (ii) the elements of the two sequences must be arranged in the same order. Note that these conditions are independent of each other. We can calculate the similarity by considering both of these factors to make the similarity measure more effective.

[Figure 3. A sample of contraction for detected sections from the target program.]

Figure 3 shows a possible distribution of series of similar CBs from the target file; here the X-axis represents the target program and the Y-axis represents the source program. We want to detect sections in the target program that are similar to some parts of the source program. Hence, we assume that one CB on the X-axis can be matched to just one CB on the Y-axis. For example, the fourth block of series a) and the first block of series b) overlap on the X-axis in Figure 3. This means that one block of the target program is similar to two different blocks of the source program. In this case we use the real length of the CBs in the profile, without considering the number of blocks: if the length of series b) is longer than that of series a), then series b) is the one detected. By iterating this kind of contraction, we can detect several series of similar blocks that can now be regarded as similar fragments.
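One reading of this contraction rule is sketched below, under the assumption that each detected series is represented as a list of (target CB index, CB length in bytes) pairs; the representation and the greedy longest-first pass are ours, not code from the paper.

def contract(series_list: list[list[tuple[int, int]]]) -> list[list[tuple[int, int]]]:
    """Resolve overlaps on the target axis by keeping, among conflicting series,
    the one whose CBs cover more real bytes; shorter overlapping series are dropped."""
    def covered_bytes(series):
        return sum(length for _, length in series)

    result, claimed = [], set()
    for series in sorted(series_list, key=covered_bytes, reverse=True):
        indices = {idx for idx, _ in series}
        if indices & claimed:   # overlaps a longer series that was already kept
            continue
        claimed |= indices
        result.append(series)
    return result

The surviving series are the similar fragments that the next subsection aggregates into a single score.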
3.5.1 Similarity from Detected Fragments

In this subsection, we explain how to calculate the similarity between two executable files from the detected similar fragments. If we detect n similar fragments in the target program, then there exist n corresponding fragments in the source program, even though some of them may be duplicates. We can describe this as:

S = {c_1, c_2, c_3, ..., c_n},   T = {c_1', c_2', c_3', ..., c_n'}

For each c_i in S there must be a c_i' in T which is similar to c_i, and each pair (c_i, c_i') has its own similarity. There is another problem we have to consider when calculating the similarity of the detected fragments: the similarity values of the pairs (c_i, c_i') generally should not all carry the same weight, because the lengths of the detected fragments differ widely. Sometimes one fragment is shorter than 10 bytes, while a long section can exceed 100 bytes. These two cases should not have the same weight in the similarity of the total detected fragments. Hence, it is natural to weight each fragment by its length. Accordingly, we use the following formula for calculating the similarity values:

similarity of CBs(s, t) = Σ_{i=1}^{n} w_i × sim(c_i, c_i')

where

w_i = [ min(size(c_i), size(c_i')) × sim(c_i, c_i') ] / [ Σ_{j=1}^{n} min(size(c_j), size(c_j')) × sim(c_j, c_j') ]

and c_i' = argmax_k sim(c_i, c_k'), for k = 1, ..., n.
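A short sketch of this length-weighted aggregation, assuming the best-matching target fragment c_i' and the pairwise similarities have already been determined (for example with the cosine measure of Section 3.2):

def similarity_of_cbs(pairs: list[tuple[int, int, float]]) -> float:
    """`pairs` holds one (size of c_i, size of c_i', sim(c_i, c_i')) triple per
    detected fragment, with c_i' already chosen as the argmax-similarity match.
    Each weight is proportional to min(size(c_i), size(c_i')) * sim(c_i, c_i')."""
    weights = [min(s_len, t_len) * sim for s_len, t_len, sim in pairs]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum((w / total) * sim for w, (_, _, sim) in zip(weights, pairs))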

3.5.2 Order of Similar Sections

Once similarity has been calculated taking the lengths of the blocks into consideration, we also have to calculate a similarity based on the instruction order of the detected similar fragments. We estimate this value based on how similarly the detected sections are ordered. Fortunately, this is not a new problem: we can use the same mechanism that has been used for comparing two strings [15]. For this we use the Damerau-Levenshtein edit distance, which counts a transposition as a single edit operation. We calculate the similarity of the ordering using the edit distance as follows:

sequence matching(s, t) = 1 - dist(s, t) / max(Length(s), Length(t))

We thus calculate two probabilities: one for the similarity between the instruction frequencies of the two sets of assembly instructions, and one for the similarity of their order in the sequence. If there are sections that satisfy both of the above conditions, we can consider them similar with more certainty. As the two conditions are independent, we can obtain the final similarity value by multiplying the two results. Thus, the formula for calculating the similarity between two sequences is:

Similarity(s, t) = Similarity of CBs(s, t) × sequence matching(s, t)
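A sketch of this order-based score, using the restricted (adjacent-transposition) form of the Damerau-Levenshtein distance over the sequences of matched fragment labels; the paper cites [15] but gives no implementation, so the details below are assumptions:

def damerau_levenshtein(a, b) -> int:
    """Edit distance with insertion, deletion, substitution, and transposition
    of adjacent elements each counted as a single edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def sequence_matching(s, t) -> float:
    """1 - dist(s, t) / max(Length(s), Length(t)), as in the formula above."""
    longest = max(len(s), len(t))
    return (1.0 - damerau_levenshtein(s, t) / longest) if longest else 1.0

def final_similarity(similarity_of_cbs_value: float, s, t) -> float:
    """The two independent scores are multiplied to give the final similarity."""
    return similarity_of_cbs_value * sequence_matching(s, t)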
4. Experimental Results

In this section, we present some experimental results based on the proposed similarity based detection of malicious code. In particular, we experimented to see whether any CB can be distinguished from the others by using the proposed profiling technique. We tested about 100 executable files to observe the result of applying this technique for distinguishing the CBs.

4.1 Test to Distinguish Conditional Statements

The graph in Figure 4 depicts the distribution of the results of calculating similarity with the proposed method for 10 sample cases. We can regard blocks with high similarity as noise, which could be caused by the inclusion of short CBs. Such small CBs lack discriminatory power, as they can easily result in a similarity measure close to 100% with other short blocks.

[Figure 4. Cumulative distribution of false positives.]

To discard such false positives resulting from the presence of short CBs, we excluded short CBs containing fewer than five instructions. We could reduce the false positives by more than 70% on average this way. In addition, by taking two blocks as similar only if their similarity is higher than 90%, we obtained a false positive rate below 0.5% with 100% accuracy.

4.2 Number of Blocks and False Positives

Despite the 0.5% false positive rate of the proposed profiling technique, this is still not enough to decide whether two CBs are really similar, because the executable content can be very large (more than 1 or 2 gigabytes) and might consist of millions of CBs. Note that if detected blocks are contiguous, then there is a higher probability that the series of detected blocks is really similar than for shorter series. We conducted an experiment to measure the relation between the reduction in false positives and the number of contiguous CBs.

Figure 5 shows an example of a similarity matrix comparing the assembly instruction sequences of a source program with a target program. With a threshold fixed at 80%, we could detect similar CBs as in the figure. Some detected blocks were not actually similar and should be flagged as false positives. The graph also depicts the results of our experiment to decide the minimum number of contiguous blocks that would be meaningful. After a large number of comparisons (thousands) between CBs, we found that the false positives decreased by more than 97%.

[Figure 5. Relation between false positives and the number of contiguous blocks.]

4.3 Test Results with Malicious Programs

Next, we conducted an experiment to detect malicious programs installed in a real environment with our detection method. The experiment involved detecting 15 malicious programs of 4 kinds among 100 normal executable files. We detected almost all malicious executable files except one program among the Linux Kernel Modules, and we tried to find the reason why it could not be detected by the proposed method.

After observing the file that could not be detected, we found that the file was short and consisted of many conditional statements. This led us to investigate increasing the number of contiguous similar blocks to make the matching process more effective. Table 3 shows the results.

Table 3. Test results for detecting malware

Malicious program    Mean of similarities
Sniffer              91%
Backdoor             94%
DDOS                 100%
LKM                  91%

4.4 Detection Rate and the Critical Value

The graph in Figure 6 shows the Receiver Operating Characteristic (ROC) curve of the final results. To measure the detection rate of the proposed methods, we tried to find 100 similar files out of 312 executable files. The similar files covered all possible cases of transformation, such as insertions, deletions, substitutions, and transpositions.

[Figure 6. A test result for determining the threshold.]

The graph shows the relation between the detection rate and the false positives. A good critical point is a similarity level after which the inclination of the curve decreases suddenly, because such sharp turns indicate that false positives start increasing significantly at those points. In Figure 6, there are three points we can choose as critical values.

Table 4. Thresholds and detection rates

Threshold    True positive    False positive
0.8          76%              -
0.5          80%              3%
0.1          91%              -

With this experiment we found that an 80% detection rate with 3% false positives is possible with a similarity threshold of 0.5. This means we can decide that two files are similar when their similarity measure is over 0.5. By controlling the critical values for their environments, forensics analysts may gain deeper insights into the characteristics of the malware being analyzed.
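As a rough illustration of this rule of thumb (not a procedure given in the paper), candidate critical values could be flagged where the gain in detection per unit of additional false positives drops below a chosen slope; the (threshold, true positive rate, false positive rate) triples and the slope cut-off below are hypothetical inputs:

def candidate_critical_values(points: list[tuple[float, float, float]],
                              min_slope: float = 1.0) -> list[float]:
    """`points` holds (threshold, true_positive_rate, false_positive_rate) triples,
    ordered from the strictest threshold to the loosest. A threshold is flagged
    when relaxing past it yields little extra detection per extra false positive."""
    candidates = []
    for (thr, tpr0, fpr0), (_, tpr1, fpr1) in zip(points, points[1:]):
        delta_fpr = fpr1 - fpr0
        slope = (tpr1 - tpr0) / delta_fpr if delta_fpr > 0 else float("inf")
        if slope < min_slope:
            candidates.append(thr)
    return candidates

Applied to points read off an ROC curve like the one in Figure 6, such a scan would surface the same kind of knee points that the paper selects by inspection.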
5. Conclusion

In this paper, we presented a method to find out whether a detected executable file is similar to a malicious executable file by comparing assembly instruction sequences. Using existing techniques, we would not be able to detect a modified malicious program without executing it or reverse engineering it. In addition, there exists no appropriate method, to the best of our knowledge, to detect fragments of a malicious executable file which are hidden or deleted. In order to calculate the similarity between two executable files, we described the assembly instruction sequences as a unique profile. By using the assembly instruction sequence based technique, we could find fragments similar to a malicious program even when they are slightly modified. The method could also be used to identify the nature of executable fragments stored in slack space and free data blocks. We summarized our experiments to support the technique we have proposed and tested.

Acknowledgments

This work was supported by the Ministry of Information & Communication, Korea, under the Information Technology Research Center (ITRC) Support Program and the University of Pittsburgh, USA.

References

[1] H. Chen, W. Chung, J. Xu, G. Wang, Y. Qin, M. Chau, "Crime Data Mining: A General Framework and Some Examples," IEEE Computer, April 2004, pp. 50-56.
[2] B. D. Carrier, E. H. Spafford, "Defining Event Reconstruction of Digital Crime Scenes," CERIAS Tech Report 2004-37.
[3] A. J. Marcella, R. S. Greenfield, "Cyber Forensics," Auerbach Publications, 2002.
[4] L. Garber, "EnCase: A Case Study in Computer-Forensic Technology," IEEE Computer Magazine, Jan. 2001, pp. 202-205.
[5] Guidance Software, "EnCase Legal Journal," 2nd Edition, Mar. 2003.
[6] D. Farmer, S. John, W. Venema, "The Coroner's Toolkit (TCT) v1.12," http://www.porcupine.org/forensics/tct.html.
[7] http://www.cerias.purdue.edu/homes/carrier/forensics.html.
[8] U…/library/cc/dig_evi.htm, 2002.
[9] F. Buchholz, E. Spafford, "On the role of file system metadata in digital forensics," Journal of Digital Investigation, Dec. 2004.
[10] K. J. Jones, R. Bejtlich, C. W. Rose, "Real Digital Forensics," Addison-Wesley, 2006.
[11] A. Householder, K. Houle, C. Dougherty, "Computer Attack Trends Challenge Internet Security," IEEE Security & Privacy, 2002.
[12] Intel Corporation, "The IA-32 Intel Architecture Software Developer's Manual," Intel Corporation, 2003.
[13] A. Chuvakin, "Linux Data Hiding and Recovery," http://www.linuxsecurity.com/feature_stories/data-hiding-forensics.html, October 2002.
[14] J. R. Vacca, "Computer Forensics: Computer Crime Scene Investigation," Charles River Media, 2002.
[15] E. Mays, F. J. Damerau, R. L. Mercer, "Context-based spelling correction," Information Processing and Management, 27(5):517-522, 1991.
