Data Mining Methods For Malware Detection

Transcription

DATA MINING METHODS FOR MALWARE DETECTIONbyMUAZZAM AHMED SIDDIQUIB.E. NED University of Engineering and Technology, 1999M.S. University of Central Florida, 2004A dissertation submitted in partial fulfillment of the requirementsfor the degree of Doctor of Philosophyin Modeling and Simulationin the College of Sciencesat the University of Central FloridaOrlando, FloridaSummer Term2008Major Professor: Morgan C. Wang

c 2008 Muazzam Ahmed Siddiquiii

ABSTRACTThis research investigates the use of data mining methods for malware (malicious programs) detection and proposed a framework as an alternative to the traditional signature detection methods.The traditional approaches using signatures to detect malicious programs fails for the new and unknown malwares case, where signatures are not available. We present a data mining framework todetect malicious programs. We collected, analyzed and processed several thousand malicious andclean programs to find out the best features and build models that can classify a given programinto a malware or a clean class. Our research is closely related to information retrieval and classification techniques and borrows a number of ideas from the field. We used a vector space modelto represent the programs in our collection. Our data mining framework includes two separateand distinct classes of experiments. The first are the supervised learning experiments that used adataset, consisting of several thousand malicious and clean program samples to train, validate andtest, an array of classifiers. In the second class of experiments, we proposed using sequential association analysis for feature selection and automatic signature extraction. With our experiments,we were able to achieve as high as 98.4% detection rate and as low as 1.9% false positive rate onnovel malwares.iii

To Asma, without you, this journey would not have been possible.To Erum, for always keeping faith in me.To Ammi & Abbu, for all your prayers.iv

TABLE OF CONTENTSLIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviLIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xviiiCHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . .11.1Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21.1.1Computer Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21.1.2Malware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21.1.3Spam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .21.1.4Phishing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .31.1.5Exploits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3Malware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .31.2.1Virus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .31.2.2Worm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .41.2.3Trojan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .41.2.4Spyware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .41.2.5Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4Virus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .41.3.1Classification by Target . . . . . . . . . . . . . . . . . . . . . . . . . . . .51.3.1.151.21.3Boot Sector Virus . . . . . . . . . . . . . . . . . . . . . . . . .v

1.3.1.2File Virus . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51.3.1.3Macro Virus . . . . . . . . . . . . . . . . . . . . . . . . . . . .5Classification by Self-Protection Strategy . . . . . . . . . . . . . . . . . .51.3.2.1No Concealment . . . . . . . . . . . . . . . . . . . . . . . . . .61.3.2.2Code Obfuscation . . . . . . . . . . . . . . . . . . . . . . . . .61.3.2.3Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . .61.3.2.4Polymorphism . . . . . . . . . . . . . . . . . . . . . . . . . . .61.3.2.5Metamorphism . . . . . . . . . . . . . . . . . . . . . . . . . . .71.3.2.6Stealth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7Worm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71.4.1Activation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71.4.1.1Human Activation . . . . . . . . . . . . . . . . . . . . . . . . .81.4.1.2Human Activity-Based Activation . . . . . . . . . . . . . . . . .81.4.1.3Scheduled Process Activation . . . . . . . . . . . . . . . . . . .81.4.1.4Self Activation . . . . . . . . . . . . . . . . . . . . . . . . . . .8Payload . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81.4.2.1None . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81.4.2.2Internet Remote Control . . . . . . . . . . . . . . . . . . . . . .91.4.2.3Spam-Relays . . . . . . . . . . . . . . . . . . . . . . . . . . . .9Target Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .91.4.3.1Scanning Worms . . . . . . . . . . . . . . . . . . . . . . . . . .91.4.3.2Flash Worms . . . . . . . . . . . . . . . . . . . . . . . . . . . .91.3.21.41.4.21.4.3vi

1.4.41.51.61.4.3.3Metaserver Worms . . . . . . . . . . . . . . . . . . . . . . . . .91.4.3.4Topological Worms . . . . . . . . . . . . . . . . . . . . . . . . 101.4.3.5Passive Worms . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.4.4.1Self-Carried . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101.4.4.2Second Channel . . . . . . . . . . . . . . . . . . . . . . . . . . 101.4.4.3Embedded . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Trojans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.5.1How Trojans Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121.5.2Trojan Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.5.2.1Remote Access Trojans . . . . . . . . . . . . . . . . . . . . . . 131.5.2.2Password Sending Trojans . . . . . . . . . . . . . . . . . . . . . 131.5.2.3Keyloggers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131.5.2.4Destructive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141.5.2.5Mail-Bomb Trojan . . . . . . . . . . . . . . . . . . . . . . . . . 141.5.2.6Proxy/Wingate Trojans . . . . . . . . . . . . . . . . . . . . . . 141.5.2.7FTP Trojans . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151.5.2.8Software Detection Killers . . . . . . . . . . . . . . . . . . . . 15Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15CHAPTER 2 LITERATURE REVIEW . . . . . . . . . . . . . . . . . . . . . . . . 172.1Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18vii

2.1.12.1.22.22.3Detection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.1.1.1Scanning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182.1.1.2Activity Monitoring . . . . . . . . . . . . . . . . . . . . . . . . 192.1.1.3Integrity Checking . . . . . . . . . . . . . . . . . . . . . . . . . 192.1.1.4Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19Feature Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.1.2.1N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.1.2.2API/System calls . . . . . . . . . . . . . . . . . . . . . . . . . 202.1.2.3Assembly Instructions . . . . . . . . . . . . . . . . . . . . . . . 202.1.2.4Hybrid Features . . . . . . . . . . . . . . . . . . . . . . . . . . 202.1.2.5Network Connection Records . . . . . . . . . . . . . . . . . . . 212.1.3Analysis Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212.1.4Detection Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22Scanners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.2.1Static Misuse Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 232.2.2Hybrid Misuse Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 24Activity Monitors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.3.1API/System Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242.3.1.1Static Anomaly Detection . . . . . . . . . . . . . . . . . . . . . 242.3.1.2Hybrid Anomaly Detection . . . . . . . . . . . . . . . . . . . . 252.3.1.3Static Misuse Detection . . . . . . . . . . . . . . . . . . . . . . 252.3.1.4Dynamic Anomaly Detection . . . . . . . . . . . . . . . . . . . 26viii

2.3.22.3.32.3.1.5Hybrid Misuse Detection . . . . . . . . . . . . . . . . . . . . . 272.3.1.6Dynamic Misuse Detection . . . . . . . . . . . . . . . . . . . . 27Hybrid Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.3.2.1Static Misuse Detection . . . . . . . . . . . . . . . . . . . . . . 272.3.2.2Dynamic Misuse Detection . . . . . . . . . . . . . . . . . . . . 28Network Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.3.3.12.4Integrity Checkers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.4.12.5Dynamic Anomaly Detection . . . . . . . . . . . . . . . . . . . 28Static Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 29Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.5.12.5.22.5.3N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.5.1.1Dynamic Misuse Detection . . . . . . . . . . . . . . . . . . . . 312.5.1.2Dynamic Hybrid Detection . . . . . . . . . . . . . . . . . . . . 312.5.1.3Static Anomaly Detection . . . . . . . . . . . . . . . . . . . . . 322.5.1.4Static Hybrid Detection . . . . . . . . . . . . . . . . . . . . . . 322.5.1.5Static Misuse Detection . . . . . . . . . . . . . . . . . . . . . . 35API/System calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352.5.2.1Dynamic Anomaly Detection . . . . . . . . . . . . . . . . . . . 352.5.2.2Static Hybrid Detection . . . . . . . . . . . . . . . . . . . . . . 36Assembly Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372.5.3.1Static Misuse Detection . . . . . . . . . . . . . . . . . . . . . . 372.5.3.2Static Hybrid Detection . . . . . . . . . . . . . . . . . . . . . . 38ix

2.5.4Hybrid Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382.5.4.1Static Anomaly Detection . . . . . . . . . . . . . . . . . . . . . 382.5.4.2Static Hybrid Detection . . . . . . . . . . . . . . . . . . . . . . 392.5.4.3Hybrid Hybrid Detection . . . . . . . . . . . . . . . . . . . . . 40CHAPTER 3 METHODOLOGY . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413.1Research Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413.2Learning Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.3Vector Space Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443.4Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.53.4.1Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.4.2Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.4.3Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473.4.4Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . 473.4.5Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473.4.6Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483.4.7Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . 48Performance Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48CHAPTER 4 SUPERVISED LEARNING EXPERIMENTS . . . . . . . . . . . . . 504.1Subset Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524.1.1Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524.1.1.1Malware Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 53x

4.1.24.1.34.24.1.1.2Disassembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534.1.1.3Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 544.1.1.4Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 544.1.1.5Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . 564.1.1.6Independence Test . . . . . . . . . . . . . . . . . . . . . . . . . 56Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574.1.2.1Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . 574.1.2.2Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . 574.1.2.3Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . 57Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58Trojans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594.2.14.2.2Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604.2.1.1Malware Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 614.2.1.2File Size Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 614.2.1.3Disassembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.2.1.4Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624.2.1.5Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 634.2.1.6Primary Feature Selection . . . . . . . . . . . . . . . . . . . . . 634.2.1.7Independence Test . . . . . . . . . . . . . . . . . . . . . . . . . 644.2.1.8Secondary Feature Selection/Reduction . . . . . . . . . . . . . . 64Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654.2.2.1Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65xi

4.3Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . 654.2.2.3Support Vector Machines . . . . . . . . . . . . . . . . . . . . . 654.2.3Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664.2.4Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66Worms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684.3.14.3.24.44.2.2.2Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684.3.1.1Malware Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 694.3.1.2Disassembly . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704.3.1.3Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 704.3.1.4Feature Selection/Reduction . . . . . . . . . . . . . . . . . . . . 714.3.1.5Independence Test . . . . . . . . . . . . . . . . . . . . . . . . . 724.3.1.6Feature Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 724.3.1.7Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . 734.3.1.8Principal Component Analysis . . . . . . . . . . . . . . . . . . 73Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744.3.2.1Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744.3.2.2Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . 74Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74CHAPTER 5 UNSUPERVISED LEARNING EXPERIMENTS . . . . . . . . . . . 875.1Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885.1.1Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88xii

5.1.25.1.35.1.1.1Data Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . 885.1.1.2Data Transformation . . . . . . . . . . . . . . . . . . . . . . . . 89Sequence Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895.1.2.1Feature Extraction Process . . . . . . . . . . . . . . . . . . . . 895.1.2.2Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 905.1.2.3Independence Test . . . . . . . . . . . . . . . . . . . . . . . . . 90Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915.1.3.15.1.45.2Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . 91Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91Automatic Malware Signature Extraction . . . . . . . . . . . . . . . . . . . . . . 925.2.1Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 935.2.1.15.2.25.2.3Data Transformation . . . . . . . . . . . . . . . . . . . . . . . . 93Signature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 945.2.2.1Basic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 945.2.2.2Modified Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 96Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99CHAPTER 6 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101CHAPTER 7 FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1027.1Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1027.2Signature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1037.3PE Header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103xiii

7.4Stratified Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047.5Real Time Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047.5.1Time and Space Complexity: . . . . . . . . . . . . . . . . . . . . . . . . . 1047.5.2Cost Sensitive Learning: . . . . . . . . . . . . . . . . . . . . . . . . . . . 1047.5.3Class Distribution: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057.6Combining Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1057.7More Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106xiv

LIST OF FIGURES4.1Typical 32-Bit Portable .EXE File Layout . . . . . . . . . . . . . . . . . . . . . . 514.2Portion of the output of disassembled actmovie.exe. . . . . . . . . . . . . . . . . . 534.3Instruction sequences extracted from actmovie.exe . . . . . . . . . . . . . . . . . 544.4Scatter plot of sequence frequencies and their frequency of occurrence. . . . . . . . 554.5Scatter plot of sequence frequencies and their frequency of occurrence with outliersremoved. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554.6Data preprocessing steps for experiments with the malware subset. . . . . . . . . . 564.7ROC curve comparing regression, neural network and decision tree training results. 584.8ROC curve comparing regression, neural network and decision tree test results. . . 594.92007 Malware distribution statistics from Trend Micro. . . . . . . . . . . . . . . . 604.10 Portion of the output of disassembled Win32.Flood.A trojan. . . . . . . . . . . . . 634.11 Instruction sequences extracted from the disassembled Win32.Flood.A trojan. . . . 634.12 Data preprocessing steps for experiments with trojans. . . . . . . . . . . . . . . . 644.13 ROC curve comparing random forest test results on datasets with all variables, RFvariables and PCA variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664.14 ROC curve comparing random forest, bagging and SVM test results on datasetwith RF selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674.15 Portion of the output of disassembled Netsky.A worm. . . . . . . . . . . . . . . . 704.16 Instruction sequences extracted from the disassembled Netsky.A worm. . . . . . . 714.17 Data preprocessing steps for experiments with worms. . . . . . . . . . . . . . . . 71xv

4.18 ROC curves comparing random forest results for each n-gram size using all variables. 794.19 ROC curves comparing random forest results for each n-gram size using randomforest feature selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804.20 ROC curves comparing random forest results for each n-gram size using PCA feature reduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 804.21 ROC curves comparing bagging results for each n-gram size using all variables. . . 814.22 ROC curves comparing bagging results for each n-gram size using random forestfeature selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814.23 ROC curves comparing bagging results for each n-gram size using PCA featurereduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824.24 ROC curves comparing random forest and bagging results using random forestfeature selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824.25 ROC curves comparing the variable selection methods (all variables, RF variables,PCA variables) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835.1ROC curves comparing the results for the features selected using sequence association mining only and the combined feature set. . . . . . . . . . . . . . . . . . . . 92xvi

LIST OF TABLES4.1Experimental results for new and unknown viruses. . . . . . . . . . . . . . . . . . 594.2Area under the ROC curve for each classifier. . . . . . . . . . . . . . . . . . . . . 604.3Packers/Compilers Analysis of Trojans . . . . . . . . . . . . . . . . . . . . . . . . 614.4Packers/Compilers Analysis of Trojans and Clean Programs . . . . . . . . . . . . 614.5File Size Analysis of the Program Collection . . . . . . . . . . . . . . . . . . . . . 624.6Experimental results for new and unknown trojans. . . . . . . . . . . . . . . . . . 664.7Area under the ROC curve for each classifier for each dataset. . . . . . . . . . . . 674.8Packers/Compilers Analysis Details of Worms and Clean Programs . . . . . . . . . 694.9Packers/Compilers Analysis Summary of Worms and Clean Programs . . . . . . . 694.10 Features Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724.11 Reduced Feature Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734.12 Experimental results for new and unknown worms. . . . . . . . . . . . . . . . . . 754.13 Area under the ROC curve for each classifier. . . . . . . . . . . . . . . . . . . . . 835.1Experimental results for new and unknown malwares using sequential associationanalysis as well as a combined selection method. . . . . . . . . . . . . . . . . . . 915.2Truth table for the basic algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . 955.3Truth table for the modified algorithm for clean files. . . . . . . . . . . . . . . . . 975.4Truth table for the modified algorithm for malwares. . . . . . . . . . . . . . . . . . 985.5Truth table for the modified algorithm for malwares and clean programs . . . . . . 98xvii

5.6Experimental results for new and unknown malwares using automatically extractedsignatures by applying sequential association mining. . . . . . . . . . . . . . . . . 100xviii

CHAPTER 1INTRODUCTIONComputer virus detection has evolved into malicious program detection since Cohen first formalized the term computer virus in 1983 [Coh85]. Malicious programs can be classified into viruses,worms, trojans, spywares, adwares and a variety of other classes and subclasses that sometimesoverlap and blur the boundaries among these classes [Szo05]. Both traditional signature baseddetection and generalized approaches can be used to identify these malicious programs. To avoiddetection by the traditional signature-based algorithms, a number of stealth techniques have beendeveloped by the malicious code writers. The inability of traditional signature based detectionapproaches to catch these new breed of malicious programs has shifted the focus of virus researchto find more generalized and scalable features that can identify malicious behavior as a processinstead of a single signature action.We present a data mining approach to the problem of malware detection. Data mining hasbeen a recent focus of malware detection research. Section 2.5 presents a literature review of thedata mining techniques used for virus and worm detection. Our approach combines the ease of anautomated syntactic analysis with the power of semantic level analysis requiring human expertiseto reflect critical program behavior that can classify between malicious and benign programs. Weperformed initial experiments and developed a data mining framework that serves as a platform forfurther research.1

1.1 DefinitionsThis section introduces various computer security terms to the reader. Since our work deals withcomputer virus and worms, a more detailed account of these categories is presented than any otherarea in security.1.1.1 Computer SecurityComputer security is the effort to create a secure computing platform, designed so that agents(users or programs) can only perform actions that have been allowed.1.1.2 MalwareAny program that is purposefully created to harm the computer system operations or data is termedas malicious programs. Malicious programs include viruses, worms, trojans, backdoors, adwares,spywares, bots, rootkits etc. All malwares are sometimes loosely termed as virus (viruses, worms,trojans specifically) Commercial anti-malware products are still called antivirus.1.1.3 SpamSpam is abuse of electronic messaging systems to send unsolicited bulk messages Most widelyrecognized form is email spam. Also includes IM spam, blog spam, discussion forum spam, cellphone messaging spam etc.2

1.1.4 PhishingPhishing can be defined as a criminal activity using social engineering techniques. The mostcommon example of phishing is an email asking to enter account/credit card information for ecommerce websites (ebay, amazon etc) and online banking.1.1.5 ExploitsAn exploit is a piece of software, a chunk of data, or sequence of commands that take advantageof a bug, glitch or vulnerability in order to cause unintended or unanticipated behavior to occuron computer software, hardware, or something electronic (usually computerized). This frequentlyincludes such things as gaining control of a computer system or allowing privilege escalation or adenial of service attack.1.2 MalwareMalware is a relatively new term that gets its name from malicious software. Many users are stillunfamiliar with the term. Most of the time all malware are grouped under the umbrella term virus.For the detection of malware, the industry still use the term antivirus not antimalware, though theproduct caters all malware.1.2.1 VirusComputer virus is a self replicating code (including possibly evolved copies of itself) that infectsother executable programs. Viruses usually need human intervention for replication and execution.3

1.2.2WormComputer worm is a self replicating stand alone program that spreads on computer networks.Worms usually do not need any extra help from a user to replicate and execute.1.2.3TrojanTrojan horses or simply trojans are programs that perform some malicious activity under the guiseof some normal program. The term is derived from the classical myth of the Trojan Horse.1.2.4 SpywareSpyware is computer software that is installed surreptitiously on a personal computer to interceptor take partial control over the user’s interaction with the computer, without the user’s informedconsent.1.2.5 OthersOther malware programs include logic bombs, that are an intended malfunction of a legitimateprogram, rootkits, that are a set of hacker tools intended to conceal running process from operatingsystem, backdoor, that is a method to bypass normal computer authentication.1.3VirusComputer virus is a self replicating code (including possibly evolved copies of itself) that infectsother executable programs. Viruses usually need human intervention for replication and execution.4

1.3.1 Classification by TargetThis section define target as the means exploited by the virus for execution. Based upon the targetviruses can be classified into three major classes.1.3.1.1Boot Sector VirusMaster Boot Record (Boot sector in DOS) is a piece of code that runs every time a computersystem is booted. Boot sector virus infect the MBR on the disk, hence getting the privilege ofgetting executed every time the computer system starts up.1.3.1.2File VirusFile virus is the most common form of viruses. They infect the file system on a computer. Filevirus infect executable programs and are executed every time the infected program is run.1.3.1.3Macro VirusMacro virus infect documents and templates instead of executable programs. It is written in amacro programming language that is built into applications like Microsoft Word or Excel. Macrovirus can be automatically executed every time the document is opened with the application.1.3.2 Classification by Self-Protection StrategySelf-protection strategy can be defined as the technique used by a virus to avoid detection. Inother words, the anti-antivirus techniques. Based upon self-protection strategies, viruses can beclassified into the following categories.5

1.3.2.1No ConcealmentBased upon the self-protection strategy the first category can be defined as the one without anyconcealment. The virus code is clean without any garbage instructions or encryption.1.3.2.2Code ObfuscationCode obfuscation is a technique developed to avoid specific-signature detection. These includeadding no-op instructions, unnecessary jumps etc, so the virus code look muddled and the signaturefails.1.3.2.3EncryptionThe next line of defense by the virus writers to defeat signature detection was code encryption.Encrypted viruses use an encrypted virus bo

DATA MINING METHODS FOR MALWARE DETECTION by MUAZZAM AHMED SIDDIQUI B.E. NED University of Engineering and Technology, 1999 . 4.6 Data preprocessing steps for experiments with the malware subset. . . . . . . . . . 56 4.7 ROC curve comparing regression, neural network and decision tree training results. 58 .