Mining Security Risks From Massive Datasets - Virginia Tech

Transcription

Mining Security Risks from Massive DatasetsFang LiuDissertation submitted to the Faculty of theVirginia Polytechnic Institute and State Universityin partial fulfillment of the requirements for the degree ofDoctor of PhilosophyinComputer Science & ApplicationDanfeng Yao, ChairWenjing LouAli R. ButtB. Aditya PrakashDongyan XuJune 28, 2017Blacksburg, VirginiaKeywords: Cyber Security, Big Data Security, Mobile Security, Data Leakage DetectionCopyright 2017, Fang Liu

Mining Security Risks from Massive DatasetsFang Liu(ABSTRACT)Cyber security risk has been a problem ever since the appearance of telecommunication andelectronic computers. In the recent 30 years, researchers have developed various tools toprotect the confidentiality, integrity, and availability of data and programs.However, new challenges are emerging as the amount of data grows rapidly in the big dataera. On one hand, attacks are becoming stealthier by concealing their behaviors in massivedatasets. One the other hand, it is becoming more and more difficult for existing tools tohandle massive datasets with various data types.This thesis presents the attempts to address the challenges and solve different security problems by mining security risks from massive datasets. The attempts are in three aspects:detecting security risks in the enterprise environment, prioritizing security risks of mobileapps and measuring the impact of security risks between websites and mobile apps. First, thethesis presents a framework to detect data leakage in very large content. The framework canbe deployed on cloud for enterprise and preserve the privacy of sensitive data. Second, thethesis prioritizes the inter-app communication risks in large-scale Android apps by designingnew distributed inter-app communication linking algorithm and performing nearest-neighborrisk analysis. Third, the thesis measures the impact of deep link hijacking risk, which is onetype of inter-app communication risks, on 1 million websites and 160 thousand mobile apps.The measurement reveals the failure of Google’s attempts to improve the security of deeplinks.

Mining Security Risks from Massive DatasetsFang Liu(GENERAL AUDIENCE ABSTRACT)Cyber security risk has been a problem ever since the appearance of telecommunication andelectronic computers. In the recent 30 years, researchers have developed various tools toprevent sensitive data from being accessed by unauthorized users, protect program and datafrom being changed by attackers, and make sure program and data to be available wheneverneeded.However, new challenges are emerging as the amount of data grows rapidly in the big dataera. On one hand, attacks are becoming stealthier by concealing their attack behaviors inmassive datasets. On the other hand, it is becoming more and more difficult for existingtools to handle massive datasets with various data types.This thesis presents the attempts to address the challenges and solve different security problems by mining security risks from massive datasets. The attempts are in three aspects:detecting security risks in the enterprise environment where massive datasets are involved,prioritizing security risks of mobile apps to make sure the high-risk apps being analyzedfirst and measuring the impact of security risks within the communication between websitesand mobile apps. First, the thesis presents a framework to detect sensitive data leakage inenterprise environment from very large content. The framework can be deployed on cloudfor enterprise and avoid the sensitive data being accessed by the semi-honest cloud at thesame time. Second, the thesis prioritizes the inter-app communication risks in large-scaleAndroid apps by designing new distributed inter-app communication linking algorithm andperforming nearest-neighbor risk analysis. The algorithm runs on a cluster to speed up thecomputation. The analysis leverages each app’s communication context with all the otherapps to prioritize the inter-app communication risks. Third, the thesis measures the impactof mobile deep link hijacking risk on 1 million websites and 160 thousand mobile apps. Mobile deep link hijacking happens when a user clicks a link, which is supposed to be openedby one app but being hijacked by another malicious app. Mobile deep link hijacking is onetype of inter-app communication risks between mobile browser and apps. The measurementreveals the failure of Google’s attempts to improve the security of mobile deep links.

AcknowledgmentsI would like to express my deepest gratitude to my mentor Dr. Danfeng (Daphne) Yao forproviding me the opportunity to be part of the Yao group! She introduced me to cybersecurity research and had since guided me through the most interesting and exciting research topics in the past five years. I have learned so much over the past five years fromDr. Yaos knowledge and expertise, and the knowledge and skills gained in this learningexperience build the most solid foundation for my dissertation work. I also greatly appreciate Dr. Yaos kindness and generosity. She is always available to provide insights on myresearch when I am lost and offer support when I am discouraged. I would not have survivedthe past five challenging years of graduate school without her encouragement and inspiration.I would also like to express my sincere gratitude to Dr. Wenjing Lou, Dr. Ali R. Butt,Dr. B. Aditya Prakash, Dr. Dongyan Xu, and Dr. Babara Ryder who generously contributed much time and energy to help improve my dissertation. I have benefited a lot fromtheir diverse perspectives and insightful comments which help both deepen and broaden myviews in my research. I also want to thank Dr. Gang Wang for providing guidance on themobile deep link project. It was such a valuable and enjoyable learning experience despiteit being short in duration.I would like to thank my friends and peers who I worked with over the past five years:Dr. Kui Xu, Dr. Xiaokui Shu, Dr. Hao Zhang, Dr. Karim Elish, Dr. Haipeng Cai, DanielBarton, Ke Tian, Dr. Long Cheng, Sazzadur Rahaman, Stefan Nagy, Alex Kedrowitsch,Andres Pico, Bo Li, Yue Cheng, Kaixi Hou, Zheng Song, Tong Zhang, Qingrui Liu andXinwei Fu. Their companionship has brought so much joy into the challenging and stressfulgraduate school life!Finally, I would like to thank my parents, my sister, and my wife for always having faith inme and loving me. Their consistent support instills in me the strongest motivation to keeplearning more, working harder, and getting better.iv

ContentsList of FiguresixList of Tablesxii1 Introduction11.1Detect Data Leakage in Large Datasets . . . . . . . . . . . . . . . . . . . . .21.2Prioritize Inter-communication Risks in Large-scale Android Apps . . . . . .31.3Measure Hijacking Risks of Mobile Deep Links . . . . . . . . . . . . . . . . .41.4Document Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . .42 Review of Literature62.1Data Leakage Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62.2Mobile Inter-app Communication . . . . . . . . . . . . . . . . . . . . . . . .93 Detect Data Leakage3.13.212Threat Model, Security and Computation goals . . . . . . . . . . . . . . . .143.1.1Threat Model and Security Goal . . . . . . . . . . . . . . . . . . . . .153.1.2Computation Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . .163.1.3Confidentiality of Sensitive Data . . . . . . . . . . . . . . . . . . . . .17Technical Requirements and Design Overview . . . . . . . . . . . . . . . . .183.2.1MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .183.2.2MapReduce-Based Design and Challenges . . . . . . . . . . . . . . .193.2.3Workload Distribution . . . . . . . . . . . . . . . . . . . . . . . . . .20v

3.2.43.33.43.5Detection Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . .Collection Intersection in MapReduce. . . . . . . . . . . . . . . . . . . . .223.3.1Divider Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . .223.3.2Reassembler Algorithm . . . . . . . . . . . . . . . . . . . . . . . .243.3.3Example of the Algorithms . . . . . . . . . . . . . . . . . . . . . . . .253.3.4Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . .27Security Analysis and Discussion . . . . . . . . . . . . . . . . . . . . . . . .273.4.1Privacy Guarantee . . . . . . . . . . . . . . . . . . . . . . . . . . . .283.4.2Collisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .283.4.3Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29Implementation and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . .313.5.1Performance of A Single Host . . . . . . . . . . . . . . . . . . . . . .323.5.2Optimal Size of Content Segment . . . . . . . . . . . . . . . . . . . .333.5.3Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .343.5.4Performance Impact of Sensitive Data. . . . . . . . . . . . . . . . .353.5.5Plain Text Leak Detection Accuracy . . . . . . . . . . . . . . . . . .363.5.6Binary Leak Detection Accuracy . . . . . . . . . . . . . . . . . . . .364 Prioritize Inter-app Communication Risks4.14.22139Models & Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .414.1.1Threat Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .414.1.2Security Insights of Large-ScaleInter-app Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . .424.1.3Computational Goal . . . . . . . . . . . . . . . . . . . . . . . . . . .454.1.4The Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .46Distributed ICC Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . .464.2.1Identify ICC Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . .474.2.2Identify ICC Edges and Tests . . . . . . . . . . . . . . . . . . . . . .474.2.3Multiple ICCs Per App Pair . . . . . . . . . . . . . . . . . . . . . . .48vi

4.2.4Workload Balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . .494.2.5Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . .49Neighbor-based Risk Analysis . . . . . . . . . . . . . . . . . . . . . . . . . .504.3.1Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .504.3.2Hijacking/Spoofing Risk . . . . . . . . . . . . . . . . . . . . . . . . .514.3.3Collusion Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .544.4.1Q1: Results of Risk Assessment . . . . . . . . . . . . . . . . . . . . .544.4.2Q2: Manual Validation . . . . . . . . . . . . . . . . . . . . . . . . . .574.4.3Q3: Attack Case Studies . . . . . . . . . . . . . . . . . . . . . . . . .584.4.4Q4: Runtime of MR-Droid . . . . . . . . . . . . . . . . . . . . . . . .594.5Security Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . .604.6Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .614.34.45 Measure Mobile Deep Link Risks5.162Background and Research Goals . . . . . . . . . . . . . . . . . . . . . . . . .635.1.1Mobile Deep Links . . . . . . . . . . . . . . . . . . . . . . . . . . . .645.1.2Security Risks of Deep Linking . . . . . . . . . . . . . . . . . . . . .645.1.3Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . .665.2Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .685.3Deep Link Registration by Apps . . . . . . . . . . . . . . . . . . . . . . . . .695.3.1Extracting URI Registration Entries . . . . . . . . . . . . . . . . . .695.3.2Scheme URL vs. App Link . . . . . . . . . . . . . . . . . . . . . . . .70Security Analysis of App Links . . . . . . . . . . . . . . . . . . . . . . . . .715.4.1App Link Verification . . . . . . . . . . . . . . . . . . . . . . . . . . .715.4.2Over-Permission Vulnerability . . . . . . . . . . . . . . . . . . . . . .735.4.3Summary of Vulnerable Apps. . . . . . . . . . . . . . . . . . . . . . .74Link Hijacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .755.5.1755.45.5Characterizing Link Collision . . . . . . . . . . . . . . . . . . . . . .vii

5.65.75.5.2Detecting Malicious Hijacking . . . . . . . . . . . . . . . . . . . . . .765.5.3Hijacking Results and Case Studies . . . . . . . . . . . . . . . . . . .79Mobile Deep Links on The Web . . . . . . . . . . . . . . . . . . . . . . . . .815.6.1Intent URL Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . .815.6.2Measuring Hijacking Risk on Web . . . . . . . . . . . . . . . . . . . .82Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .846 Conclusions and Future Work86Bibliography88viii

List of Figures3.1An example illustrating the difference between set intersection and collectionintersection in handling duplicates for 3-grams. . . . . . . . . . . . . . . . .16Overview of MapReduce execution process. Map takes each hkey, valuei pairas input and generates new hkey, valuei pairs. The output hkey, valuei pairsare redistributed according to the keys. Each reduce processes the list of valueswith the same key and writes results back to DFS. . . . . . . . . . . . . . .193.3Workload distribution of DLD provider and data owner. . . . . . . . . . . .213.4An example illustrating Divider and Reassembler algorithms, with fourMapReduce nodes, two content collections C1 and C2 , and two sensitive datacollections S1 and S2 . M, R, Redi stand for map, reduce, and redistribution,respectively. hkey, valuei of each operation is shown at the top. . . . . . . .25DLD provider’s throughput with different sizes of content segments. For eachsetup (line), the size of the content analyzed is 37 GB. . . . . . . . . . . . .31Data owner’s pre-processing overhead on a workstation with different sizes ofcontent segments. The total size of content analyzed is 37 GB. . . . . . . . .31Throughputs with different numbers of nodes on a local cluster and AmazonEC2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .31Throughput with different amounts of content workload. Each content segment is 37 MB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34Runtime of Divider and Reassembler algorithms. The Divider operationtakes 85% to 98% of the total runtime. The Y-axis is in 10 based log scale.343.10 Runtime with a varying size of sensitive data. The content volume is 37.5 GBfor each setup. Each setup (line) has a different size for content segments. .343.11 Runtimes of Divider and Reassembler with different values of γ. The runtimesdrop linearly when the value of γ goes small. The result is consistent with thecomplexity analysis in Table 3.2. . . . . . . . . . . . . . . . . . . . . . . . . .373.23.53.63.73.83.9ix

3.12 Plain text leak detection accuracy with different thresholds. The detectionrate is 1 with very low false positive rate when the threshold ranges from 0.18to 0.8. In Figure (b), the false positive rates of the two lines are very close. .373.13 Binary leak detection accuracy with different thresholds. The detection rateis 1 with very low false positive rate when the threshold ranges from 0.35 to0.9. In Figure (a), the detection rates of the four lines are very close. . . . .374.1Illustration of data flows, inter-app Intent matching, and ICC graph. MyMapReduce algorithms compute the complete ICC graph information for aset of apps. internal ICCs are excluded. . . . . . . . . . . . . . . . . . . . . .44The workflow for analyzing Android app pairs with MapReduce. The dashedvertical lines indicate redistribution processes in MapReduce. E, I, S representexplicit edge, implicit edge, and sharedUserId edge respectively. . . . . . . .454.3Feature value distribution and classification. . . . . . . . . . . . . . . . . . .514.4The heat maps of the number of app pairs across app categories. The ticklabel indexes represent 24 categories. Detailed index-category mapping inTable 4.7. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .564.5Analysis time of the three phases in my approach. . . . . . . . . . . . . .605.1Three types of mobile deep links: Scheme URL, App Link and Intent URL. .645.2URI syntax for Scheme URLs and App links. . . . . . . . . . . . . . . . . . .645.3App link verification process. . . . . . . . . . . . . . . . . . . . . . . . . .655.4# of new schemes and app link hosts per app between 2014 and 2016. . . . .705.5% of apps w/deep links; apps are divided by download count. . . . . . . .705.6# of Collision apps per scheme. . . . . . . . . . . . . . . . . . . . . . . . . . .765.7# of Collision apps per web host. . . . . . . . . . . . . . . . . . . . . . . . . .765.8# of collision apps per scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . .795.9# of collision apps per host. . . . . . . . . . . . . . . . . . . . . . . . . . . . .795.10 Deep link distribution among Alexa top 1 million websites. Website domainsare sorted and divided into 20 even-sized bins (50K sites per bin). I reportthe % of websites that contain deep links in each bin. . . . . . . . . . . . .805.11 Number of websites that host deep links for each app. . . . . . . . . . . . . .815.12 Different type of hijacked deep links in Alexa1M. . . . . . . . . . . . . . . .814.2x

5.13 Webpages that contain hijacked deep links in Alexa1M. . . . . . . . . . . . .xi81

List of Tables3.1Notations used in my MapReduce algorithms. . . . . . . . . . . . . . . . . .173.2Worst-case computation and communication complexity of each phase in myMapReduce algorithm and that of the conventional hashtable-based approach.I denote the average size of a sensitive data collection by S, the average sizeof a content collection by C, the number of sensitive data collections by m,the number of content collections by n, and the average intersection rate byγ [0, 1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .27Detection throughputs (Mbps) on a single host with different content sizes andsensitive data sizes. * indicates that the system crashes before the detection isfinished. Note that, for shingle size to be 8, the size of input for the hashtableis 8 times the size of content. . . . . . . . . . . . . . . . . . . . . . . . . . .333.4Setup of binary leak detection. . . . . . . . . . . . . . . . . . . . . . . . . . .384.1Nodes in ICC graph and their attributes. type is either “Activity”, “Service”or “Broadcast”. pointType is either “Explicit”, “Implicit” or “sharedUserId”. 464.2Tests completed at different MapReduce phases and the edges that the testsare applicable to. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3.3484.3Summary of how three types of edges in ICC graph are obtained based onsource and sink information and transformed into key, value pairs in Map1 . 484.4Worst-case complexity of different phases of my approach . . . . . . . . . . .494.5Features in my security risk assessment . . . . . . . . . . . . . . . . . . . . .504.6Numbers of apps vulnerable to Intent spoofing/hijacking attacks and potentially colluding app pairs reported by my technique, and percentages (in parentheses) manually validated as indeed vulnerable or colluding, per risk categoryand level. The DroidBench test result is not included in this table. . . . . .53App categories in the evaluation dataset . . . . . . . . . . . . . . . . . . . .554.7xii

5.15.2Conditions for whether users will be prompted after clicking a deep link on Android. App Links always have at least one matched app, the mobile browser. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .67Two snapshots of Android apps collected in 2014 and 2016. 115,399 appsappear in the both datasets; 48,923 apps in App2014 are no longer listed onthe market in 2016; App2016 has 49,564 new apps. . . . . . . . . . . . . . .685.3App Link verification statistics and common mistakes (App2016) based ondata from January 2017 and May 2017. One app can make multiple mistakes. 725.4Association files for iOS and Android obtained after scanning 1,012,844 domains. 735.5Top 10 schemes and app link hosts with link collisions in App2016. I manuallylabel them into three types: F Functional, P Per-App, T Third-party775.6Filtering and classification results for schemes and App link hosts (App2016).775.7Features used for scheme classification. . . . . . . . . . . . . . . . . . . . . .785.8Number of deep links (and webpages that contain deep links) in Alexa top 1million web domains. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81Sensitive parameters in mobile deep links. . . . . . . . . . . . . . . . . . . .835.9xiii

xiv

Chapter 1IntroductionCyber security has been a study topic ever since the appearance of telecommunication andelectronic computer. A vulnerability could lead to the compromise of confidentiality (e.g.,data leak), integrity (e.g., authentication bypass) and availability (e.g., denial of service) ofdata or program. Due to the prevalence of vulnerabilities, security risks are everywhere.For example, ComDroid [53] reported that over 97% of the studied apps are vulnerable tocommunication hijacking attacks. One of the direct consequence of the security risks is dataleakage. In the year of 2014 only, over 1 billion data records have been leaked as reported bySafeNet[32], a data protection company. The leaked records include email addresses, creditcard numbers, etc. A survey conducted by Kaspersky Lab on 4438 companies indicates thatabout 83% of the companies experienced data leak incidents [84].Although various tools have been developed in the recent 30 years to detect and fix thesecurity risks, new challenges are emerging in the big data era. First, new attacks arebecoming stealthier. They conceal their behaviors in large datasets. It is almost impossible todetect their malicious behaviors without correlating the data from multiple sources. Second,the size of datasets for mining security risks are growing rapidly. Existing approaches runningon one single host are not able to process huge datasets. New detection algorithms runningon distributed systems are needed for mining security risks. Third, the exposure of largedatasets poses threats to the confidentiality of private information. The attackers are morepowerful and are able to correlate multiple data sources and compromise the confidentialityof sensitive data. What is even worse is that enterprises are moving their computation tocloud, which requires extra privacy-preserving techniques.This thesis presents the attempts to address the challenges by detecting, prioritizing andmeasuring security risks from massive datasets. Specifically, it focuses on the detection ofdata leakage in enterprise environments, the prioritization of inter-app communication risksand the measurement of deep link hijacking impact on mobile apps and the web.Data Leakage The exposure of sensitive data is a serious threat to the confidentiality of1

2Chapter 1. Introductionorganizational and personal data. Reports showed that over 800 million sensitive records wereexposed in 2013 through over 2,000 incidents [32]. Reasons include compromised systems,the loss of devices, or unencrypted data storage or network transmission. While many dataleak incidents are due to malicious attacks, a significant portion of the incidents is caused byunintentional mistakes of employees or data owners. In this thesis, I focus on the detectionof accidental/unintentional data leaks from very large datasets.Inter-app Communication Risks Inter-Component Communication (ICC) is a key mechanismfor app-to-app communication in Android, where components of different apps are linkedvia messaging objects (or Intents). While ICC contributes significantly to the developmentof collaborative applications, it also becomes a predominant security attack surface. In thecontext of inter-app communication scenarios, individual apps often suffer from risky vulnerabilities such as Intent hijacking and spoofing, resulting in leaking sensitive user data [53]. Inaddition, ICC allows two or more malicious apps to collude on stealthy attacks that none ofthem could accomplish alone [36, 100]. According to a recent report from McAfee Labs [11],app collusions are increasingly prevalent on mobile platforms. In this thesis, I focus on theprioritization of the large number of ICC risks and detect the high risk apps.Deep Link Hijacking Risks With the wide adoption of smartphones, mobile websites andnative apps have become the two primary interfaces to access online content [12, 115]. Userscan easily launch apps from websites with preloaded context, which becomes instrumental tomany key user experiences. The key enabler of web-to-mobile communication is mobile deeplinks. Like web URLs, mobile deep links are universal resource identifiers (URI) for contentand functions within apps [131]. The most widely used deep link is scheme URL supported byboth Android [7] and iOS [8] since 2008. Despite the convenience, researchers have identifiedserious security vulnerabilities in scheme URLs [42, 53, 138]. The most significant one islink hijacking, where one app can register another app’s scheme and induce the mobile OSto open the wrong app. This allows the malicious apps to perform phishing attacks (e.g.,displaying a fake Facebook login box) or to steal sensitive data carried by the link (e.g.,PII) [53]. To address the problem, Google has designed App Link and Intent URLs . Thethesis measures whether the new mechanisms are helpful in addressing the problems andwhat is the current impact of hijacking on mobile phones and on web.In the rest of the chapter, I briefly introduce the problems and my contributions in detectdata leakage, prioritize inter-app communication risks and measure security impact of deeplink hijacking.1.1Detect Data Leakage in Large DatasetsSeveral approaches have been proposed to detect data exfiltration [24, 29, 30, 122] either ona host or the network. The technique proposed by Shu and Yao [122] performs deep packetinspection to search for exposed outbound traffic that bears high similarity to sensitive data.

1.2. Prioritize Inter-communication Risks in Large-scale Android Apps3Set intersection is used for the similarity measure. This similarity-based detection is versatile,capable of analyzing both text and some binary-encoded context (e.g., Word or .pdf files).However, a naive implementation of the similarity-base approach requires O(nm) complexity,where n and m are sizes of the two sets A and B, respectively. If A and B are both very large(as in my data-leak detection scenario), It would not be practical due to memory limitationand thrashing. Distributing the computation to multiple hosts is nontrivial to implementfrom scratch and has not been reported in the literature.In this thesis, I present a new MapReduce-based system to detect the occurrences of plaintextsensitive data in storage and transmission. The detection is distributed and parallel, capableof screening massive amount of content for exposed information.In addition, I preserve the confidentiality of sensitive data for outsourced detection. In myprivacy-preserving data-leak detection, MapReduce nodes scan content in data storage ornetwork transmission for leaks without learning what the sensitive data is.I implement the algorithms using the open source Hadoop framework. The implementationhas very efficient intermediate data representations, which significantly minimizes the diskand network I/O overhead. I achieved 225 Mbps throughput for the privacy-preserving dataleak detection when processing 74 GB of content.1.2Prioritize Inter-communication Risks in Large-scaleAndroid AppsTo assess the ICC vulnerabilities, various approaches have been proposed to analyze eachindividual app [19, 36, 53, 103, 136]. The advantage of analyzing each app separately is that itcan achieve high scalability. However, it may produce overly conservative risk estimations [53,60, 103]. Manually investigating all the alerts would be a daunting task for security analysts.Recently, researchers start to co-analyze ICCs of two or more apps simultaneously. Thisallows researchers to gain empirical contexts on the actual communications between appsand produce more relevant alerts [22, 90, 118]. However, existing solutions are largely limitedin scale due to the high complexity of pair-wise components analyses (O(N 2 )). They wereeither applied to a much smaller set of apps or small sets of inter-app links.In this thesis, I present MR-Droid, a MapReduce-based parallel analytics system for accurateand scalable ICC risk detection. I empirically evaluate ICC risks based on an app’s interconnections with other real-world apps, and detect high-risk pairs.To construct the communication context of each app, I construct a large-scale ICC graph,where each node is an app component and the edge represents the corresponding inter-appICCs. To gauge the risk level of the ICC pairs (edge weight), I extract various features basedon app flow analyses that indicate vulnerabilities. With the ICC graph, I then rank the risk

4Chapter 1. Introductionlevel of a given app by aggregating all its ICC edges connected to other apps.The system is implemented with a set of new MapReduce [57] algorithms atop the Hadoopframework. The high-level parallelization from MapReduce allows us to analyze millions ofapp pairs within hours using commodity servers. My evaluation of 11,996 apps demonstratesthe scalability of MR-Droid. The entire process took less than 25 hours for an average ofonly 0.0012 seconds per app pair. My prioritization result achieves over 90% of true positivein detecting high risk apps.1.3Measure Hijacking Risks of Mobile Deep LinksScheme URLs are known to be vulnerable to various attacks [42, 53, 138]. The most significant one is link hijacking Recently, two new

massive datasets. On the other hand, it is becoming more and more di cult for existing tools to handle massive datasets with various data types. This thesis presents the attempts to address the challenges and solve di erent security prob-lems by mining security risks from massive datasets. The attempts are in three aspects: