2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)

CyberTwitter: Using Twitter to generate alerts for Cybersecurity Threats and Vulnerabilities

Sudip Mittal, Prajit Kumar Das, Varish Mulwad†, Anupam Joshi, Tim Finin
University of Maryland, Baltimore County, Baltimore, MD, USA
{smittal1, prajit1, joshi, finin}@umbc.edu
† GE Global Research
varish.mulwad@ge.com

Abstract: In order to secure vital personal and organizational systems we require timely intelligence on cybersecurity threats and vulnerabilities. Intelligence about these threats is generally available in both overt and covert sources like the National Vulnerability Database, CERT alerts, blog posts, social media, and dark web resources. Intelligence updates about cybersecurity can be viewed as temporal events that a security analyst must keep up with so as to secure a computer system. We describe CyberTwitter, a system to discover and analyze cybersecurity intelligence on Twitter and serve as an OSINT (Open-source intelligence) source. We analyze real time information updates, in the form of tweets, to extract intelligence about various possible threats. We use the Semantic Web RDF to represent the intelligence gathered and SWRL rules to reason over extracted intelligence to issue alerts for security analysts.

I. INTRODUCTION

In the broad domain of security, analysts and policy makers need knowledge about the state of the world to make timely critical decisions, operational/tactical as well as strategic. This knowledge has to be extracted from a variety of different sources, and then represented in a form that will enable further analysis and decision making. Some of the data underlying this knowledge is in textual sources traditionally associated with Open-source Intelligence (OSINT) [1].

OSINT is intelligence gathered from publicly available overt sources such as newspapers, magazines, social networking sites, video sharing sites, wikis, blogs, etc.
In the cybersecurity domain, information available through OSINT can complement data obtained through traditional security systems and monitoring tools like Intrusion Detection and Prevention Systems (IDPS) [2], [3]. Cybersecurity information sources can be divided into two abstract groups: formal sources such as NIST’s National Vulnerability Database (NVD), United States Computer Emergency Readiness Team (US-CERT), etc., and various informal sources such as blogs, developer forums, chat rooms and social media platforms like Twitter¹, Reddit² and Stack Overflow, which provide information related to security vulnerabilities, threats and attacks. A lot of information is published on these sources on a daily basis, making it nearly impossible for a human analyst to manually comb through it, extract relevant information, and then understand the various contextual scenarios in which an attack might take place. A manual approach, even with a large number of human analysts, would be neither efficient nor scalable. Automatically extracting relevant information from OSINT sources has thus received attention from the research community [4]–[6].

¹ https://twitter.com/hashtag/cybersecurity?lang=en
² https://www.reddit.com/r/cybersecurity/

IEEE/ACM ASONAM 2016, August 18-21, 2016, San Francisco, CA, USA
978-1-5090-2846-7/16/$31.00 ©2016 IEEE

The real time nature of information on Twitter has allowed researchers to provide significant insights during ‘high impact events’. It has been used to analyze emergencies like earthquakes [7], forest fires [8], terrorist attacks [9], natural hazards [10] and so on. Such applications and analysis of Twitter data have solidified its reputation as an important OSINT source. On Twitter, several organizations such as Adobe (@AdobeSecurity), GitHub (@githubstatus) and WhatsApp (@wa_status) report on security incidents related to their products.
Individual users, often ethical hackers, also report newly discovered vulnerabilities via Twitter (Figure 1). Such updates can form viable intelligence inputs for human analysts to protect their systems. The detection of, and updates to, various threats and vulnerabilities can be considered cybersecurity events that impact computer systems. Hence, we believe Twitter can be an effective source to gather information about cybersecurity threats.

In this paper, we present CyberTwitter, a framework to analyze tweets about cybersecurity and to issue timely threat alerts to security analysts based on an organization’s ‘system profile’. Alerts generated by CyberTwitter can then serve as an input to various other security systems, which can use them depending upon local organizational security policies. One such system is [2].

In our system, we begin by collecting Twitter data. In the collected tweets we identify, tag and extract various real world conceptual entities related to cybersecurity vulnerabilities, such as means of an attack, consequences of an attack, affected software, hardware, vendors, etc., using a Security Vulnerability Concept Extractor (SVCE) [11]. Concepts and entities extracted by SVCE are then linked to existing concepts and entities present in external, publicly available semantic knowledge bases, to further enrich our extracted data. In our system, this information is represented as a set of RDF triples in a semantic knowledge base. We allow analysts to describe a system profile which captures information about installed software and/or hardware. We develop an intelligence ontology and use it along with SWRL rules to address the time sensitive nature of cybersecurity events. CyberTwitter performs reasoning using this system profile, data in the knowledge base and varying time slices to generate the most relevant and important alerts for human review. Given the sometimes unreliable nature of information on Twitter [12], along with the possibility of different local security and organizational policies, we believe that it is best for a human analyst to be ‘in the loop’ with the system.

Fig. 1: Both users and organizations use Twitter to report potential threats.

Fig. 2: Sample tweet from an individual user about a recent security vulnerability.

II. RELATED WORK

A. Twitter as an OSINT source

The nature of intelligence in any security system is that it has a temporal dimension. A piece of information can be considered important at a given time and useless at some other. In our system we define “Intelligence” as actionable information for a human analyst that makes them aware of a new threat or vulnerability in software or hardware that they are interested in. In our system we analyze information about these temporal cybersecurity events as they appear on Twitter. For example, in 2015, a 72 hour long DDoS attack on GitHub was live tweeted by their status account, and it served as the go-to source of information for the general public [13]. A combination of real time tweets from affected users, ethical hackers, as well as organizational accounts can allow both analysts and systems to infer a pattern or a larger ongoing attack and generate alerts to provide rapid response. The relevancy of these alerts will also depend on the timestamps of tweets, given the highly temporal nature of cyberattacks. Information relevant in one time window is not necessarily relevant in another. Human analysts and policy makers will require alerts that are relevant not only from their system perspective, but also to a particular time window.
For example, any alerts related to the 2015 GitHub DDoS attack would have been valid in that particular time frame. Thus, any such system would have to make this temporal nature an important aspect of its design.

Over the past decade, Twitter has become a vital source for open source intelligence. The social media site’s data has been used by researchers to gather intelligence about the impact of natural disasters [7], [8], terrorist attacks [9], government elections [14], stock market prediction [15], etc. In our work, we are interested in using Twitter as a source of information to study various cybersecurity events. When new vulnerabilities are made public, Twitter users tweet about them (Figures 1 and 2) to spread information on the network so that others can use it to secure their systems. Individuals or reputed security experts like Brian Krebs (an investigative journalist who writes about cybercrime) can be valuable resources for cybersecurity incidents. Established companies like @web_security or @intersecww disseminate news, tips and the latest information on web security, web application protection, hacker incidents, data breaches, penetration testing results, etc. Other organization specific accounts like @githubstatus, @FirebaseStatus, @herokustatus, @stripestatus, @DOStatus (DigitalOcean), @redditstatus, @twitchstatus, @AdobeSecurity, @JuniperSIRT, etc. report on security incidents with respect to their platforms and products. Such organizational accounts are therefore valuable sources of information with respect to cybersecurity events. We wish to use these Twitter updates to mine intelligence about various cybersecurity threats and vulnerabilities; Section III gives details about our system.

The rest of the paper is organized as follows: we discuss related work in Section II and our CyberTwitter framework in Section III. We execute and evaluate our system in Sections IV and V respectively.
We present our conclusions and future work in Section VI.

B. Text Analysis for Cybersecurity

The use of semantic knowledge bases (KB) in cybersecurity has gained traction in the past few years. Considerable attention has been dedicated to developing techniques for extracting concepts related to security vulnerabilities, affected software, hardware, and organizations and generating their semantic representation [16]–[21]. While previous research focused on sources such as NVD and security blogs, our work is applied to Twitter where the content is different from [...]

C. Knowledge Base systems for Cybersecurity

Research has also focused on integrating data from traditional monitoring and security tools with such KBs and reasoning over it for early threat detection and prevention [3], [22]. However, previous work has relied on cybersecurity KBs generated from NVDs and blogs that are often updated post facto, i.e. after the threat or the vulnerability has been known for some time and has been vetted by various security professionals and analysts, whereas CyberTwitter generates personalized alerts using a KB that is updated in real time based on a ‘user’s system profile’.

Type              Software
Operating System  Ubuntu
Software          Adobe Flash
Software          Java
Browser           Chromium Browser / Google Chrome
Browser           Firefox
Extension         Adobe Flash Player (Chromium)
Version values (fused in the source): 11245.0.221.0.0.216-r1

TABLE I: User System Profile.

B. Tweet Collection

CyberTwitter collects data through the Twitter Stream API³ based on a set of keywords. These keywords are derived from the “User System Profile” and a list of cybersecurity terms (see Figure 4). For our system we limit ourselves to tweets in the English language⁴. After collecting a good number of tweets we clean the data using WordNet, which is a large lexical database for English [24].

III. CYBERTWITTER FRAMEWORK

We develop CyberTwitter, a framework to automatically issue cybersecurity vulnerability alerts to users (Figure 3). CyberTwitter begins by collecting relevant tweets by querying the Twitter API. The Tweet Collection module collects, cleans and stores tweets returned by the API.
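The keyword-driven collection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `build_keywords` and `matches`, the example security terms, and the in-memory list of tweets are all hypothetical, and a real deployment would hand the keywords to the streaming endpoint rather than filter locally.

```python
def build_keywords(profile, security_terms):
    """Combine software names from a system profile with generic security terms."""
    names = {entry["software"].lower() for entry in profile}
    terms = {term.lower() for term in security_terms}
    return sorted(names | terms)


def matches(tweet_text, keywords):
    """Return True when any keyword occurs in the tweet text (case-insensitive)."""
    text = tweet_text.lower()
    return any(keyword in text for keyword in keywords)


# Hypothetical profile entries, loosely following the Type/Software columns of Table I.
profile = [
    {"type": "Operating System", "software": "Ubuntu"},
    {"type": "Browser", "software": "Firefox"},
]
# Hypothetical cybersecurity terms; the paper's own list is given in its Figure 4.
security_terms = ["vulnerability", "exploit", "DDoS"]

keywords = build_keywords(profile, security_terms)

# Stand-in for tweets arriving from the stream.
tweets = [
    "New Firefox exploit found in the wild",
    "Lovely weather in Baltimore today",
]
collected = [tweet for tweet in tweets if matches(tweet, keywords)]
```

Substring matching is used here only to keep the sketch short; it would also match keywords embedded in longer words, so a tokenized comparison may be preferable in practice.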
Every tweet is further processed by the Security Vulnerability Concept Extractor (SVCE) [11], which extracts various terms and concepts related to security vulnerabilities. Intelligence from these terms and concepts is then converted to RDF statements using our intelligence ontology. We use the Unified Cybersecurity Ontology (UCO) [20] to provide our system with cybersecurity domain information. The RDF Linked Data representation is stored in our “Cybersecurity Knowledge Base”, allowing our alert system to reason over the data. Finally, we issue alerts to the end user based on a “User System Profile”. We explain the details and sub-modules of our system in the next few subsections.

Our system can be divided into two major parts. The first is a dynamically populated “Cybersecurity Knowledge Base” that contains information about cybersecurity threats and vulnerabilities. The second is an alert system that issues alerts to the end user based on their “User System Profile” using the “Cybersecurity Knowledge Base”.

Fig. 4: Data collection keywords.

C. Security Vulnerability Concept Extractor

The Security Vulnerability Concept Extractor (SVCE) consists of a custom Named Entity Recognizer (NER) [11] which extracts terms related to security vulnerabilities. The NER was trained using text from security blogs, Common Vulnerabilities and Exposures (CVE) descriptions and official security bulletins from Microsoft and Adobe. It tags every sentence with the following concepts: means of an attack, consequence of an attack, affected software, hardware and operating system, version numbers, network related terms, file names and other technical terms.

The use of the custom NER provides us with multiple advantages. SVCE discards all tweets for which the NER fails to identify even a single concept, thus further cleaning up the data.
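The discard step can be sketched as a small filter over tagged tokens. This is an illustration only: it assumes the extractor returns a list of (token, tag) pairs in which the tag 'O' marks a token that is not a concept, as in the labelled output of Fig. 5 (tag names are written here without the trailing commas seen there). The function names and the `min_concepts` parameter are hypothetical.

```python
def concept_tags(tagged_tokens):
    """Return the tags of all tokens labelled as a concept (any tag other than 'O')."""
    return [tag for _, tag in tagged_tokens if tag != "O"]


def keep_tweet(tagged_tokens, min_concepts=1):
    """Keep a tweet only if at least `min_concepts` tokens were tagged as concepts."""
    return len(concept_tags(tagged_tokens)) >= min_concepts


# Hypothetical tagged output for two tweets.
router_tweet = [
    ("ASUS", "PRODUCT"), ("wireless", "OTHER"), ("router", "OTHER"),
    ("updates", "O"), ("MITM", "ATTACK"), ("attack", "O"),
]
weather_tweet = [("nice", "O"), ("day", "O")]

kept = [t for t in (router_tweet, weather_tweet) if keep_tweet(t)]
```

Calling `keep_tweet(tagged, min_concepts=2)` gives the stricter two-tag threshold described in the Filtering and Cleaning Data subsection.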
The extracted concepts are also used to generate an RDF Linked Data representation for every tweet that may be queried by security systems to protect against potential attacks.

Example tweet:
ASUS wireless router updates are vulnerable to a MITM attack l?id 20071

SVCE Output:
[[(u'ASUS', u'PRODUCT,'), (u'wireless', u'OTHER,'), (u'router', u'OTHER,'),
  (u'updates', u'O'), (u'are', u'O'), (u'vulnerable', u'O'), (u'to', u'O'),
  (u'a', u'O'), (u'MITM', u'ATTACK,'), (u'attack', -details.html?id 20071',u'O')]]

Fig. 5: Labelled output generated by the Security Vulnerability Concept Extractor (SVCE).

A. User System Profile

We obtain information about the user’s system and store it in the “User System Profile” file. The profile contains information about the operating system and the installed software, along with their version information. We use the profile information as part of our rules. The system information is converted into SWRL rules [23] (see Section III-F), which allows us to reason over it and generate cybersecurity alerts. A sample “User System Profile” is shown in Table I.

Fig. 3: CyberTwitter: A framework for monitoring and analyzing tweets related to cyber attacks.

D. Filtering and Cleaning Data

In our “Cybersecurity Knowledge Base” we wanted to store highly relevant tweets only. We filter tweets out based on the output of our Security Vulnerability Concept Extractor (SVCE). In our system we only keep those tweets which contain two or more tags as generated by our SVCE. Such a threshold helps us realize the goal of including only highly relevant tweets in our knowledge base. We discuss the impact of this threshold in Section V.

³ https://dev.twitter.com/docs/streaming-api
⁴ -parameters#language

E. Cybersecurity Ontologies and Knowledge Bases

A data feed sent through the Twitter Stream API essentially consists of a stream of strings that computers can process. However, in the real world, strings represent terms and concepts that may sometimes be ambiguous, and computers are not programmed to handle ambiguity. Computer systems can be aided in this task by various Semantic Web technologies that represent the real world as concepts. These concepts are then associated with Uniform Resource Identifiers (URIs) [25]. For example, the string “Apple” can be associated with the company Apple Inc. or the fruit apple. Also, these concepts can have various attributes and relations to other concepts. An entity ‘Apple’ can have an attribute ‘type’ with a value ‘organization’ or ‘plant’. These attributes are vital for differentiating between two completely different concepts that share the same spelling.

For an intelligent system like CyberTwitter, it is vital to understand the difference between various real world concepts and also to possess comprehensive knowledge of the cybersecurity domain. In this paper we use various publicly available cybersecurity ontologies and knowledge bases to support information integration and cyber-situational awareness:

1) UCO: Unified Cybersecurity Ontology [20]: The ontology integrates heterogeneous data and knowledge schemas from different cybersecurity systems and standards.
2) DBpedia [26]: DBpedia is a project to extract structured content from the information created as part of the Wikipedia project⁵.
3) YAGO (Yet Another Great Ontology) [27]: It is a knowledge base automatically extracted from Wikipedia and other sources.

⁵ https://wikimediafoundation.org/wiki/Our_projects
⁶ ight
⁷ https://github.com/yago-naga/aida
⁸ http://dbpedia.org/page/Adobe_Flash

We have used UCO to provide our system with knowledge about the cybersecurity domain. We use DBpedia and YAGO to link the output generated by our Security Vulnerability Concept Extractor (SVCE) to real world concepts. The entity matching process is performed using the DBpedia⁶ [26] and YAGO⁷ APIs with the MaxHits parameter set to 1. For example, we can use DBpedia to map the string “Adobe Flash” to dbr:Adobe_Flash⁸. Both these external knowledge bases help us map string entities to real world conceptual instances. The output from

the SVCE module lists various cybersecurity related entities in textual tweets, such as means of an attack, consequence of an attack, affected software, etc. We use UCO, DBpedia and YAGO to link these entities to real world concepts. After entity linking we store the linked data as RDF triples [28] in our “Cybersecurity Knowledge Base”.

In our CyberTwitter system we need information about cybersecurity events. Events are temporal in nature. Though UCO gives us a domain overview of cybersecurity, it cannot handle the temporal nature of events. To handle time in events we create an Intelligence ontology.

In our system we define ‘Intelligence’ as actionable information for the human analyst that makes them aware of a new threat or vulnerability in software or hardware that they list in their user system profile. The nature of intelligence in any security system is that it has a temporal dimension. A piece of information can be considered vital at a given time and useless at some other instance of time. So to incorporate time we included the following properties in the ontology:

1) 2) 3) 4) 5) 6)

The ‘Cybersecurity Knowledge Base’ is vital for our system as it provides us with a set of rules and information in the form of triples on which we can reason so as to issue vulnerability and threat alerts to the user. The end user can also be given access to the Knowledge Base, which they can query through a SPARQL interface [29] whose query language is quite similar to SQL.

F. Cybersecurity Alerts

In the final module of CyberTwitter we generate and issue alerts using the cybersecurity knowledge base and the user system profile. After creating the knowledge base we need an intelligent system to reason over various RDF statements and evaluate whether the system should raise an alert to inform the user about a potential threat or vulnerability that may exist.

After creating the cybersecurity knowledge base we include various SWRL rules [23] in our system.
SWRL rules contain two parts: an antecedent part (body) and a consequent part (head).
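The antecedent/consequent shape of such a rule can be mimicked in plain Python to show the reasoning step. This is not SWRL and not the authors' rule set: the triple predicates `affectsProduct` and `observedAt`, the identifiers `intel1` to `intel3`, and the 24-hour window are hypothetical choices made only for this sketch, and a small in-memory set of tuples stands in for the RDF knowledge base.

```python
from datetime import datetime, timedelta

# Hypothetical (subject, predicate, object) triples standing in for the knowledge base.
triples = {
    ("intel1", "affectsProduct", "Adobe Flash"),
    ("intel1", "observedAt", "2016-05-10T09:00:00"),
    ("intel2", "affectsProduct", "Some Router"),
    ("intel2", "observedAt", "2016-05-10T10:00:00"),
    ("intel3", "affectsProduct", "Firefox"),
    ("intel3", "observedAt", "2016-05-01T08:00:00"),
}


def objects(subject, predicate):
    """Return all objects stored for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def alerts(profile_software, now, window):
    """Antecedent: an intelligence item affects a profiled product AND falls
    inside the time window. Consequent: that item is raised as an alert."""
    raised = []
    for subject in sorted({s for s, _, _ in triples}):
        products = objects(subject, "affectsProduct")
        times = [datetime.fromisoformat(t) for t in objects(subject, "observedAt")]
        in_profile = any(p in profile_software for p in products)
        recent = any(timedelta(0) <= now - t <= window for t in times)
        if in_profile and recent:
            raised.append(subject)
    return raised


now = datetime(2016, 5, 10, 12, 0, 0)
raised = alerts({"Adobe Flash", "Firefox"}, now, timedelta(hours=24))
```

Here `intel2` is skipped because its product is not in the profile, and `intel3` is skipped because it falls outside the time window, which reflects the point that intelligence relevant in one time window is not necessarily relevant in another.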
