Cybersecurity Data Science

Transcription

Cybersecurity Data Science: Best Practices from the Field
Scott Allen Mongeau (scott.mongeau@sas.com)
Cybersecurity Data Scientist – SAS Institute
Lecturer / PhD candidate – Nyenrode Business University
@SARK7 #CSDS2020 #FloCon19
Copyright 2019 Scott Mongeau. All rights reserved.
This presentation is non-commercial, for educational purposes only.
All opinions herein are my own, not those of my employer.
No rights claimed or implied to 3rd party materials.

INTRODUCTION
Cybersecurity Data Science practitioner – SAS Institute
Lecturer / PhD candidate – Nyenrode Business University
Qualitative research: 43 global cybersecurity data scientists
Key challenges and best practices
Organizational & methodological guidance
Book early 2020 #CSDS2020: ‘Cybersecurity Data Science: Prescribed Best Practices’

Research Motivation: Genesis in Six Memes

Three Year Genesis of This Talk
FloCon 2017 – San Diego: Interest in data analytics percolates. But cautious: ‘I’ll know it when I see it’

2017: “THE CAUTIOUS TRADITIONALISTS”

Three Year Genesis of This Talk
FloCon 2017 – San Diego: Interest in data analytics percolates. But cautious: ‘I’ll know it when I see it’
FloCon 2018 – Tucson: Spike in analytics and machine learning cases. But questions emerge: ‘How do we get from here to there?’

2018: “THE DATA REVOLUTIONARIES”

2018: SAY ‘DATA SCIENCE’ ONE MORE TIME!

Three Year Genesis of This Talk
FloCon 2017 – San Diego: Interest in analytics percolates. But: ‘I’ll know it when I see it’
FloCon 2018 – Tucson: Spike in analytics and ML cases. But: ‘How do we get there?’
FloCon 2019 – New Orleans: Deafening market / vendor buzz. But, caveats abound: ‘Many are drowning in data lakes’

2019: Drowning in Data Lakes

2019: ONE DOES NOT SIMPLY “PUSH A DEEP LEARNING MODEL TO PRODUCTION”

2019: But substantial issues grow

2019: Reactive militarization

2019: CSDS (CyberSecurity Data Science)
[Diagram: DATA SCIENCE METHODS; CYBERSECURITY GOALS]

2019: CSDS (CyberSecurity Data Science) – Taking stock
“Data scientists and practitioners can talk past each other.”
Rapid emergence
Early stages of professionalization
Affected by maturity of ‘data science’ more generally

Data Science in 30 Seconds
[Figure: analytics types built on DATA ENGINEERING, with an OVERHEAD axis running from LOW to HIGH. Recoverable labels: DESCRIPTIVE; DIAGNOSTIC; Business Intelligence; Network Context & Meaning; Factors & Causes; Forecasting & Probabilities]
See YouTube lectures: https://bit.ly/SS9rCT

CSDS Interview Research: What Type of Data Science is CSDS?

Participants - Sample
43 participants
130 years collective CSDS experience (3 yr mean)
LinkedIn search: ‘cybersecurity’ (‘data scientist’ or ‘analytics’)
350 professionals globally
Direct outreach
Follow-on referrals
Gating to exclude ‘ceremonial CSDS’, i.e. sales, recruiting, marketing, technology strategists

Demographic Profile (n = 43)
Current Region      n     %
North America      27   63%
Western Europe     10   23%
Asia / Pacific      2    5%
Eastern Europe      2    5%
Middle East         1    2%
South America       1    2%
Total              43  100%
25% (n = 11) relocated from native region
19% (n = 8) relocated to US specifically
12% (n = 5) relocated from Asia to US
Gender              n     %
Male               38   88%
Female              5   12%
Total              43  100%
Current Industry: Software & Services; Consulting; Finance/Svcs/Ins; Government / military; Consumer products; Academics / le

Demographic Profile (n = 43)
Measure            Mean  StdDev
Age*                 37       9
# Yrs Employed*      15      10
# Yrs CSDS*           3       3
* Estimates inferred from LinkedIn profile data

Interview Questions and Analysis: 43 Cybersecurity Data Scientists (Dis-)Agree

CSDS Practitioner Interview Research
Qualitative: Open Response 30 Minute Interviews
ENTRY: How did you become involved in domain?
What TRENDS are emerging?
What are perceived central CHALLENGES?
What are key BEST PRACTICES?
METHODS: Borrowing from adjacent domains?
THREATS: Trends on the adversarial side?

Methodology: Interview Topic Labeling (CODING)
Inductive Extrapolation and Deductive Refinement
Engine: SAS Contextual Analysis
Text analytics processing: Natural Language Processing (NLP); Latent Semantic Indexing (LSI); Singular Value Decomposition (SVD)
Concept clustering: Agglomerative multi-doc; Divisive unique doc
Topic extraction, sample term groups:
  scientist, science, activity, data scientist, cyber
  instance, positive, false, false positive, obtain
  behavior, anomaly, detection, attack, false
  right, risk, day, case, aspect
  machine, machine learning, learning, industry, ml
  quality, process, process, collection, data quality
  cyber security, tool, little, hard, malicious
  tool, integrate, job, user, knowledge
[Figure: term clouds produced by topic extraction over the interview transcripts]
Content analytics extrapolated themes; Practitioner review; Domain literature review; Key topics (codes)
‘Coding’ of processed interview transcripts
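
The LSI / SVD step named on this slide can be illustrated with a small sketch. This is not the SAS Contextual Analysis implementation, whose internals the slides do not describe. It is a rank-1 latent semantic indexing example in plain Python, using power iteration to approximate the top singular triplet of a term-document matrix. The four snippets in DOCS and all function names are hypothetical stand-ins for processed interview transcripts.

```python
import math
from collections import Counter

# Hypothetical snippets standing in for processed interview transcripts.
DOCS = [
    "false positive alerts overwhelm anomaly detection",
    "anomaly detection creates false positive alerts",
    "data quality limits machine learning",
    "machine learning needs data quality and labels",
]

def term_document_matrix(docs):
    """Rows are terms, columns are documents, cells are raw term counts."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    counts = [Counter(doc.split()) for doc in docs]
    return vocab, [[c[term] for c in counts] for term in vocab]

def top_singular_triplet(a, iterations=300):
    """Power iteration on A^T A: largest singular value with its term and doc vectors."""
    n_terms, n_docs = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_docs)] * n_docs  # start from a uniform document vector
    for _ in range(iterations):
        u = [sum(a[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]  # u = A v
        w = [sum(a[i][j] * u[i] for i in range(n_terms)) for j in range(n_docs)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(n_docs)) for i in range(n_terms)]
    sigma = math.sqrt(sum(x * x for x in u))
    return sigma, [x / sigma for x in u], v

vocab, matrix = term_document_matrix(DOCS)
sigma, term_loadings, doc_loadings = top_singular_triplet(matrix)

# The five terms loading most heavily on the dominant latent concept.
top_terms = sorted(zip(vocab, term_loadings), key=lambda pair: -pair[1])[:5]
print(round(sigma, 3), [term for term, _ in top_terms])
```

In this toy corpus the dominant concept gathers the alerting vocabulary shared by the first two snippets, which is the kind of term grouping a human coder would then review and name as a topic. A real pipeline would add weighting (e.g. TF-IDF), stop-word removal, and more than one retained dimension.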

CSDS Objectives - Conceptual Model for Responses
Framing and Relationships Amongst ... CHALLENGES

Threats & Adjacent Domains: CSDS Professional Perspectives

THREATS: 13 Adversarial Trends
Internal threats
Inherent vulnerabilities
Reverse engineering detection
Automated attacks increasing
Exploiting new tech vectors
Social engineering
Ransomware-as-a-service
Crypto-jacking
Continual adaptation
State actors machine learning
Time-to-detection / dwell time
Industry-specific attacks
Adversarial ML
Callouts:
  White hat tools (i.e. PEN testing) often quickly end up being repurposed for black hat purposes
  Adversarial objectives evolve to optimize economic risk-reward
  Much disagreement, from indignant disbelief to notion of manifest destiny
  Adversarial ML, i.e. reverse engineering and confusing / tricking ML models (seeding false data). Although a ‘hot topic’ in academic research, few indications of incidents.

METHODS: 8 Influential Adjacent Domains
Social & behavioral sciences
Fraud / forensics / criminology
Medical, epidemiological, ecological
Enterprise risk management
Network graph analytics
NLP & semantic engineering
Forecasting / time-series analysis
Computer vision / deep learning
QUOTE: “It is almost a crime how little we learn from the fraud domain being as they have been at it for almost a century.”
QUOTE: “As networks and devices become increasingly complex and intertwined, they begin to resemble organic systems and act in biological ways.”
QUOTE: “Whereas cybersecurity seeks to safeguard, it isn’t going to get very far without quantifying risks and impacts.”
QUOTE: “Still a work in progress, and one does need to step over the hype, but there are some early indications that deep learning can be quite efficacious if one is handling immense amounts of labeled data.”

CHALLENGES: Perceived CSDS Gaps

Challenges: 12 Topics
ORGANIZATION: Confusion; Marketing hype; Regulatory uncertainty; Few resources
PROCESS: Inherent costs; Decision uncertainty; False alerts volume; Scientific process?
TECHNOLOGY: Data preparation / quality; Own infrastructure & shadow IT?; Normal vs. anomalous?; Lack of labeled incidents

Challenges: 12 Topics, 5 Themes*
* Utilizing exploratory factor analysis (extraction of latent factors)
1. Leadership has ‘lost the plot’
   Uncertainty: nature of threats, what is being protected, how to react
2. Can’t do it all!
   Expansive domain: not cost effective to cover everything in house
3. Between a rock and a hard place
   Rules-based approaches failing, but alternate approaches overhyped
4. Scientific contextualists
   Need to improve representation of environment & tracking of events
5. Data cleansing: ‘the ugly stepchild’
   Critical underinvestment in data engineering to stage analytics

Best Practices: Perceived CSDS Treatments

Best Practices: 26 Topics, 8 Themes*
* Utilizing exploratory factor analysis (extraction of latent factors)
ORGANIZATION: Management-driven change; Training & program governance
PROCESS: Organizational process engineering; Structured risk quantification; Focused scientific processes
TECHNOLOGY: Data engineering practices; Ontologies & normalization; Architecture-driven solutions

Key Guidance: CSDS Gap Prescriptions

Key Prescribed Treatments: Correlation Between Factors
Challenge Themes / Best Practice Themes
1. Leadership has ‘lost the plot’: Management-driven change; Training & program governance
2. Can’t do it all!: Organizational process engineering; Focused scientific processes
3. Between a rock and a hard place: Architecture-driven solutions; Ontologies & normalization
4. Scientific contextualists: Training & program governance; Data engineering practices
5. Data cleansing: ‘the ugly stepchild’ (limits of rules vs. hype)
Best practice themes also listed: Management-driven change; Training & program governance; Structured risk quantification; Focused scientific processes; Data engineering practices; Ontologies & normalization

Organization: Interdisciplinary Collaboration
[Diagram: CYBER RISK ANALYTICS process with RECURSIVE FEEDBACK. Recoverable labels: Data Engineering; Advanced Analytics; Diagnostics & anomaly detection; Behavioral insights; Triage / Validate; Remediate?; CASE MGMT; INVESTIGATOR; Infosec Response]

Organization: Interdisciplinary Collaboration
Collaborate in process reengineering
Collaborate in establishing model context
Admit limits of signatures
Decision & ownership clarity
Training & team building
Orchestrate cross-functional collaboration (incentives)
Call “AI automation” s
Architect exploration and detection processes
Collaborative model building
Model transparency
De-escalate “AI hype cycle”
Core data ‘pipeline’ processing
Facilitate processes / quality
Call “data lake strategy” bluff

People - Process - Technology: Management of Information System

People: Anomaly Detection - Simply Complex
Identifying targeted anomalies amongst an ocean of noise
[Cycle diagram, stages shown: PROBLEM FRAMING; DATA EXPLORATION; TRANSFORM & SELECT; MODEL BUILDING; MODEL VALIDATION; EVALUATE & MONITOR]
SOURCE: Aggarwal, Charu C. (2017). “Outlier Analysis: Second Edition”. Springer International Publishing AG.
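
To make "identifying targeted anomalies amongst an ocean of noise" concrete, here is a minimal sketch of one common univariate technique, the modified z-score based on the median absolute deviation (MAD). It is offered as an illustration only and is not taken from the slides or from the cited book. The function name, the 3.5 threshold, and the sample connection counts are assumptions.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    abs_dev = [abs(v - median) for v in values]
    mad = statistics.median(abs_dev)
    if mad == 0:
        return []  # no spread to measure against, so nothing can be flagged
    # 0.6745 rescales the MAD so scores are comparable to standard deviations
    # when the underlying data are roughly normal.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]

# Hypothetical hourly outbound-connection counts for a single host.
counts = [12, 15, 11, 14, 13, 12, 16, 240, 14, 13]
print(mad_outliers(counts))  # flags index 7, the burst of 240 connections
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them, so the burst cannot mask itself. In practice this covers only the model-building step of the cycle on the slide; framing, validation, and monitoring still decide whether a flagged point matters.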

Process: Analytics Life Cycle
[Diagram labels: Raw Data; Feature Selection; Feature Modeling; Features Engineering; Insights]
SOURCE: SAS, ‘Managing the Analytics Life Cycle for Decisions at Scale’

Technology: Architect Exploratory & Detection Platforms*
Functional Architectural Segmentation
Exploratory ‘big data’ repository
  Analytical models: Descriptive; Unsupervised
Operationally focused detection
  Analytical models: Statistical; Supervised
Feature engineering, i.e. selection, refinement, binning, correlations
Canonical ontology / schemas
* Runs counter to the industry vendor stance of store ‘all-the-data-all-the-time’
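
Two of the feature engineering steps named on this slide, binning and correlations, can be sketched in a few lines. This is an illustrative example only: the function names and the two flow features (bytes_out, packets_out) are hypothetical, and equal-width binning is just one of several binning schemes.

```python
import math

def equal_width_bins(values, n_bins):
    """Map each value to a bin index in [0, n_bins - 1] using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical per-flow features from network telemetry.
bytes_out = [100, 200, 300, 400, 1000]
packets_out = [10, 20, 30, 40, 100]

print(equal_width_bins(bytes_out, 3))
print(round(pearson(bytes_out, packets_out), 6))
```

Here the two features are perfectly correlated, so a detection model gains little from carrying both. Pruning of this kind is one way a narrower, operationally focused feature set can be derived from a broad exploratory repository.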

Summary

Cybersecurity Data Science (CSDS): a work in progress
Process of Professionalization:
Named professionals
Set of methods and techniques
Standards, best practices
Training programs
Certifications
Academic degree programs
Focused research journals
Formal sub-specialization
[Figure labels: Specialist; Researcher; Primary Care; Surgeon; Diagnostician; Emergency Care]

Thank You!
Interested to participate?
Scott Mongeau, Cybersecurity Data Scientist
+31 68 370 3097 (Netherlands, GMT+1)
scott.mongeau@sas.com

REFERENCES

REFERENCES
Aggarwal, C. (2013). “Outlier Analysis.” Springer. http://www.springer.com/la/book/9781461463955
Harris, H., Murphy, S., and Vaisman, M. (2013). “Analyzing the Analyzers.” O’Reilly Media. Available analyzers.csp
Kirchhoff, C., Upton, D., and Winnefeld, Jr., Admiral J. A. (2015 October 7). “Defending Your Networks: Lessons from the Pentagon.” Harvard Business Review. Available at works-lessons-from-the-pentagon
Mongeau, S. (2018). “Cybersecurity Data Science (CSDS).” SCTR7.com. /
Mongeau, S. (2017). “Cybersecurity Big Data Overload?” SCTR7.com. a-overload/
Ponemon Institute. (2017). “When Seconds Count: How Security Analytics Improves Cybersecurity Defenses.” Available at https://www.sas.com/en oves-cybersecurity-defenses-108679.html
SANS Institute. (2015). “2015 Analytics and Intelligence Survey.” Available at https://www.sas.com/en 08031.html
SANS Institute. (2016). “Using Analytics to Predict Future Attacks and Breaches.” Available at https://www.sas.com/en us/whitepapers/sans-
SAS Institute. (2016). “Managing the Analytical Life Cycle for Decisions at Scale.” Available at
SAS Institute. (2017). “SAS Cybersecurity: Counter cyberattacks with your information advantage.” Available at
UBM. (2016). “Dark Reading: Close the Detection Deficit with Security Analytics.” Available at https://www.sas.com/en om/content/dam/SAS/en en analytics-108280.html

APPENDIX

Organization: Building Disciplinary Bridges
Growing pressure/urgency
Structured processes
Cyber general enterprise risk
Meshing discovery, model building/validation, alerting/remediation
Data engineering as a process
Discovery / exploration
Detection / remediation

Key Prescribed Treatments: Correlation Between Factors
Challenge Themes (Factors) / Best Practice Themes (Factors)
1. Leadership has ‘lost the plot’: Management-driven change; Training & program governance
2. Can’t do it all!: Organizational process engineering; Focused scientific processes
3. Between a rock and a hard place: Architecture-driven solutions; Semantic frameworks
4. Scientific contextualists: Training & program governance; Data engineering practices
5. Data cleansing: ‘the ugly stepchild’ (limits of rules vs. hype)
Best practice themes also listed: Management-driven change; Training & program governance; Structured risk quantification; Focused scientific processes; Data engineering practices; Semantic frameworks

Process: Machine Learning Segmentation versus why-it-matters-1255b182fc649

Cybersecurity Analytics Maturity Model
Stages: Anomaly Detection; Data-aware Investigations; Predictive Detection; Risk Awareness / Resource Optimization
Labels: Big data overload; Flags, rules, and alerts; Chasing phantom patterns; Understanding; Learning; Risk Optimal
Items: Feature engineering; Unsupervised ML; Labeling; Diagnostics; Human-in-the-loop reinforcement learning; Semi- and Supervised ML; Champion challenger model management; Automating alert triage; Resource optimization
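
The "champion challenger model management" item can be illustrated with a minimal sketch: a candidate detection rule replaces the incumbent only if it performs better on analyst-labeled events. The promotion criterion (recall no worse, precision strictly better), the two rules, and the sample events are all assumptions made for this example, not something the slides specify.

```python
def precision_recall(predictions, labels):
    """Precision and recall of boolean alert predictions against analyst labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def choose_model(champion, challenger, events, labels):
    """Keep the champion unless the challenger matches its recall with higher precision."""
    champ_p, champ_r = precision_recall([champion(e) for e in events], labels)
    chall_p, chall_r = precision_recall([challenger(e) for e in events], labels)
    if chall_r >= champ_r and chall_p > champ_p:
        return "challenger"
    return "champion"

# Hypothetical detection rules: each maps an event to True (alert) or False.
def champion_rule(event):
    return event["failed_logins"] > 3

def challenger_rule(event):
    return event["failed_logins"] > 3 and event["new_device"]

# Hypothetical events with labels supplied by investigators after triage.
EVENTS = [
    {"failed_logins": 5, "new_device": True},
    {"failed_logins": 4, "new_device": False},
    {"failed_logins": 6, "new_device": True},
    {"failed_logins": 1, "new_device": False},
    {"failed_logins": 7, "new_device": False},
    {"failed_logins": 8, "new_device": True},
]
LABELS = [True, False, True, False, False, True]

print(choose_model(champion_rule, challenger_rule, EVENTS, LABELS))
```

On this sample the challenger removes both false alerts without missing a true incident, so it is promoted. The same comparison depends on the labeling and human-in-the-loop items on the slide, since without investigator labels neither precision nor recall can be computed.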

Cyber Defense Economics: Optimizing Accessibility Versus Exposure
[Figure: (P) Profits plotted against (Q) Quantity of cyber threat assurance, with Profits and Costs curves. Annotations: underinvested; breakeven; Invest to point of optimality]
SOURCE: Partnering for Cyber Resilience: Towards the Quantification of Cyber Threats. WEF report in collaboration with Deloitte: http://www3.weforum.org/docs/WEFUSA QuantificationofCyberThreats Report2015.pdf
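
The "invest to point of optimality" idea on this slide can be written as a standard marginal-analysis condition. This is a generic sketch, not a formula taken from the WEF report. The symbols are assumptions: $B(Q)$ is the benefit obtained from a quantity $Q$ of cyber threat assurance, $C(Q)$ is its cost, and $\pi(Q)$ is the net benefit.

```latex
\[
\pi(Q) = B(Q) - C(Q), \qquad
\left.\frac{d\pi}{dQ}\right|_{Q = Q^*} = 0
\;\Longleftrightarrow\; B'(Q^*) = C'(Q^*)
\]
```

Assuming $B$ is concave and $C$ is convex, an organization is underinvested wherever $B'(Q) > C'(Q)$, since one more unit of assurance returns more than it costs. Past $Q^*$ each additional unit costs more than the benefit it adds.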

The ‘Meta Picture’ for Technologists and Methodologists
Cybersecurity: hybrid techno-economic behavioral context, many latent variables
Research methodology
  Multivariate inferential statistics
  Social science: grounded theory (inductive)
Cross-applicability to ‘core’ cybersecurity? e.g. Increase in complex multi-domain models?
Extrapolating & validating patterns
  Content analysis / text analytics
  Cluster Analysis
  Principal Component Analysis (PCA)
  Discriminant Analysis
  Factor Analysis*: latent factors
  Correspondence Analysis
  Structural equation modeling (SEM)
Extrapolating latent behavioral indicators, i.e.
  User IT ‘technical sophistication’
  ‘Organizational importance’ of a device
  ‘Adversarial determination’
Validating theoretical models
