ESnet WAN Security Updates

Transcription

ESnet WAN Security Updates
Scott Campbell, ESnet Network Security
2018 Technology Exchange, October 16, 2018

Intro to ESnet and Team

Our Mission

ESnet provides the high-bandwidth, reliable connections that link scientists at national laboratories, universities, and other research institutions, enabling them to collaborate on some of the world's most important scientific challenges including energy, climate science, and the origins of the universe. Funded by the DOE Office of Science, ESnet is managed and operated by the Scientific Networking Division at Lawrence Berkeley National Laboratory. As a nationwide infrastructure and DOE User Facility, ESnet provides scientists with access to unique DOE research facilities and computing resources.

ESnet has a strong ethos of leadership in networking innovation. Our team is constantly improving the services to create a more versatile and robust network to serve the emerging needs of scientific researchers. We work closely with the international technical community to develop open-source software and collaborative technical projects. Our programs, as well as our collaborations with other research and education networks around the world, demonstrate our commitment to push network research forward so we will be able to anticipate and flawlessly serve the needs of DOE scientists.

As computational tools such as simulation and visualization become more fine-tuned and sophisticated, they also require the processing and management of greater amounts of data. The sheer volume of data generated by international scientific collaborations and the increasingly distributed nature of large data flows, as well as the development of real time cloud computing, is creating a staggering volume of network traffic. Since 1990, ESnet's average traffic has grown by a factor of 10 every 47 months. This growth trend is only accelerating as international scientific collaborations such as the Large Hadron Collider (LHC), Coupled Model Intercomparison Project (CMIP), and International Thermonuclear Experimental Reactor (ITER), generate and exchange massive amounts of data. ESnet provides a scalable and efficient network infrastructure that enables scientists to optimize their resources and conduct ambitious, world-class research.

ESnet is supervised by the Office of Science, which oversees 10 national laboratories that carry out the missions of its science programs, as well as underwrites research and development projects conducted at additional laboratories overseen by other DOE offices.

Agenda
- Introduction
- The larger problem
- Our Approach
- Security Assist
- Bro on the WAN
- CEASE
- (D)DOS Futures
- ESnet6

Statement of Problem

Prologue: The foreboding bit in the beginning of a movie that provides enough backstory for the plot to make sense.

It has come to our realization that our ability to observe and respond to attacks in/on our control plane network was limited. As well, defining what exactly constitutes normal in a long term perspective for our sites and peering points was not well organized. This is not about getting more data, just having the infrastructure in place to digest this data and provide an idea about what has changed off norm.

Our Approach

For modern scientific research at scale, we see the Network as an indispensable tool/instrument for critical data and communications, which means that it needs to be secure, performant and reliable.
- Most of this work is experimental: iterative design changes as we use what we learn from stage 1 to stage 2 and beyond.
- Listen to the data - informs project direction.
- Be prepared to junk mental models.
- Complexity in design is bad.
- Software discipline is good - testing and automation is a small investment in sanity and longevity.

Security Assist
- Security-Assist will be available to anyone connected to ESnet
- Designed to augment, not replace, Site security
- It is most important that Sites remain in full control of their Site security

Security Assist: Purpose

The Security Assist project is an effort to take advantage of the unique perspective provided by ESnet as a service provider to:
- Provide tools to remediate (upstream) traffic
- Leverage our perspective to give information otherwise too costly or complex to obtain
- Assist smaller sites constrained by personnel or budget issues in a finite, well-defined way

The purpose expressly is not to interfere with local sites' security policy, issues or practices.

The tools and data we are providing/will provide are evolving as we better understand the technology, ecosystem and needs of both the end sites and the underlying network.

Security Assist

[Table: service matrix. The column headings and the alignment of the X marks did not survive transcription. Services listed:]
- BGP Monitoring and Alerting (BGPmon)
- BGP Community triggered BHR
- WAN SME Analysis
- Site Black Hole Routing (BHR)
- External Vulnerability Scanner
- Infrastructure Filtering (against ESnet)
- Upstream Peer Filtering
- Share Custom Bro Policies
- Filtering (5-tuple)
- Full Packet Capture On-Demand

Security Assist: Timeline

Phase 1
- BGP Monitoring and Alerting (BGPmon)
- BGP Community triggered BHR
- WAN SME Analysis
- Site Black Hole Routing (BHR)
- External Vulnerability Scanner

Phase 2
- Infrastructure Filtering (against ESnet)
- Upstream Peer Filtering
- Share Custom Bro Policies

Phase 3
- Filtering (5-tuple)
- Full Packet Capture On-Demand

Background

ESnet has loads of metrics, logs, and sampled netflow data across the entire WAN.

Goals:
- Increase visibility into the traffic traversing our WAN and our control plane via deep packet inspection
- Be better prepared to assist Site security teams by understanding the kinds of attacks they see.

Implementation:
- Due to space and budget limitations, it's unrealistic to install a multi-100G Bro cluster at several WAN locations.
- Be particular about filtering traffic to Bro that has the highest likelihood of being interesting

Phase I: Initial Data Collection

Use spanning port to forward traffic destined to the stub network on each of the routers used for collection.

[Diagram labels: Peering Traffic, R1, R2, Stub Network, ACL/Port Mirror, BOW, Logs to Splunk, ESnet Humans]

Single Instance Architecture

Current Pilot Deployment*

*Plus LBL Datacenter

Multiple Instance Architecture

Potential Future Architecture

Early Success & Lessons Learned

Success
- Control plane traffic really is generally as boring as it should be
  - Anomalies should be easy to spot
- DoS reconnaissance traffic
- Configuration mistakes
  - Including connection attempts to hosts that no longer exist.

Lessons Learned
- ALU/Nokia limitations

Expansion plans
- PNWG (Seattle), LOND, AMST

Agenda
- Introduction
- The larger problem
- Our Approach
- Security Assist
- Bro on the WAN
- CEASE
  - Background
  - Goals
  - Method
  - Architecture
  - Results
- (D)DOS Futures
- ESnet6

CEASE

ESnet has limited ability to mitigate malicious traffic on the WAN, and doing so requires manual investigation and input of ACLs on the routers. CEASE (Correlation Evaluation and Security Enforcement) is a framework that encompasses both data analytics and infrastructure designed to mitigate attacks. The goal with CEASE is to make automated decisions on what is malicious and, as a later stage, stop that traffic as it comes in from our peering points.

Implementation will be in a series of small steps to avoid the unrelenting hilarity that comes from automating defensive measures.

CEASE: Goals

CEASE v.1 is designed to prototype passive monitoring and detection analysis. Focus on identifying network characteristics that are stable enough to be useful, systematically measuring and modeling these characteristics to identify unusual behavior.

This is expressed as a set of goals:
- G1: Identify attacks on network control plane focusing on peering point traffic.
- G2: Identify typical characteristics of network data volume/characteristics for a small number of known sites.
- G3: Develop simple baselines for traffic characteristics at instrumented peering points.
- G4: Topology/BGP Monitoring and Alerting.

CEASE: Goals, Organic Growth
- G1: Identify attacks on network control plane focusing on peering point traffic.
  - G1.1 BOW for DPI analysis of desired data
- G2: Identify typical characteristics of network data volume/characteristics for a small number of known sites.
  - G2.1 Site SNMP producer/consumer ratio
  - G2.2 Site Netflow protocol ratios
  - G2.3 Site Netflow Port Distributions
  - G2.4 Site Netflow Data Distributions
- G3: Develop simple baselines for traffic characteristics at instrumented peering points.
  - G3.1 Peering Netflow protocol ratios
- G4: Topology/BGP Monitoring and Alerting.

CEASE: Goals, Organic Growth, Pruning Goals
- G1: Identify attacks on network control plane focusing on peering point traffic.
  - G1.1 BOW for DPI analysis of desired data (BOW gets its own section; also, our netflow is 1:1000 sampled and the control plane deserves better)
- G2: Identify typical characteristics of network data volume/characteristics for a small number of known sites.
  - G2.1 Site SNMP producer/consumer ratio (redundant data; the theory works, but parallel infrastructure is not a good use of time)
  - G2.2 Site Netflow protocol ratios
  - G2.3 Site Netflow Port Distributions
  - G2.4 Site Netflow Data Distributions
- G3: Develop simple baselines for traffic characteristics at instrumented peering points.
  - G3.1 Peering Netflow protocol ratios
- G4: Topology/BGP Monitoring and Alerting (not completely done - see notes at end)

CEASE: End Result Goals

Identify typical characteristics of network data volume/characteristics for a small number of known sites.
- G1 Site Netflow producer/consumer ratio test
- G2 Site Netflow protocol ratios
- G3 Site Netflow Port Distributions
- G4 Site Netflow Data Distributions

Develop simple baselines for traffic characteristics at instrumented peering points.
- G5 Peering Netflow protocol ratios

CEASE Background: Multiple Sources

[Diagram labels: Feeder Routers, Site]

Data Flow Overview
- Start with one or more raw nfcapd files: test data
- Convert to minimum data fields - raw time series
- Summarize into 1 sec data units - discrete time series
- Store data units in Database
- When all routers present, summarize and look up model data sets
- Build model
- Run test

Data Flow Overview: Raw Files

Start with one or more raw nfcapd files: test data.

Each router has a nfcapd data file per 5 min time window. Each site has one or more routers, so wait for all of them to get dumped:

    -rw-r--r-- 1 user group 3085294 Sep 20 00:05 sunn-cr5/2018/09/20/nfcapd.201809200000
    -rw-r--r-- 1 user group 1999656 Sep 20 00:05 sacr-cr5/2018/09/20/nfcapd.201809200000
    -rw-r--r-- 1 user group   14753 Sep 20 00:05 nersc-mr2//2018/09/20/nfcapd.201809200000
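The "wait for all of them" step can be sketched as a simple presence check. This is only an illustration, not the CEASE code: the directory layout (router/YYYY/MM/DD/nfcapd.YYYYMMDDhhmm) is read off the listing on the slide, while the function names and the `root` argument are hypothetical.

```python
from pathlib import Path


def window_files(root, routers, stamp):
    """Expected nfcapd path for each router, for one 5 minute window.

    `stamp` is the window label used in the file name, e.g. "201809200000".
    The layout below is inferred from the slide's listing (an assumption).
    """
    year, month, day = stamp[:4], stamp[4:6], stamp[6:8]
    return [Path(root) / router / year / month / day / f"nfcapd.{stamp}"
            for router in routers]


def all_present(root, routers, stamp):
    """True only when every router of the site has dumped its file."""
    return all(path.is_file() for path in window_files(root, routers, stamp))


# Routers taken from the listing; the root directory is a placeholder.
for path in window_files("/data/netflow", ["sunn-cr5", "sacr-cr5", "nersc-mr2"],
                         "201809200000"):
    print(path)
```

Processing for a site would start only once `all_present` returns True for that window.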

Data Flow Overview: Extract Desired Fields

Convert to minimum data fields - raw time series:

    nfdump -r file -B -q -o "fmt:%ts,%pr" "as 2936"

    2018-09-19 23:59:52.880,TCP
    2018-09-19 23:59:52.900,TCP
    2018-09-19 23:59:53.990,TCP
    2018-09-19 23:59:54.650,TCP
    2018-09-19 23:59:54.870,UDP
    2018-09-19 23:59:54.920,TCP
    2018-09-19 23:59:54.920,TCP
    2018-09-19 23:59:54.920,UDP
    2018-09-19 23:59:54.990,TCP
    2018-09-19 23:59:55.910,TCP
    2018-09-19 23:59:55.980,ICMP
    2018-09-19 23:59:55.990,UDP

CEASE SCRATCH/ RAND.data

Data Flow Overview: Build Data Units

Summarize into 1 sec data units - discrete time series. The raw time series file:

    2018-09-19 23:59:52.880,TCP
    2018-09-19 23:59:52.900,TCP
    2018-09-19 23:59:53.990,TCP
    2018-09-19 23:59:54.650,TCP
    2018-09-19 23:59:54.870,UDP
    2018-09-19 23:59:54.920,TCP
    2018-09-19 23:59:54.920,TCP
    2018-09-19 23:59:54.920,UDP
    2018-09-19 23:59:54.990,TCP

becomes Data Unit(s) (DU), 1 unit / second:

    ts:   2018.
    TCP:  2910
    UDP:  312
    ICMP: 20
    IPv6: 314
    n:    3556
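A minimal sketch of this summarizing step, assuming the input is the "timestamp,protocol" text produced by the nfdump format string on the earlier slide. The function name and the dictionary layout are illustrative choices, not the CEASE implementation.

```python
from collections import Counter, defaultdict


def build_data_units(lines):
    """Group 'timestamp,protocol' records into one counter per whole second.

    Each data unit counts records per protocol, plus a total under "n".
    """
    units = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ts, proto = line.rsplit(",", 1)
        second = ts.split(".", 1)[0]  # drop the fractional seconds
        units[second][proto] += 1
        units[second]["n"] += 1
    return dict(units)


# Records copied from the slide.
sample = [
    "2018-09-19 23:59:52.880,TCP",
    "2018-09-19 23:59:52.900,TCP",
    "2018-09-19 23:59:53.990,TCP",
    "2018-09-19 23:59:54.650,TCP",
    "2018-09-19 23:59:54.870,UDP",
    "2018-09-19 23:59:54.920,TCP",
]
for second, counts in sorted(build_data_units(sample).items()):
    print(second, dict(counts))
```

The sample here counts individual records; whether the real data units count flows, packets or bytes is not stated on the slide.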

Data Flow Overview: Data Units into Database

Store data units in Database.

[Diagram: data units DU 1, DU 2, DU 3, DU 4, ... DU m, DU n going into the database. Example DU: ts: 2018., TCP: 2910, UDP: 312, ICMP: 20, IPv6: 314, n: 3556]

Each test/site/router combo gets its own data table:

    dB TABLE TEST lbl NOMODEL (ts, ESP, GRE, ICMP, ICMP6, IPv6, TCP, UDP, n);
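One way to picture the "one table per test/site/router combo" layout is the SQLite sketch below. The column list comes from the slide; the choice of SQLite, the underscore-joined table name and the helper names are assumptions made for illustration, since the talk does not name the database.

```python
import sqlite3

# Column list as shown on the slide.
PROTOCOLS = ("ESP", "GRE", "ICMP", "ICMP6", "IPv6", "TCP", "UDP")


def table_name(test, router, site, model="NOMODEL"):
    """Hypothetical naming scheme: parts joined with underscores."""
    name = "_".join((test, router, site, model)).replace("-", "_")
    if not name.isidentifier():
        raise ValueError(f"unsafe table name: {name!r}")
    return name


def create_table(conn, name):
    cols = ", ".join(f'"{p}" INTEGER DEFAULT 0' for p in PROTOCOLS)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{name}" '
                 f'(ts TEXT PRIMARY KEY, {cols}, n INTEGER)')


def store_unit(conn, name, ts, counts):
    """Insert one data unit; protocols missing from `counts` are stored as 0."""
    values = [ts] + [counts.get(p, 0) for p in PROTOCOLS] + [counts.get("n", 0)]
    marks = ", ".join("?" * len(values))
    conn.execute(f'INSERT OR REPLACE INTO "{name}" VALUES ({marks})', values)


conn = sqlite3.connect(":memory:")
name = table_name("PROTRATIO", "sunn-cr5", "nersc")
create_table(conn, name)
# Counts from the slide's example DU; the timestamp is made up.
store_unit(conn, name, "2018-09-19 23:59:54",
           {"TCP": 2910, "UDP": 312, "ICMP": 20, "IPv6": 314, "n": 3556})
print(conn.execute(f'SELECT ts, TCP, UDP, n FROM "{name}"').fetchall())
```

Table names cannot be bound as SQL parameters, which is why the sketch validates the name before interpolating it.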

Data Flow Overview: Sum Up Test Data

Database tables holding Data Units:

    PROTRATIO sunn cr5 nersc NOMODEL
    PROTRATIO nersc mr2 nersc NOMODEL
    PROTRATIO sacr cr5 nersc NOMODEL

[Diagram: when all routers are present, the per-router data units (U) are summed into the Finished Test Data Set]

Data Flow Overview: Lookup Model Data

[Diagram: a September 2018 calendar relating the Test Data Set to the Model Data Set]

Model Data Set: same weekday, previous 3 weeks.
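The "same weekday, previous 3 weeks" lookup amounts to stepping back in whole weeks from the test day. A small sketch, with a hypothetical function name:

```python
from datetime import date, timedelta


def model_dates(test_day, weeks=3):
    """Dates of the same weekday in each of the previous `weeks` weeks."""
    return [test_day - timedelta(weeks=k) for k in range(1, weeks + 1)]


# Test day taken from the nfcapd listing earlier in the talk.
for day in model_dates(date(2018, 9, 20)):
    print(day.isoformat(), day.strftime("%A"))
```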

Data Flow Overview: Build model
1. Overlap Model Data Sets
2. Break result into 1 hour time windows
3. Independent Linear Regression over each window for each protocol

Data Flow Overview: Model Building - Overlay

[Diagram: each model day is split into segments (Model Day 1: 1.1, 1.2, 1.3, 1.4, ... 1.x; Model Day 2: 2.1, 2.2, 2.3, 2.4, ... 2.x; Model Day 3: 3.1, 3.2, 3.3, 3.4, ... 3.x; Model Day y: y.1, y.2, y.3, y.4, ... y.x). The first segments are overlaid as "Model Data from hour 1". Each point is a DU with fields ts, TCP, UDP, ICMP, IPv6, n.]

Data Flow Overview: Windows

Each 1 Hour window gets its own model.

[Plot: % UDP against Time]

- Small Variance in Model data: Sensitive to slight differences in test data
- Large Variance in Model data: Tolerant to slight differences in test data

Data Flow Overview: Now with Real Data!

[Plot: Regression Solution over a Time Window]

Use simple linear regression to estimate the behavior of the combined data sets and use as a model for “normal”.
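The per-window regression can be sketched with an ordinary least-squares line fitted independently in each 1 hour window. This is an illustration under assumptions: points are (seconds since midnight, value) pairs pooled from all model days, one protocol at a time, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # needs at least two distinct x values
    return slope, mean_y - slope * mean_x


def fit_hourly_models(points):
    """Fit one independent line per 1 hour window.

    `points` holds (seconds_since_midnight, value) pairs; x is measured
    from the start of each window.
    """
    windows = {}
    for t, value in points:
        windows.setdefault(t // 3600, []).append((t % 3600, value))
    return {hour: fit_line(*zip(*pts)) for hour, pts in windows.items()}


# Made-up values: a rising trend in hour 0, a flat one in hour 1.
points = [(0, 1.0), (1, 3.0), (2, 5.0), (3600, 10.0), (3601, 10.0), (3602, 10.0)]
print(fit_hourly_models(points))
```

Running it once per protocol gives the "independent regression over each window for each protocol" arrangement described in the talk.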

Data Flow Overview: Run the Model

[Diagram: Test Data H1, H2, H3 compared against Model H1, H2, H3]

How many σ is test data y0 from model y?
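One plausible reading of "how many σ" is the distance of a test value from the model line, divided by the spread of the model data around that line. The talk does not say how σ is computed, so the residual standard deviation used below is an assumption, as are the function names.

```python
import math


def residual_sigma(xs, ys, slope, intercept):
    """Standard deviation of the model data around the fitted line.

    Uses n - 2 degrees of freedom, the usual choice for a two-parameter fit.
    """
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))


def sigma_distance(x0, y0, slope, intercept, sigma):
    """How many sigma the test value y0 lies from the model value at x0."""
    return abs(y0 - (slope * x0 + intercept)) / sigma


# Made-up model data scattered around a flat line y = 10.
xs, ys = [0, 1, 2, 3], [9.0, 11.0, 9.0, 11.0]
sigma = residual_sigma(xs, ys, slope=0.0, intercept=10.0)
print(sigma_distance(2, 14.0, 0.0, 10.0, sigma))
```

A window with a small model variance gives a small sigma, so the same absolute deviation scores higher, which matches the sensitivity noted on the Windows slide.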

Data Flow Overview

Two basic takeaways:
1. We use the simplest first order linear regression because making things more complicated did not significantly reduce errors (n 1,2,3).
2. By making the math as simple as possible it is simpler to see trends/errors as well as focus on what the data is telling us.

CEASE v.2 Architecture

Same set of analysis with the following architectural changes:
- All (non raw netflow) data stored in database.
- Analysis broken into discrete/independent stages.
- Use RabbitMQ messaging bus.
- No stage communicates with another except via the messaging bus.
- Meaningful data passed to logging system.
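The "stages talk only through the bus" rule can be illustrated without a broker. The sketch below uses a tiny in-memory stand-in rather than RabbitMQ, and the topic names ("raw", "summary") and stage functions are hypothetical; it only shows the shape of the design, where each stage reads from one topic and publishes to another and never calls its neighbour directly.

```python
from collections import Counter, defaultdict, deque


class InMemoryBus:
    """Stand-in for a message broker: one FIFO queue per topic name."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def publish(self, topic, message):
        self._queues[topic].append(message)

    def drain(self, topic):
        """Yield and remove every message currently queued on a topic."""
        queue = self._queues[topic]
        while queue:
            yield queue.popleft()


def input_stage(bus, lines):
    """Inject structured text records into the bus."""
    for line in lines:
        bus.publish("raw", line)


def normalize_stage(bus):
    """Turn raw records into 1 minute summaries and publish them."""
    per_minute = defaultdict(Counter)
    for line in bus.drain("raw"):
        ts, proto = line.rsplit(",", 1)
        per_minute[ts[:16]][proto] += 1  # "YYYY-MM-DD hh:mm"
    for minute in sorted(per_minute):
        bus.publish("summary", {"minute": minute, **per_minute[minute]})


bus = InMemoryBus()
# Sample records; the last one is made up to cross a minute boundary.
input_stage(bus, ["2018-09-19 23:59:52.880,TCP",
                  "2018-09-19 23:59:54.870,UDP",
                  "2018-09-20 00:00:01.120,TCP"])
normalize_stage(bus)
summaries = list(bus.drain("summary"))
print(summaries)
```

Because the stages share nothing but topic names, either one could be replaced or run on another host once the stand-in is swapped for a real broker.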

CEASE v.2 Architecture

Spawn of CEASE v.1

CEASE v.2 Data Input

[Diagram labels: Spawn of CEASE v.1; log consume; cease logger]

- New file detection and translation from raw nfcapd to structured text.
- Inject processed data into message bus.
- Data loop and model building.
- Simple data flow control

CEASE v.2 Data Normalize

[Diagram labels: Spawn of CEASE v.1; test complete; cease logger]

- Convert raw time series data to 1 minute summaries.
- Take summary data and insert into database - natural reduction of data to predictable, easy to handle volume/rate.
- Signal router data completion to box 3.

CEASE v.2 Anomaly Detection

[Diagram labels: Spawn of CEASE v.1; anom score; model complete; cease logger]

- Track router -- site mapping.
- Inject processed data into message bus.
- Extract old model from dB, then process test data.
- Simple linear regression modeling for identifying patterns of large S.D.

CEASE v.2 Correlation

[Diagram labels: Spawn of CEASE v.1; cease logger]

- New - moving some complex code from A.D. into this layer.

CEASE v.2 Data Output

[Diagram label: Spawn of CEASE v.1]

- CEASE v.1 used stdout for reporting - rework everything to go through the python logger and attach to a RMq exchange/queue.
- (todo) Build a simple API for extracting results and attach to 4433/tcp w/ TLS.

CEASE: Results

G1 Site Netflow producer/consumer ratio test

Stable P/C ratios for NERSC, LLNL, BNL, LBL for a 24 hour period.

CEASE: Results

G2 Site Netflow protocol ratios

[Screenshot of test output, partly illegible: per-protocol (TCP, UDP) lines with sd values 5.0034, 4.9692, 3.2250, 3.2791, 6.7495, 12.6982, 4.5887, 3.6638, 3.6653, 3.7310 and 4.1528, followed by "Threshold EXCEEDED" and "WINDOW THRESHOLD EXCEEDED" lines.]

(D)DOS Futures

Denial of service attacks can be broken down into a small number of groupings which can be directed against both internal ESnet infrastructure as well as end sites.
1. (Distributed) Volumetric: (UDP: ntp/dns/upnp/fragment, TCP SYN flood, ICMP)
2. Distributed Application: TCP 0 win, slow HTTP request, evil query, hash crash etc
3. Single host Application: SYN scan, etc

(D)DOS Detection

From a site perspective we have:

ESnet / Site detection - look for unexpected bursts in high risk traffic.
1. (Distributed) Volumetric: (UDP: ntp/dns/upnp/fragment, TCP SYN flood, ICMP)

Site detection: application layer attacks outside scope of WAN detection.
2. Distributed Application: TCP 0 win, slow HTTP request, evil query, hash crash etc
3. Single host Application: SYN scan, etc

(D)DOS Mitigation: Passive

Volumetric attacks rely on traffic volumes for protocols that should never normally be seen. We propose building rate limits into external peerings and internal site transit for protocols that are traditional DDOS participants.

Monitor flow rates for these protocols and alarm if the values are too high (but less than the rate limit) - this will allow for manual adjustment.
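The alarm-below-the-limit idea can be sketched as a three-way classification of an observed rate. The function name, the state labels and the example numbers are all hypothetical; the talk gives no concrete thresholds.

```python
def classify_rate(observed, alarm_level, rate_limit):
    """Classify an observed flow rate for one protocol.

    `alarm_level` must sit below `rate_limit`, so that an alarm fires
    before the enforced limit is reached and leaves room for manual
    adjustment. All three values share one unit (e.g. packets per second).
    """
    if not 0 < alarm_level < rate_limit:
        raise ValueError("alarm_level must be positive and below rate_limit")
    if observed >= rate_limit:
        return "rate-limited"
    if observed >= alarm_level:
        return "alarm"
    return "ok"


# Made-up numbers for a protocol such as NTP over UDP.
for observed in (200, 8_500, 12_000):
    print(observed, classify_rate(observed, alarm_level=8_000, rate_limit=10_000))
```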

(D)DOS Mitigation: Active

The two best tools that we have for mitigation purposes are:
- IP BHR: Good for large number of IP addresses
- BGP FlowSpec: Much finer level of control and options for matching. Significantly less capacity than BHR

Questions or Comments?

CEASE v.1 Goal 4

Topology/BGP Monitoring and Alerting

This section is still under development. Questions we intend to answer:
- Has anyone stolen our cheese? (BGPmon)
- Are we advertising cheese that should not be shared?
- Is anybody eating stolen cheese?
- Am I walking through the neighbor's root cellar to get to my cheese box in the room next to me?

CEASE: Results

G2 Site Netflow protocol ratios

[Plot: Sigma Count against Time, with an Interesting Threshold and an Auto Alarm Threshold marked. Annotations: "3 interesting then alarm"; "5 interesting clear window counters".]
