
Securing Your Big Data Environment

Ajit Gaddam
ajit@root777.com

Abstract

Security and privacy issues are magnified by the volume, variety, and velocity of Big Data. The diversity of data sources, formats, and data flows, combined with the streaming nature of data acquisition and the high volumes involved, creates unique security risks.

This paper details the security challenges organizations face when they start moving sensitive data to a Big Data repository such as Hadoop. It identifies the relevant threat models and a security control framework to address and mitigate the security risks arising from the identified threat conditions and usage models. The framework outlined in this paper is also meant to be distribution agnostic.

Keywords: Hadoop, Big Data, enterprise, defense, risk, Big Data Reference Framework, Security and Privacy, threat model

1 Introduction

The term "Big Data" refers to the massive amounts of digital information that companies collect. Industry estimates put the growth rate of data at roughly doubling every two years, from 2,500 exabytes in 2012 to 40,000 exabytes in 2020 [1]. Big Data is not a specific technology; it is a collection of attributes and capabilities.

NIST defines Big Data as follows [2]: Big Data consists of extensive datasets, primarily in the characteristics of volume, velocity, and/or variety, that require a scalable architecture for efficient storage, manipulation, and analysis.

Securosis research [3] adds further characteristics that an environment must have to qualify as 'Big Data':
1. It handles a petabyte of data or more
2. It has distributed, redundant data storage
3. It can leverage parallel task processing
4. It provides data processing (MapReduce or equivalent) capabilities
5. It has extremely fast data insertion
6. It has central management and orchestration
7. It is hardware agnostic
8. It is extensible: its basic capabilities can be augmented and altered

Security and privacy issues are magnified by the volume, variety, and velocity of Big Data. The diversity of data sources, formats, and data flows, combined with the streaming nature of data acquisition and high volume, creates unique security risks.

It is not merely the existence of large amounts of data that creates new security challenges for organizations; Big Data has been collected and utilized by enterprises for several decades. Software infrastructures such as Hadoop now enable developers and analysts to easily leverage hundreds of computing nodes to perform data-parallel computing, which was not possible before. As a result, new security challenges arise from the coupling of Big Data with heterogeneous compositions of commodity hardware, commodity operating systems, and commodity software infrastructures for storing and computing on data. As Big Data expands across enterprises, traditional security mechanisms tailored to securing small-scale, static data and data flows on firewalled and semi-isolated networks are inadequate. Similarly, it is unclear how to retrofit provenance into an enterprise's existing infrastructure. Throughout this document, unless explicitly called out, Big Data refers to the Hadoop framework and its common NoSQL variants (e.g. Cassandra, MongoDB, CouchDB, Riak).

This paper details the security challenges organizations face when they start moving sensitive data to a Big Data repository such as Hadoop. It provides the relevant threat models and the security control framework to address and mitigate the risk due to the identified security threats.
In the following sections, the paper describes in detail the architecture of the modern Hadoop ecosystem and identifies the security weaknesses of such systems. We then identify the threat conditions associated with them and their threat models. The paper concludes the analysis by providing a reference security framework for an enterprise Big Data environment.

2 Hadoop Security Weaknesses

Traditional Relational Database Management System (RDBMS) security has evolved over the years, with many 'eyeballs' assessing it through various security evaluations. Hadoop security has not undergone the same level of rigor or evaluation, and can therefore claim little assurance for the security it implements.

Another big challenge is that, today, there is no standardization or portability of security controls between the different Open Source Software (OSS) projects and the different Hadoop or Big Data vendors. Hadoop security is completely fragmented. This is true even when those parties implement the same security feature for the same Hadoop component. Vendors and OSS parties force-fit security into the Apache Hadoop framework.

2.1 Top 10 Security & Privacy Challenges

Figure 1: CSA classification of the Top 10 Challenges

The Cloud Security Alliance (CSA) Big Data Security Working Group has compiled the following Top 10 security and privacy challenges to overcome in Big Data [4]:
1. Secure computations in distributed programming frameworks
2. Security best practices for non-relational data stores
3. Secure data storage and transaction logs
4. End-point input validation/filtering
5. Real-time security monitoring
6. Scalable privacy-preserving data mining and analytics
7. Cryptographically enforced data-centric security
8. Granular access control
9. Granular audits
10. Data provenance

The above challenges were grouped into four broad components by the Cloud Security Alliance:
- Infrastructure Security: secure computations in distributed programming frameworks; security best practices for non-relational data stores
- Data Privacy: scalable privacy-preserving data mining and analytics; cryptographically enforced data-centric security; granular access control
- Data Management: secure data storage and transaction logs; granular audits; data provenance
- Integrity & Reactive Security: end-point input validation/filtering; real-time security monitoring

2.2 Additional Security Weaknesses

The Cloud Security Alliance list above is an excellent start, and this research adds to it significantly. Where possible, effort has been made to map back to the categories identified in the CSA work. This section lists some additional security weaknesses associated with Open Source Software (OSS) like Apache Hadoop. It is meant to give the reader an idea of the possible attack surface; it is not meant to be exhaustive, and subsequent sections add to it.

Infrastructure Security & Integrity
- The Common Vulnerabilities and Exposures (CVE) database shows only four reported and fixed Hadoop vulnerabilities over the past three years. Software, even Hadoop, is far from perfect. This could reflect either that the security community is not active here, or that most vulnerability remediation happens internally within the vendor environments themselves, with no public reporting.
- Hadoop security configuration files are not self-contained, and there are no validity checks before such policies are deployed. This usually results in data integrity and availability issues.

Identity & Access Management
- Role Based Access Control (RBAC) policy files and Access Control Lists (ACLs) for components like MapReduce and HBase are usually configured via clear-text files. These files are editable by privileged accounts on the system, such as root and other application accounts.

Data Privacy & Security
- All the issues associated with SQL injection attacks do not go away; they move with Hadoop components like Hive and Impala. SQL prepare functions, which would have enabled separation of the query and the data, are currently not available (see the parameterized-query sketch below, after Figure 2).
- Native cryptographic controls for sensitive data protection are lacking. Frequently, such security is provided outside the data or application stack.
- Clear-text data might be sent in DataNode-to-DataNode communication, since data locality cannot be strictly enforced and the scheduler might not be able to find resources next to the data, forcing it to read data over the network.

3 Big Data Security Framework

The following section provides the target security architecture framework for Big Data platform security. The core components of the proposed Big Data Security Framework are:
1. Data Management
2. Identity & Access Management
3. Data Protection & Privacy
4. Network Security
5. Infrastructure Security & Integrity

These '5 pillars' of the Big Data Security Framework are further decomposed into 21 sub-components, each of which is critical to ensuring security and mitigating the security risks and threat vectors to the Big Data stack. The overall security framework is shown in Figure 2.

Figure 2: Big Data Security Framework
- Data Management: Data Classification; Data Discovery; Data Tagging
- Identity & Access Management: AD, LDAP, Kerberos; Data Metering; User Entitlement; RBAC Authorization (server-, database-, table-, and view-based authorization)
- Data Protection & Privacy: Data Masking / Data Redaction; Tokenization; Field-Level / Column-Level Encryption; Disk-Level Transparent Encryption; HDFS File/Folder Encryption; Data Loss Prevention
- Network Security: packet-level encryption in-cluster (NameNode-JobTracker-DataNode and NameNode-to-other management nodes) over SSL/TLS; packet-level encryption client-to-cluster over SSL/TLS; packet-level encryption in-cluster (mapper-reducer) over SSL/TLS; Network Security Zoning
- Infrastructure Security & Integrity: Logging / Audit; Security-Enhanced Linux (SELinux); File Integrity / Data Tamper Monitoring; Privileged User & Activity Monitoring
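To make the Hive/Impala injection weakness from Section 2.2 concrete, the following is a minimal sketch (not part of the original paper) contrasting string-concatenated HiveQL with driver-side parameter binding. It assumes the third-party PyHive client and a placeholder HiveServer2 endpoint; because HiveServer2 lacks true server-side prepared statements, the binding shown is client-side escaping by the driver, which is exactly the gap called out above.

    # Illustrative sketch only: assumes the PyHive client (pip install pyhive)
    # and a reachable HiveServer2 at a placeholder host/port.
    from pyhive import hive

    conn = hive.Connection(host="hive-gateway.example.com", port=10000, username="analyst")
    cursor = conn.cursor()

    user_supplied = "ACME' OR '1'='1"  # hostile input

    # Unsafe: query text and data are mixed, so the input above rewrites the query logic.
    cursor.execute("SELECT cust_id FROM customers WHERE region = '%s'" % user_supplied)

    # Safer: let the driver bind the value. This is client-side escaping rather than a
    # server-side prepared statement, which is the limitation noted in Section 2.2.
    cursor.execute(
        "SELECT cust_id FROM customers WHERE region = %(region)s",
        {"region": user_supplied},
    )
    print(cursor.fetchall())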

3.1 Data Management

The Data Management component is decomposed into three core sub-components: Data Classification, Data Discovery, and Data Tagging.

3.1.1 Data Classification

Effective data classification is probably one of the most important activities that can in turn lead to effective security control implementation in a Big Data platform. When organizations deal with an extremely large amount of data, being able to clearly identify which data matters, which data needs cryptographic protection, and which fields need to be prioritized first for protection more often than not determines the success of a security initiative on this platform.

The following are the core items that have been developed over time and can lead to a successful data classification matrix for your environment:
1. Work with your legal, privacy office, Intellectual Property, Finance, and Information Security teams to determine all distinct data fields. An open bucket like "health data" is not sufficient. This exercise encourages the reader to go beyond a symbolic, policy-level exercise.
2. Perform a security control assessment exercise.
   a. Determine the location of the data (e.g. exposed to the internet, secure data zone)
   b. Determine the number of users and systems with access
   c. Determine the available security controls (e.g. can it be protected cryptographically)
3. Determine the value of the data to an attacker.
   a. Is the data easy to resell on the black market?
   b. Do you have valuable Intellectual Property (e.g. a nation state looking for nuclear reactor blueprints)?
4. Determine compliance and revenue impact.
   a. Determine breach reporting requirements for all the distinct fields
   b. Does loss of a particular data field prevent you from doing business (e.g. cardholder data)?
   c. Estimate the re-architecting cost for current systems (e.g. buying new security products)
   d. Account for other costs, such as more frequent auditing, fines and judgements, and legal expenses related to compliance
5. Determine the impact to the owner of the PII data (e.g. a customer).
   a. Does loss of the field enable phishing attacks (e.g. email), or can it simply be replaced (e.g. loss of a credit card)?

The following figure is a sample representation of certain Personally Identifiable Data fields.

Figure 3: Data Classification Matrix

3.1.2 Data Discovery

The lack of situational awareness with respect to sensitive data could leave an organization exposed to significant risk. Identifying whether sensitive data is present in Hadoop and where it is located, and subsequently triggering the appropriate data protection measures, such as data masking, data redaction, tokenization or encryption, is key.
- For structured data going into Hadoop, such as relational data from databases or, for example, comma-separated values (CSV) or JavaScript Object Notation (JSON)-formatted files, the location and classification of sensitive data may already be known. In this case, the protection of those columns or fields can occur programmatically, for example with a labeling engine that assigns visibility labels/cell-level security to those fields.
- With unstructured data, the location, count and classification of sensitive data becomes much more difficult. Data discovery, where sensitive data can be identified and located, becomes an important first step in data protection.

The following items are crucial for an effective data discovery exercise in your Big Data environment:
1. Define and validate the data structure and schema. This is all useful preparatory work for data protection activities later.
2. Collect metrics (e.g. volume counts, unique counts, etc.). For example, if a file has 1M records but they are duplicates of a single person, it is a single record rather than 1M records. This is very useful for compliance, but more importantly for risk management.
3. Share this insight with your Data Science teams so they can build threat models and profiles, which will be useful in data exfiltration prevention scenarios.
4. If you discover sequence files, work with your application teams to move away from this data structure. Instead, leverage columnar storage formats such as Apache Parquet where possible, regardless of the data processing framework, data model, or programming language.
5. Build conditional search routines (e.g. only report on a date of birth if a person's name is found, or a credit card number together with CVV or zip code).
6. Account for use cases where sensitive data has already been cryptographically protected (e.g. encrypted or tokenized), and determine what role the discovery solution plays then.
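As a hedged illustration of items 2 and 5 above (metrics collection and conditional search routines), the sketch below scans a delimited extract using only the Python standard library. The regular expressions, field names, and file path are placeholders rather than anything prescribed by the paper, and a production discovery tool would need far stronger detection logic.

    # Sketch only: report card-number-like values only when a name-like value is
    # present in the same record, and track unique counts rather than raw hits.
    import csv
    import re

    CARD_RE = re.compile(r"\b\d{13,16}\b")               # naive PAN pattern, illustrative
    NAME_RE = re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$")   # naive "First Last" pattern

    def scan(path):
        unique_values, rows, conditional_hits = set(), 0, 0
        with open(path, newline="") as fh:
            for record in csv.DictReader(fh):
                rows += 1
                has_name = any(NAME_RE.match(v or "") for v in record.values())
                for value in record.values():
                    for pan in CARD_RE.findall(value or ""):
                        if has_name:                     # conditional search routine
                            conditional_hits += 1
                            unique_values.add(pan)
        # Unique counts matter: 1M duplicate rows about one person are one record of risk.
        return {"rows": rows, "conditional_hits": conditional_hits,
                "unique_sensitive_values": len(unique_values)}

    print(scan("/data/landing/customers_extract.csv"))   # placeholder path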

3.1.3 Data Tagging

Understand the end-to-end data flows in your Big Data environment, especially the ingress and egress methods.
1. Identify all the data ingress methods into your Big Data cluster. These include all manual methods (e.g. Hadoop admins), automated methods (e.g. ETL jobs), and those that go through some meta-layer (e.g. copy files or create/write).
2. Knowing whether data is coming in via the command line interface, through a Java API, through a Flume or Sqoop import, or whether it is being SSH'd in is important.
3. Similarly, follow the data out and identify all the egress components of your Big Data environment.
4. Whether reporting jobs are run through Hive queries (e.g. via ODBC/JDBC), through Pig jobs (e.g. reading files, Hive tables, or HCatalog), exported via Sqoop, or copied via REST APIs, Hue, etc. will determine your control boundaries and trust zones.
5. All of the above will also help in the data discovery activity and other data access management exercises (e.g. implementing RBAC, ABAC, etc.).

3.2 Identity & Access Management

POSIX-style permissions in secure HDFS are the basis for many access controls across the Hadoop stack.

3.2.1 RBAC Authorization

Deliver fine-grained authorization through Role Based Access Control (RBAC):
- Manage data access by role, not by user.
- Determine relationships between users and roles through groups. Leverage AD/LDAP group membership and enforce rules across all data access paths.
- Provide users access to data by centrally managing access policies. It is important to tie policy to the data and not to the access method.
- Leverage Attribute Based Access Control (ABAC) and protect data based on tags that move with the data through lineage; permission decisions can leverage user, environment (e.g. location), and data attributes.

3.2.2 User Entitlement & Data Metering

Perform data metering by restricting access to data once a normal threshold (as determined by access models and machine learning algorithms) is exceeded for a particular user or application.

3.3 Data Protection & Privacy

The majority of the Hadoop distributions and vendor add-ons package data-at-rest encryption at either a block or (whole) file level. Application-level cryptographic protection (such as tokenization and field-level encryption) and data redaction/masking provide the next level of security needed.

3.3.1 Application Level Cryptography (Tokenization, Field-Level Encryption)

While encryption at the field/element level can offer security granularity and audit tracking capabilities, it comes at the expense of requiring manual intervention to determine the fields that require encryption and where and how to enable authorized decryption.
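To ground the application-level controls in Section 3.3.1, here is a minimal sketch, assuming the third-party 'cryptography' package, of field-level encryption alongside a toy in-memory tokenization table. The field names are placeholders, and a real deployment would pull keys from an HSM-backed key manager and store tokens in a hardened vault, neither of which is shown.

    # Sketch only: protect individual sensitive fields before they land in Hadoop.
    # Assumes `pip install cryptography`; key handling is deliberately simplistic.
    from cryptography.fernet import Fernet
    import secrets

    key = Fernet.generate_key()        # in practice, fetched from a key manager / HSM
    fernet = Fernet(key)

    record = {"name": "Jane Doe", "ssn": "123-45-6789", "card": "4111111111111111"}

    # Field-level encryption: only the sensitive column is transformed.
    record["ssn"] = fernet.encrypt(record["ssn"].encode()).decode()

    # Toy tokenization: replace the value with a random surrogate and keep the
    # mapping in a vault-like lookup table (here just a dict).
    token_vault = {}

    def tokenize(value):
        token = "tok_" + secrets.token_hex(8)
        token_vault[token] = value
        return token

    record["card"] = tokenize(record["card"])
    print(record)   # downstream jobs see ciphertext and tokens, not clear text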

3.3.2 Transparent Encryption (Disk / HDFS Layer)

Full Disk Encryption (FDE) prevents access via the storage medium. File encryption can also guard against (privileged) access at the node's operating-system level.
- If you need to store and process sensitive or regulated data in Hadoop, data-at-rest encryption protects your organization's sensitive data and keeps at least the disks out of audit scope.
- In larger Hadoop clusters, disks often need to be removed from the cluster and replaced. Disk-level transparent encryption ensures that no human-readable residual data remains when data is removed or when disks are decommissioned.
- Full-disk encryption can also be OS-native disk encryption, such as dm-crypt.

3.3.3 Data Masking / Data Redaction

Data masking or data redaction in the typical ETL process de-identifies personally identifiable information (PII) before load. Therefore, no sensitive data is stored in Hadoop, keeping the Hadoop cluster potentially out of (audit) scope.
- This may be performed in batch or real time and can be achieved with a variety of designs, including the use of static and dynamic data masking tools, as well as through data services.

3.4 Network Security

The Network Security layer is decomposed into four sub-components: three data protection in-transit (packet-level encryption) components and network security zoning.

3.4.1 Data Protection In-Transit

Secure communications are required for HDFS to protect data in transit. Multiple threat scenarios mandate the use of TLS/https to counter the information disclosure and elevation of privilege threat categories. Use the TLS protocol (which is now available in all Hadoop distributions) to authenticate and ensure the privacy of communications between nodes, name servers, and applications.
- An attacker can gain unauthorized access to data by intercepting communications to Hadoop consoles.
- Communication between NameNodes and DataNodes that travels in the clear back to the Hadoop clients can allow credentials and data to be sniffed.
- Tokens granted to the user after Kerberos authentication can also be sniffed and used to impersonate users on the NameNode.

The following are the controls that, when implemented in a Big Data cluster, help ensure data confidentiality:
1. Packet-level encryption using TLS from the client to the Hadoop cluster.
2. Packet-level encryption using TLS within the cluster itself. This includes using https between the NameNode, Job Tracker, and DataNodes.
3. Packet-level encryption using TLS in the cluster (e.g. mapper to reducer).
4. Use LDAP over SSL (LDAPS) rather than LDAP when communicating with the corporate enterprise directories to prevent sniffing attacks.
5. Allow your admins to configure and enable encrypted shuffle and TLS/https for HDFS, MapReduce, YARN, HBase UIs, etc.
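As a small operational companion to the controls listed above, the following standard-library sketch checks that an endpoint, such as a NameNode HTTPS UI or an LDAPS port, actually negotiates TLS and presents a certificate the local trust store accepts. The host names and ports are placeholders, not values taken from the paper.

    # Sketch only: confirm an endpoint negotiates TLS with a certificate we trust.
    import socket
    import ssl

    def check_tls(host, port):
        ctx = ssl.create_default_context()        # uses the system trust store
        with socket.create_connection((host, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return tls.version(), tls.getpeercert().get("subject")

    # Placeholder endpoints: a NameNode HTTPS UI and an LDAPS directory port.
    for endpoint in [("namenode.example.com", 50470), ("ldap.example.com", 636)]:
        try:
            version, subject = check_tls(*endpoint)
            print(endpoint, "OK", version, subject)
        except (ssl.SSLError, OSError) as err:
            print(endpoint, "FAILED:", err)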

3.4.2 Network Security Zoning

The Hadoop clusters must be segmented into points of delivery (PODs), with chokepoints such as Top of Rack (ToR) switches where network Access Control Lists (ACLs) limit the allowed traffic to approved levels.
- End users must not be able to connect to the individual DataNodes, but only to the NameNodes.
- The Apache Knox gateway, for example, provides the capability to control traffic into and out of Hadoop at per-service-level granularity.
- A basic firewall should allow access only to the Hadoop NameNode or, where sufficient, to an Apache Knox gateway. Clients should never need to communicate directly with, for example, a DataNode.

3.5 Infrastructure Security & Integrity

The Infrastructure Security & Integrity layer is decomposed into four core sub-components: Logging/Audit, Security-Enhanced Linux (SELinux), File Integrity / Data Tamper Monitoring, and Privileged User and Activity Monitoring.

3.5.1 Logging / Audit

All system and ecosystem changes unique to Hadoop clusters need to be audited, and the audit logs themselves must be protected. Examples include:
- Addition or deletion of data and management nodes
- Changes in management node states, including job tracker nodes and name nodes
- Pre-shared secrets or certificates that are rolled out when the initial package of the Hadoop distribution or of the security solution is pushed to the node, which prevent the addition of unauthorized cluster nodes

When data is not limited to one of the core Hadoop components, Hadoop data security ends up having many moving parts and a high degree of fragmentation. Consequently, metadata and audit logs sprawl across all of these fragments.

In a typical enterprise, DBAs typically carry the security responsibility at the table, row, column, or cell level, system administrators handle file system configuration, and the Security Access Control team is usually accountable for the more granular file-level permissions. Yet in Hadoop, POSIX-style HDFS permissions are frequently important for data security, or are at times the only means to enforce data security at all. This leads to questions concerning the manageability of Hadoop security. Technology recommendations to address data fragmentation:
- Apache Falcon is an incubating Apache OSS project that focuses on data management. It provides graphical data lineage and actively controls the data life cycle. Metadata is retrieved and mashed up from wherever the Hadoop application stores it.
- Cloudera Navigator is a proprietary tool and GUI that is part of Cloudera's Distribution Including Apache Hadoop (CDH). Navigator addresses log sprawl, lineage, and some aspects of data discovery. Metadata is retrieved and mashed up from wherever the Hadoop application stores it.
- Zettaset Orchestrator is a product for harnessing the overall fragmentation of Hadoop security with a proprietary combined GUI and workflow. Zettaset has its own metadata repository where metadata from all Hadoop components is collected and stored.

3.5.2 Security-Enhanced Linux (SELinux)

SELinux was created by the United States National Security Agency (NSA) as a set of patches to the Linux kernel using Linux Security Modules (LSM). It was eventually released by the NSA under the GPL license and has been adopted by the upstream Linux kernel.

SELinux is an example of Mandatory Access Control (MAC) for Linux. Historically, Hadoop and other Big Data platforms built on top of Linux and UNIX systems have had discretionary access control, which means, for example, that a privileged user like root is omnipotent.
- By enforcing and configuring SELinux in your Big Data environment, there is a MAC policy that is administratively set and fixed.
- Even if a user changes the settings on their home directory, the policy prevents another user or process from accessing it.
- A sample policy is to make library files executable but not writable (or vice versa), or to let jobs write to the /tmp location but not execute anything there. This is a great way to prevent command injection attacks, among others.
- With policies configured, even if a sysadmin or a malicious user gains root access over SSH or some other attack vector, they may be able to read and write a great deal, but they will not be able to execute anything, including potential data exfiltration tools.

The general recommendation is to run SELinux in permissive mode with regular workloads on your cluster, reflecting typical usage and including any tools in use. The warnings generated can then be used to define the SELinux policy, which, after tuning, can be deployed in a 'targeted enforcement' mode.
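Following the permissive-then-enforcing rollout recommended above, a simple node check can confirm which mode each host is actually running. The sketch below is illustrative only; it assumes a Linux host with the standard getenforce utility on the PATH and is not part of the paper's framework.

    # Sketch only: flag cluster nodes that silently fell back to Permissive or Disabled.
    import shutil
    import subprocess

    def selinux_mode():
        if shutil.which("getenforce") is None:
            return "unavailable (getenforce not found)"
        result = subprocess.run(["getenforce"], capture_output=True, text=True, check=False)
        return result.stdout.strip() or "unknown"

    mode = selinux_mode()
    print("SELinux mode:", mode)
    if mode != "Enforcing":
        print("WARNING: this node is not enforcing its SELinux policy")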

4 Final Recommendations

The following are some key recommendations to help mitigate the security risks and threats identified in the Big Data ecosystem.
1. Select products and vendors that have proven experience in similar-scale deployments. Request vendor references for large deployments (that is, similar in size to your organization) that have been running the security controls under consideration for your project for at least one year.
2. Accountability and balance are key pillars: balancing network-centric, access-control-centric, and data-centric security is absolutely critical to achieving a good overall trustworthy security posture.
3. Data-centric security, such as label security or cell-level security for sensitive data, is preferred. Label security and cell-level security are integrated into the data or into the application code rather than adding data security after the fact.
4. Externalize data security when possible and use data redaction, data masking or tokenization at the time of ingestion, or use data services with granular controls to access Hadoop.
5. Harness the log and audit sprawl with data management tools such as OSS Apache Falcon, Cloudera Navigator, or the Zettaset Orchestrator. This helps achieve data provenance in the long run.

5 Related Work

Many publications have been released in the recent past around Hadoop; however, there are very few to none around Big Data and Hadoop security. This is being remediated with the book Hadoop in Action, Second Edition from Manning Publications [5], set to be published towards the end of 2015. This book will integrate security as part of the overall Hadoop ecosystem, including greater depth and articulation of the concepts presented in this paper.

6 Conclusion

Hadoop and big data are no longer buzzwords in large enterprises. Whether for the correct reasons or not, enterprise data warehouses are moving to Hadoop, and along with them come petabytes of data.

In this paper we have laid the groundwork for conducting future security assessments on the Big Data ecosystem and securing it. This is to ensure that Big Data in Hadoop does not become a big problem or a big target. Vendors pitch their technologies as the magical silver bullet; however, there are many challenges when it comes to deploying security controls in your Big Data environment.

This paper also provides the Big Data threat model, which the reader can further expand and customize to their organizational environment. It also provides a target reference architecture around Big Data security and covers the entire control stack.

Hadoop and big data represent a greenfield opportunity for security practitioners. They provide a chance to get ahead of the curve and to test and deploy your tools, processes, patterns, and techniques before big data becomes a big problem.

References

[1] EMC, Big Data 2020, rse/iview/big-data-2020.htm

[2] NIST Special Publication 1500-1, NIST Big Data Interoperability Framework: Volume 1, Definitions, http://bigdatawg.nist.gov/uploadfiles/M0392 v13022325181.pdf

[3] Securosis, Securing Big Data: Security Issues with Hadoop Environments, g-datasecurity-issues-with-hadoop-environments

[4] Cloud Security Alliance, Top 10 Big Data Security and Privacy Challenges, itiatives/bdwg/Big Data Top Ten v1.pdf

[5] Hadoop in Action, Second Edition, Manning Publications. ISBN:
