Priya P. Sharma et al. / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014, 2126-2131

Securing Big Data Hadoop: A Review of Security Issues, Threats and Solution

Priya P. Sharma, Information Technology Department, SGGS IE&T, Nanded, India
Chandrakant P. Navdeti, Information Technology Department, SGGS IE&T, Nanded, India

Abstract— Hadoop projects treat security as a top agenda item, which is in turn classified as a critical item. From financial applications that are deemed sensitive to healthcare initiatives, Hadoop is traversing new territories that demand security-sensitive environments. With the growing acceptance of Hadoop, there is an increasing trend to incorporate more and more enterprise security features. In due course of time, we have seen Hadoop gradually develop to tackle the important issues pertaining to what we summarize as 3ADE (authentication, authorization, auditing, and encryption) within a cluster. There is no dearth of production environments that support Hadoop clusters. In this paper, we aim at studying "Big Data" security at the environmental level, along with probing the built-in protections and the Achilles heel of these systems. We also assess a few issues that we are dealing with today in securing contemporary Big Data, and proceed to propose security solutions and commercially available techniques to address them. Big Data is not only about the size of data; it also includes data variety and data velocity. Together, these three attributes form the three V's of Big Data.

Keywords— Big Data, SASL, delegation, sniffing, cell level, variety, unauthorized

I. INTRODUCTION

So, what exactly is "Big Data"? Put in simple words, it is described as mammoth volumes of data, which might be both structured and unstructured. Generally, it is so gigantic that it is a challenge to process using conventional database and software techniques. As witnessed in enterprise scenarios, three observations can be inferred:
1. The data is stupendous in terms of volume.
2. It moves at a very fast pace.
3. It outpaces the prevailing processing capacity.

The volumes of Big Data are on a roll, which can be inferred from the fact that as far back as the year 2012, a single dataset held a few dozen terabytes of data, a figure that has been catapulted to many petabytes today. To cater to the demands of the industry, new manifestos for manipulating "Big Data" are being commissioned.

Quick fact: 5 exabytes (1 exabyte = 1.1529 * 10^18 bytes) of data were created by humans until 2003. Today this amount of information is created in two days [8, 16]. In 2012, the digital world of data expanded to 2.72 zettabytes (10^21 bytes). It is predicted to double every two years, reaching about 8 zettabytes of data by 2015 [8, 16]. With the increase in data, there is a corresponding increase in the applications and frameworks that administer it. This gives rise to new vulnerabilities that need to be responded to.

[Fig. 1: Three V's of Big Data [17]]

Each of the V's represented in Fig. 1 is described below:

Volume, or the size of data, is at present larger than terabytes and petabytes. The data comes from machines, networks and human interaction on systems like social media, so the volume of data to be analysed is very huge [8].
Velocity defines the speed of data processing; it is required not only for big data but for all processes, and involves real-time processing and batch processing. Variety refers to the different types of data from many sources, both structured and unstructured. In the past, data was stored from sources like spreadsheets and databases. Now data comes in the form of emails, pictures, audio, videos, monitoring devices, PDFs, etc. This multifariousness of unstructured data creates problems for storing, mining and analysing the data [8]. To process the large volume of data from different sources quickly, Hadoop is used.

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment, running applications on systems with thousands of nodes holding thousands of terabytes of data [2]. Its distributed file system supports fast data transfer rates among nodes and allows the system to continue operating uninterrupted in times of node failure.
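To make the file-system model concrete, the following is a minimal sketch of reading a file through the public HDFS Java API; the NameNode address and file path are illustrative placeholders, not taken from the paper.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS tells the client where the NameNode lives;
        // the host and port here are placeholders for a real cluster.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // The NameNode resolves which DataNodes hold each block;
        // the client then streams the blocks directly from them.
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```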

Hadoop consists of a distributed file system, data storage and analytics platforms, and a layer that handles parallel computation, rate of flow (workflow) and configuration administration [8]. HDFS runs across the nodes in a Hadoop cluster and connects together the file systems on many input and output data nodes to make them into one big file system [2]. The present Hadoop ecosystem (as shown in Fig. 2) consists of the Hadoop kernel, MapReduce, the Hadoop distributed file system (HDFS) and a number of related components such as Apache Hive, HBase, Oozie, Pig and Zookeeper; these components are explained below [7, 8]:

- HDFS: A highly fault-tolerant distributed file system that is responsible for storing data on the clusters.
- MapReduce: A powerful parallel programming technique for distributed processing of vast amounts of data on clusters.
- HBase: A column-oriented distributed NoSQL database for random read/write access.
- Pig: A high-level data programming language for analyzing data of Hadoop computations.
- Hive: A data warehousing application that provides SQL-like access and a relational model.
- Sqoop: A project for transferring/importing data between relational databases and Hadoop.
- Oozie: An orchestration and workflow management system for dependent Hadoop jobs.

[Fig. 2: Hadoop Architecture]

The paper is organised as follows: In Section II we describe Big Data Hadoop's traditional security and also discuss its weaknesses and security threats. We describe various security issues in Section III. In Section IV we present our analysis of security solutions for each of the Hadoop components in tabular format, and Section V is an analysis of security technologies used to secure Hadoop. Finally, we conclude in Section VI.

II. BIG DATA HADOOP'S TRADITIONAL SECURITY

A. Hadoop Security Overview

Originally Hadoop was developed without security in mind: no security model, no authentication of users and services, and no data privacy, so anybody could submit arbitrary code to be executed. Although auditing and authorization controls (HDFS file permissions and ACLs) were used in earlier distributions, such access control was easily evaded because any user could impersonate any other user. Because impersonation was frequent and practised by most users, the security control measures that did subsist were not very effective. Later, authorization and authentication were added, but they too had weaknesses. Because there were very few security control measures within the Hadoop ecosystem, many accidents and security incidents happened in such environments. Well-meaning users can make mistakes (e.g. deleting massive amounts of data within seconds with a distributed delete). All users and programmers had the same level of access privileges to all the data in the cluster, any job could access any of the data in the cluster, and any user could read any data set [4]. Because MapReduce had no concept of authentication or authorization, a mischievous user could lower the priorities of other Hadoop jobs in order to make his own job complete faster or be executed first, or worse, he could kill the other jobs.

Hadoop is an entire ecosystem of applications, involving Hive, HBase, Zookeeper, Oozie and the JobTracker, and not just a single technology. Each of these applications requires hardening. To add security capabilities into a big data environment, functions need to scale with the data. Supplementary security does not scale well, and simply cannot keep up [6].
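To illustrate the impersonation weakness described above: under the original "simple" authentication mode, the Hadoop client simply asserts an identity and the cluster believes it. The sketch below, assuming a cluster with security disabled and placeholder host and path names, shows how the stock client API permits acting as another user.

```java
import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class SimpleAuthImpersonation {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        // With simple authentication the client-side identity is taken
        // at face value; nothing stops us from claiming to be "hdfs".
        UserGroupInformation fakeUser =
                UserGroupInformation.createRemoteUser("hdfs");

        fakeUser.doAs((PrivilegedExceptionAction<Void>) () -> {
            FileSystem fs = FileSystem.get(conf);
            // Permission checks now run against "hdfs", not against
            // whoever actually launched this process.
            fs.delete(new Path("/some/protected/dir"), true);
            return null;
        });
    }
}
```

This is exactly the gap that Kerberos-based mutual authentication, discussed next, was introduced to close.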
The Hadoop community supports some security features through the current Kerberos implementation, the use of firewalls, and basic HDFS permissions and ACLs [5]. Kerberos is not a compulsory requirement for a Hadoop cluster, making it possible to run entire clusters without deploying or implementing any security. Kerberos is also not very easy to install and configure on the cluster, or to integrate with Active Directory (AD) and Lightweight Directory Access Protocol (LDAP) services [6]. This makes security problematic to implement, and thus limits the adoption of even the most basic security functions for users of Hadoop. Hadoop security is not properly addressed by firewalls: once a firewall is breached, the cluster is wide open for attack. Firewalls offer no protection for data at rest or in motion within the cluster. Firewalls also offer no protection from security failures that originate within the firewall perimeter [6]. An attacker who can enter the data centre, either physically or electronically, can steal the data they want, since the data is unencrypted and there is no authentication enforced for access [6, 10].
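As a point of reference for the basic HDFS permissions just mentioned, the sketch below sets UNIX-style permission bits and ownership through the public FileSystem API; the path, owner and group names are hypothetical, and changing ownership requires superuser privileges on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path report = new Path("/finance/q1-report.csv");

        // UNIX-style owner/group/other bits: rwxr-x--- (0750).
        fs.setPermission(report, new FsPermission((short) 0750));

        // Only "alice" and the "finance" group may now read the file;
        // setOwner succeeds only when run as the HDFS superuser.
        fs.setOwner(report, "alice", "finance");
    }
}
```

As the surrounding discussion notes, such checks are only meaningful once the asserted identity itself can be trusted.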

B. Security Threats

We have identified three categories of security violations: unauthorized release of information, unauthorized modification of information, and denial of resources. The following are the related areas of threat we identify in Hadoop [7]:

- An unauthorized user may access an HDFS file via the RPC or HTTP protocols and could execute arbitrary code or carry out further attacks.
- An unauthorized client may read/write a data block of a file at a DataNode via the pipeline streaming data-transfer protocol.
- An unauthorized client may gain access privileges and may submit a job to a queue, or delete or change the priority of a job.
- An unauthorized user may access the intermediate data of a Map job via its TaskTracker's HTTP shuffle protocol.
- A task in execution may use the host OS interfaces to access other tasks, or to access local data, which includes intermediate Map output or the local storage of the DataNode that runs on the same physical node.
- An unauthorized user may eavesdrop on/sniff data packets being sent by DataNodes to a client.
- A task or node may masquerade as a Hadoop service component such as a DataNode, NameNode, JobTracker, TaskTracker, etc.
- A user may submit a workflow to Oozie as another user.
- DataNodes enforced no access control, so an unauthorized user could read arbitrary data blocks from DataNodes, bypassing access control mechanisms/restrictions, or write garbage data to DataNodes [10].

III. SECURITY ISSUES

Hadoop presents a unique set of security issues for data centre managers and security professionals. These security issues are described below [5, 6]:

1) Fragmented Data: Big Data clusters contain data that portrays the quality of fluidity, allowing multiple copies to move to and fro between various nodes to ensure redundancy and resiliency. The data can be fragmented and shared across multiple servers. This fragmentation adds more complexity, which poses a security issue due to the absence of a security model.

2) Distributed Computing: Since the availability of resources leads to data being processed at any instant or instance where it is available, this progresses to large levels of parallel computation. As a result, complicated environments are created that are at higher risk of attack than their centrally managed, monolithic repository counterparts, for which security is easier to implement.

3) Controlling Data Access: Commissioned data environments provision access at the schema level, devoid of finer granularity in addressing proposed users in terms of roles and access-related scenarios. Many of the available database security schemas provide only role-based access.

4) Node-to-node communication: A concern with Hadoop and the variety of players available in this field is that they don't implement secure communication; they use RPC (Remote Procedure Call) over TCP/IP.

5) Client Interaction: A client communicates with the resource manager and data nodes. However, there is a catch: even though efficient communication is facilitated by this model, it makes it cumbersome to shield nodes from clients and vice versa, and also name servers from nodes. Compromised clients tend to propagate malicious data or links to either service.

6) Virtually no security: Big data stacks were designed with little or no security in mind. Prevailing big data installations are built on the web services model, with few or no facilities for preventing common web threats, making them highly susceptible.
IV. HADOOP SECURITY SOLUTION

Hadoop is a distributed system which allows us to store huge amounts of data and process that data in parallel. Hadoop is used as a multi-tenant service and stores sensitive data such as personally identifiable information or financial data. Other organizations, including financial organizations, using Hadoop are beginning to store sensitive data on Hadoop clusters. As a result, strong authentication and authorization are necessary [7].

The Hadoop ecosystem consists of various components, and we need to secure all of them. In this section, we look at the security of each of the ecosystem components and the security solution for each; every component has its own security challenges and issues and needs to be configured properly, based on its architecture, to secure it. Each of these Hadoop components has end users directly accessing the component or a backend service accessing the Hadoop core components (HDFS and MapReduce).

We have done a security analysis of the Hadoop components and a brief study of the built-in security of the Hadoop ecosystem, and we see that Hadoop security is not very strong. So, in this paper, we provide a security solution around the four security pillars, i.e. authentication, authorization, encryption and audits (which we summarize as 3ADE), for each of the ecosystem components. This section describes the four pillars (sufficient and necessary) that help secure the Hadoop cluster; we narrow our focus and take a deep dive into the built-in and our proposed security solution for the Hadoop ecosystem.

A. Authentication

Authentication is verifying the identity of the user or system accessing the system. Hadoop provides Kerberos as its primary authentication. Initially, SASL/GSSAPI was used to implement Kerberos and mutually authenticate users, their applications, and Hadoop services over RPC connections [7]. Hadoop also supports "pluggable" authentication for HTTP web consoles, meaning that implementers of web applications and web consoles can implement their own authentication mechanism for HTTP connections, including, but not limited to, HTTP SPNEGO authentication. The Hadoop components support the SASL framework, i.e. the RPC layer can be changed to support SASL-based mutual authentication, viz. SASL Digest-MD5 authentication or SASL GSSAPI/Kerberos authentication.
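As a concrete illustration of Kerberos authentication from client code, the following sketch uses Hadoop's UserGroupInformation API to log in from a keytab; the principal and keytab path are placeholders that would differ per deployment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Switch the client from "simple" to Kerberos authentication.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Non-interactive login with a service keytab; the principal
        // and keytab location below are illustrative placeholders.
        UserGroupInformation.loginUserFromKeytab(
                "etl-service@EXAMPLE.COM",
                "/etc/security/keytabs/etl.keytab");

        System.out.println("Logged in as: "
                + UserGroupInformation.getLoginUser().getUserName());
    }
}
```

Subsequent RPC calls made by this process are then mutually authenticated via SASL/GSSAPI, as described above.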

MapReduce supports Kerberos authentication, SASL Digest-MD5 authentication, and also delegation token authentication on RPC connections. In HDFS, communication between the NameNode and DataNodes is over an RPC connection, and mutual Kerberos authentication is performed between them [15]. HBase supports SASL Kerberos secure client authentication via RPC and HTTP. Hive supports Kerberos and LDAP authentication for user authentication, as well as authentication via Apache Knox, explained in Section V.

Pig uses the user's credentials to submit the job to Hadoop, so no additional Kerberos authentication is required; however, before starting Pig the user should authenticate with the KDC and get a valid Kerberos ticket [15]. Oozie provides user authentication to the Oozie web services. It also provides Kerberos HTTP Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO) authentication for web clients. The SPNEGO protocol is used when a client application wants to authenticate to a remote server but is not sure of the authentication protocols to use. Zookeeper supports SASL Kerberos authentication on RPC connections. Hue offers SPNEGO authentication and LDAP authentication, and it now also supports SAML SSO authentication [15].

There are a number of data flows involved in Hadoop authentication: the Kerberos RPC authentication mechanism is used for authenticating users, applications and Hadoop services; HTTP SPNEGO authentication is used for web consoles; and delegation tokens are used as well [10]. A delegation token is a two-party authentication protocol used between the user and the NameNode for authenticating users; it is simpler and more effective than the three-party protocol used by Kerberos [7, 15]. Oozie, HDFS and MapReduce support delegation tokens.

B. Authorization and ACLs

Authorization is the process of specifying access control privileges for a user or system. In Hadoop, access control is implemented using file-based permissions that follow the UNIX permissions model. Access control to files in HDFS is enforced by the NameNode based on file permissions and the ACLs of users and groups. MapReduce provides ACLs for job queues that define which users or groups can submit jobs to a queue and change queue properties. Hadoop offers fine-grained authorization using file permissions in HDFS, resource-level access control using ACLs for MapReduce, and coarser-grained access control at the service level [13]. HBase offers user authorization on tables and column families. This user authorization is implemented using coprocessors, which are like database triggers in HBase [15]. They intercept any request to the table before and after it is handled; we can now use Project Rhino (Section V) to extend HBase with support for cell-level ACLs.
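Returning to the MapReduce job ACLs described above, the sketch below shows how a job submitter might set them through standard configuration properties; the user and group names are hypothetical, and enforcement additionally requires ACLs to be enabled cluster-wide via mapreduce.cluster.acls.enabled.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobAclExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Who may view this job's counters and logs: users "alice" and
        // "bob", plus members of the "analysts" group ("users groups").
        conf.set("mapreduce.job.acl-view-job", "alice,bob analysts");

        // Who may kill the job or change its priority.
        conf.set("mapreduce.job.acl-modify-job", "alice");

        Job job = Job.getInstance(conf, "acl-demo");
        // ... configure mapper, reducer and I/O paths as usual ...
    }
}
```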
In Hive, authorization is implemented using Apache Sentry (Section V). Pig provides authorization using ACLs for job queues; Zookeeper also offers authorization using node ACLs. Hue provides access control via file system permissions, and it also offers ACLs for job queues.

Although Hadoop can be set up to perform access control via user and group permissions and Access Control Lists (ACLs), this may not be sufficient for every organization. Nowadays, many organizations use flexible and dynamic access control policies based on XACML and Attribute-Based Access Control (ABAC) [10, 13]. Hadoop can now be configured to support RBAC and ABAC access control using third-party frameworks or tools, some of which are discussed in Section V. Some of Hadoop's components, such as HDFS, can offer ABAC using Apache Knox, and Hive can support role-based access control using Apache Sentry. Zettaset Orchestrator, a product by Zettaset, provides role-based access control support and enables Kerberos to be seamlessly integrated into the Hadoop ecosystem [6, 15].

TABLE I: ANALYSIS OF SECURITY SOLUTIONS FOR HADOOP COMPONENTS

Component  | Authentication                              | Authorization                     | Encryption of data at rest | Encryption of data in transit       | Audit trails
MapReduce  | Kerberos, SASL framework, delegation tokens | Job and queue ACLs                | ---                        | RPC (SASL), HTTPS                   | Yes (base audit)
HDFS       | Kerberos, delegation tokens                 | File permissions, ACLs            | AES, OS level              | RPC (SASL), data transfer protocol  | Yes (base audit)
HBase      | Kerberos, SASL (secure RPC)                 | ACLs on tables, columns, families | Third-party solution       | SASL (secure RPC)                   | No (but third-party solution can be used)
Hive       | Kerberos, LDAP, Apache Knox                 | Apache Sentry                     | Third-party solution       | Third-party solution                | Yes (Hive metastore)
Pig        | User authentication at KDC (Kerberos ticket)| ACLs (job queues)                 | Third-party solution       | SASL                                | Third-party solution
Oozie      | Kerberos, SPNEGO, delegation tokens         | ACLs                              | Third-party solution       | SSL/TLS                             | Yes (services)
Zookeeper  | SASL Kerberos                               | Node ACLs                         | N/A                        | Third-party solution                | Third-party solution
Hue        | SPNEGO, LDAP, SAML SSO                      | ACLs and FS permissions           | Third-party solution       | HTTPS                               | Yes (Hue logs)

C. Encryption

Encryption ensures the confidentiality and privacy of user information, and it secures the sensitive data in Hadoop. Hadoop is a distributed system running on distinct machines, which means that data must be transmitted over the network on a regular basis, and there is an increasing demand to move sensitive information into the Hadoop ecosystem to generate valuable insights. Sensitive data within the cluster needs a special kind of protection and should be secured both at rest and in motion [10]. This data needs to be protected during transfer to and from the Hadoop system. The Simple Authentication and Security Layer (SASL) authentication framework is used for encrypting data in motion in the Hadoop ecosystem. SASL security guarantees that the data being exchanged between clients and servers is not readable by a "man in the middle". SASL supports various authentication mechanisms, for example DIGEST-MD5, CRAM-MD5, etc.

The data at rest can be protected in two ways. First, when a file is stored in Hadoop, the complete file can be encrypted first and then stored; in this approach, the data blocks in each DataNode can't be decrypted until we put all the blocks back together and recreate the entire encrypted file. Second, encryption can be applied to data blocks once they are loaded into the Hadoop system [15].

Hadoop supports encryption capability for various channels such as RPC, HTTP, and the Data Transfer Protocol for data in motion. The Hadoop Crypto Codec framework and crypto codec implementations have been introduced to support data-at-rest encryption. HDFS supports AES and OS-level encryption for data at rest. Zookeeper, Oozie, Hive, HBase, Pig and Hue don't offer a data-at-rest encryption solution, but for these components encryption can be implemented via custom encryption techniques or third-party tools such as Gazzang's zNcrypt, using the crypto codec framework. File-system-level security, and encryption and decryption of files on the fly, can be achieved using eCryptfs and Gazzang's zNcrypt tools, which are security solutions commercially available for Hadoop clusters [10, 13, 15].

To protect data in transit and at rest, encryption and masking techniques can be implemented. Tools such as IBM Optim and Dataguise provide data masking for enterprise data [15]. Intel's distribution offers encryption and compression of files [15]. Project Rhino enables block-level encryption similar to Dataguise and Gazzang [5].
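As a sketch of how the in-transit protections above are switched on, the following sets the standard Hadoop properties for RPC privacy and HDFS data-transfer encryption; on a real cluster these belong in core-site.xml and hdfs-site.xml so that every daemon picks them up, rather than in client code.

```java
import org.apache.hadoop.conf.Configuration;

public class WireEncryptionConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // "privacy" selects the SASL quality-of-protection that both
        // authenticates and encrypts RPC traffic ("authentication"
        // and "integrity" are the weaker alternatives).
        conf.set("hadoop.rpc.protection", "privacy");

        // Encrypt the HDFS Data Transfer Protocol used between
        // clients and DataNodes for block reads and writes.
        conf.setBoolean("dfs.encrypt.data.transfer", true);
    }
}
```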
D. Audit Trails

A Hadoop cluster hosts sensitive information, and securing this information is of the utmost importance for organizations to have a successful, secure big data journey. There is always a possibility of security breaches through unintended, unauthorized access or inappropriate access by privileged users [13]. So, to meet security compliance requirements, we need to audit the entire Hadoop ecosystem on a periodic basis and deploy or implement a system that does log monitoring.

HDFS and MapReduce provide base audit support. The Apache Hive metastore maintains audit (who/when) information for Hive interactions [13, 15]. Apache Oozie, the workflow engine, provides an audit trail for services; workflow submission is logged in Oozie log files. Hue also supports audit logs. For those Hadoop components which do not provide built-in audit logging, we can use audit log monitoring tools. Scribe and LogStash are open source tools that integrate into most big data environments, as do a number of commercial products. One just needs to find a compatible tool, install it, integrate it with other systems such as log management, and then actually review the results to see what could have gone wrong. Cloudera Navigator by Cloudera is a popular commercial tool that provides audit logging for big data environments. Zettaset Orchestrator also provides configuration management, logging, and auditing support [6, 15].

V. SECURITY TECHNOLOGIES - SOLUTIONS FOR SECURING HADOOP

In this section we give an overview of the various commercial and open source technologies that are available to address the various security aspects of big data Hadoop [15].

A. Apache Sentry

Apache Sentry, an open source project by Cloudera, is an authorization module for Hadoop that offers the granular, role-based authorization required to provide precise levels of access to the right users and applications. It supports role-based authorization, fine-grained authorization, and multi-tenant administration [11, 15].

B. Apache Knox

The Apache Knox Gateway is a system that provides a single point of authentication and access for various Hadoop services in a cluster. It provides a perimeter security solution for Hadoop. A second advantage is that it supports various authentication and token verification scenarios. It manages security across multiple clusters and versions of Hadoop. It also provides SSO solutions and allows integrating other identity management solutions such as LDAP, Active Directory (AD), SAML-based SSO and other SSO systems [9].

C. Project Rhino

Project Rhino provides an integrated end-to-end data security solution for the Hadoop ecosystem. It provides a token-based authentication and SSO solution. It offers the Hadoop crypto codec framework and crypto codec implementations to provide block-level encryption for data stored in Hadoop. It supports key distribution and management so that MapReduce can decrypt data blocks and execute programs as required. It also enhances the security of HBase by offering cell-level authentication and transparent encryption for tables stored in Hadoop. It supports an audit logging framework for easy audit trails [15].
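To make the Knox gateway model of subsection B concrete, the sketch below issues a WebHDFS directory listing through a Knox endpoint using the Java 11 HTTP client; the gateway host, "sandbox" topology and demo credentials are illustrative placeholders, and a real gateway must present a certificate the client trusts.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class KnoxWebHdfsExample {
    public static void main(String[] args) throws Exception {
        // Knox proxies WebHDFS behind a single authenticated endpoint:
        // https://<gateway>:8443/gateway/<topology>/webhdfs/v1/<path>
        String url = "https://knox.example.com:8443/gateway/sandbox"
                + "/webhdfs/v1/tmp?op=LISTSTATUS";

        String credentials = Base64.getEncoder()
                .encodeToString("guest:guest-password".getBytes());

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Basic " + credentials)
                .GET()
                .build();

        // The cluster itself is never reached directly; Knox
        // authenticates the caller and forwards the request.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```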

VI. CONCLUSION

In the Big Data era, where data is accumulated from various sources, security is a major concern (a critical requirement), as there is no fixed source of data. With Hadoop gaining wider acceptance within the industry, a natural concern over security has spread, and a growing need to accept and assimilate these security solutions and commercial security features has surfaced. In this paper, we have tried to cover all the security solutions available to secure the Hadoop ecosystem.

REFERENCES
[1] Cloud Security Alliance, "Top Ten Big Data Security and Privacy Challenges".
[2] Tom White, "Hadoop: The Definitive Guide", O'Reilly Media / Yahoo! Press.
[3] Owen O'Malley, Kan Zhang, Sanjay Radia, Ram Marti, and Christopher Harrell, "Hadoop Security Design".
[4] Mike Ferguson, "Enterprise Information Protection - The Impact of Big Data".
[5] Vormetric, "Securing Big Data: Security Recommendations for Hadoop and NoSQL Environments", October 12, 2012.
[6] Zettaset, "The Big Data Security Gap: Protecting the Hadoop Cluster".
[7] Devaraj Das, Owen O'Malley, Sanjay Radia, and Kan Zhang, "Adding Security to Apache Hadoop".
[8] Seref Sagiroglu and Duygu Sinanc, "Big Data: A Review", International Conference on Collaboration Technologies and Systems (CTS), May 2013.
[9] Hortonworks, "Technical Preview for Apache Knox Gateway".
[10] Kevin T. Smith, "Big Data Security: The Evolution of Hadoop's Security Model".
[11] M. Tim Jones, "Hadoop Security and Sentry".
[12] Victor L. Voydock and Stephen T. Kent, "Security Mechanisms in High-Level Network Protocols", ACM Comput. Surv., 1983.
[13] Vinay Shukla, "Hadoop Security: Today and Tomorrow".
[14] Mahadev Satyanarayanan, "Integrating Security in a Large Distributed System", ACM Trans. Comput. Syst., 1989.
[15] Sudheesh Narayanan, "Securing Hadoop: Implement robust end-to-end security for your Hadoop ecosystem", Packt Publishing.
[16] S. Singh and N. Singh, "Big Data Analytics", 2012 International Conference on Communication, Information & Computing Technology (ICCICT), Mumbai, India, IEEE.
[17] Jeff Hurt, "Three Vs of Big Data as Applied to Conferences", jeffhurtblog.com, July 7, 2012.
