Security Issues For Cloud Computing

Transcription

International Journal of Information Security and Privacy, 4(2), 39-51, April-June 2010
DOI: 10.4018/jisp.2010040103

Security Issues for Cloud Computing

Kevin Hamlen, The University of Texas at Dallas, USA
Murat Kantarcioglu, The University of Texas at Dallas, USA
Latifur Khan, The University of Texas at Dallas, USA
Bhavani Thuraisingham, The University of Texas at Dallas, USA

Abstract

In this paper, the authors discuss security issues for cloud computing and present a layered framework for secure clouds, then focus on two of the layers: the storage layer and the data layer. In particular, the authors discuss a scheme for secure third-party publication of documents in a cloud. Next, the paper discusses secure federated query processing with MapReduce and Hadoop, and the use of secure co-processors for cloud computing. Finally, the authors discuss an XACML implementation for Hadoop and their belief that building trusted applications from untrusted components will be a major aspect of secure cloud computing.

Keywords: Access Control, Authentication, Bottom-Up Design, Data Mining, Information Processing, Ontology Theory, Real-Time IS

Introduction

There is a critical need to securely store, manage, share and analyze massive amounts of complex (e.g., semi-structured and unstructured) data to determine patterns and trends in order to improve the quality of healthcare, better safeguard the nation and explore alternative energy. Because of the critical nature of these applications, it is important that clouds be secure. The major security challenge with clouds is that the owner of the data may not have control of where the data is placed. This is because if one wants to exploit the benefits of cloud computing, one must also utilize the resource allocation and scheduling provided by clouds. Therefore, we need to safeguard the data in the midst of untrusted processes.

The emerging cloud computing model attempts to address the explosive growth of web-connected devices and to handle massive amounts of data. Google has now introduced the MapReduce framework for processing large amounts of data on commodity hardware. Apache's Hadoop distributed file system (HDFS) is emerging as a superior software component for cloud computing, combined with integrated parts such as MapReduce.

The need to augment human reasoning, interpreting, and decision-making abilities has resulted in the emergence of the Semantic Web, an initiative that attempts to transform the web from its current, merely human-readable form into a machine-processable form. This in turn has resulted in numerous social networking sites with massive amounts of data to be shared and managed. Therefore, we urgently need a system that can scale to handle a large number of sites and process massive amounts of data. However, state-of-the-art systems utilizing HDFS and MapReduce are not sufficient because they do not provide adequate security mechanisms to protect sensitive data.

We are conducting research on secure cloud computing. Due to the extensive complexity of the cloud, we contend that it will be difficult to provide a holistic solution to securing the cloud at present. Therefore, our goal is to make incremental enhancements to securing the cloud that will ultimately result in a secure cloud. In particular, we are developing a secure cloud consisting of hardware (including 800 TB of mechanical disk storage, 2400 GB of memory and several commodity computers), software (including Hadoop) and data (including a semantic web data repository). Our cloud system will: (a) support efficient storage of encrypted sensitive data, (b) store, manage and query massive amounts of data, (c) support fine-grained access control and (d) support strong authentication. This paper describes our approach to securing the cloud. The organization of this paper is as follows: In section 2, we will give an overview of security issues for clouds. In section 3, we will discuss secure third-party publication of data in clouds. In section 4, we will discuss how encrypted data may be queried. Section 5 will discuss Hadoop for cloud computing and our approach to secure query processing with Hadoop. The paper is concluded in section 6.

Security Issues for Clouds

There are numerous security issues for cloud computing, as it encompasses many technologies including networks, databases, operating systems, virtualization, resource scheduling, transaction management, load balancing, concurrency control and memory management. Therefore, security issues for many of these systems and technologies are applicable to cloud computing. For example, the network that interconnects the systems in a cloud has to be secure. Furthermore, the virtualization paradigm in cloud computing results in several security concerns. For example, mapping the virtual machines to the physical machines has to be carried out securely. Data security involves encrypting the data as well as ensuring that appropriate policies are enforced for data sharing. In addition, resource allocation and memory management algorithms have to be secure. Finally, data mining techniques may be applicable to malware detection in clouds.

We have extended the technologies and concepts we have developed for the secure grid to a secure cloud. We have defined a layered framework for assured cloud computing consisting of the secure virtual machine layer, the secure cloud storage layer, the secure cloud data layer, and the secure virtual network monitor layer (Figure 1).

Figure 1. Layered framework for assured cloud

Cross-cutting services are provided by the policy layer, the cloud monitoring layer, the reliability layer and the risk analysis layer.

For the Secure Virtual Machine (VM) Monitor, we are combining both hardware and software solutions in virtual machines to handle problems such as key loggers, examining XEN (developed at the University of Cambridge) and exploring security to meet the needs of our applications (e.g., secure distributed storage and data management).

For Secure Cloud Storage Management, we are developing a storage infrastructure which integrates resources from multiple providers to form a massive virtual storage system. When a storage node hosts the data from multiple domains, a VM will be created for each domain to isolate the information and the corresponding data processing. Since data may be dynamically created and allocated to storage nodes, we are investigating secure VM management services including VM pool management, VM diversification management, and VM access control management. Hadoop and MapReduce are the technologies being used. For Secure Cloud Data Management, we have developed secure query processing algorithms for RDF (Resource Description Framework) and SQL (HIVE) data in clouds with an XACML-based (eXtensible Access Control Markup Language) policy manager utilizing the Hadoop/MapReduce framework. For Secure Cloud Network Management, our goal is to implement a Secure Virtual Network Monitor (VNM) that will create end-to-end virtual links with the requested bandwidth, as well as monitor the computing resources. Figure 2 illustrates the technologies we are utilizing for each of the layers.

Figure 2. Layered framework for assured cloud

This project is being carried out in close collaboration with the AFOSR MURI project on Assured Information Sharing and the EOARD-funded research project on policy management for information sharing. We have completed a robust demonstration of secure query processing. We have also developed secure storage algorithms and completed the design of XACML for Hadoop. Since Yahoo has come up with a secure Hadoop, we can now implement our design. We have also developed access control and accountability for the cloud.

In this paper, we will focus only on some aspects of the secure cloud, namely aspects of the cloud storage and data layers. In particular, (i) we describe ways of efficiently storing the data in foreign machines, (ii) we discuss querying encrypted data, as much of the data on the cloud may be encrypted, and (iii) we discuss secure query processing of the data. We are using the Hadoop distributed file system for virtualization at the storage level and applying security for Hadoop, which includes an XACML implementation. In addition, we are investigating secure federated query processing on clouds over Hadoop. These efforts will be described in the subsequent sections. Subsequent papers will describe the design and implementation of each of the layers.

Third Party Secure Data Publication Applied to Cloud

Cloud computing facilitates storage of data at a remote site to maximize resource utilization. As a result, it is critical that this data be protected and only given to authorized individuals. This essentially amounts to secure third-party publication of data, which is necessary for data outsourcing as well as for external publications. We have developed techniques for third-party publication of data in a secure manner. We assume that the data is represented as an XML document. This is a valid assumption, as many of the documents on the web are now represented as XML documents. First, we discuss the access control framework proposed in Bertino (2002) and then discuss the secure third-party publication scheme discussed in Bertino (2004).

In the access control framework proposed in Bertino (2002), the security policy is specified depending on user roles and credentials (see Figure 3). Users must possess the credentials to access XML documents. The credentials depend on their roles. For example, a professor has access to all of the details of students, while a secretary only has access to administrative information. XML specifications are used to specify the security policies. Access is granted for an entire XML document or for portions of the document. Under certain conditions, access control may be propagated down the XML tree.

Figure 3. Access control framework

For example, if access is granted to the root, it does not necessarily mean access is granted to all the children. One may grant access to the XML schema and not to the document instances. One may grant access to certain portions of the document. For example, a professor does not have access to the medical information of students, while he has access to student grade and academic information. The design of a system for enforcing access control policies is also described in Bertino (2002). Essentially, the goal is to use a form of view modification so that the user is authorized to see the XML views as specified by the policies. More research needs to be done on role-based access control for XML and the semantic web.
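
As a simple illustration of this view-modification idea (the sketch below is ours, not the system described in Bertino (2002); the element names and the role-to-element policy table are hypothetical), the following Python fragment prunes a student record down to the portions a given role may see:

    # Illustrative sketch of policy-based view modification for XML.
    # Element names and role policies are hypothetical.
    import xml.etree.ElementTree as ET

    # Each role is mapped to the set of child elements it may view.
    POLICY = {
        "professor": {"name", "grades", "academic", "administrative"},
        "secretary": {"name", "administrative"},
    }

    def authorized_view(xml_text, role):
        """Return a copy of the document containing only elements the role may see."""
        root = ET.fromstring(xml_text)
        allowed = POLICY.get(role, set())
        for child in list(root):          # iterate over a copy so we can remove safely
            if child.tag not in allowed:
                root.remove(child)
        return ET.tostring(root, encoding="unicode")

    record = """<student>
      <name>Alice</name>
      <grades>A, A-, B+</grades>
      <academic>PhD track</academic>
      <administrative>ID 1234</administrative>
      <medical>confidential</medical>
    </student>"""

    print(authorized_view(record, "secretary"))   # name and administrative only
    print(authorized_view(record, "professor"))   # everything except medical

A real enforcement system would express such policies in XML, attach them to roles and credentials, and handle propagation down the document tree; the sketch shows only the basic filtering step.
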

In Bertino (2004), we discuss the secure publication of XML documents (see Figure 4). The idea is to have untrusted third-party publishers. The owner of a document specifies access control policies for the subjects. Subjects get the policies from the owner when they subscribe to a document. The owner sends the documents to the publisher. When the subject requests a document, the publisher will apply the policies relevant to the subject and give portions of the documents to the subject. Now, since the publisher is untrusted, it may give false information to the subject. Therefore, the owner will encrypt various combinations of documents and policies with his/her private key. Using Merkle signatures and the encryption techniques, the subject can verify the authenticity and completeness of the document (see Figure 4 for secure publishing of XML documents).

Figure 4. Secure third party publication

In the cloud environment, the third-party publisher is the machine that stores the sensitive data in the cloud. This data has to be protected, and the techniques we have discussed above have to be applied so that authenticity and completeness can be maintained.
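
To convey the flavor of such a verification, the following sketch (ours; it uses a plain SHA-256 Merkle hash and omits the owner's signature and the per-subtree proofs of the actual Bertino (2004) scheme) shows how a subject could detect an incomplete or tampered response from an untrusted publisher by recomputing a root hash vouched for by the owner:

    # Minimal Merkle-hash sketch: the owner vouches for (in the real scheme, signs)
    # only the root hash; the subject recomputes the root over what the publisher
    # returned and compares. Illustrative only.
    import hashlib

    def h(data):
        return hashlib.sha256(data).digest()

    def merkle_root(leaves):
        level = [h(x) for x in leaves]
        while len(level) > 1:
            if len(level) % 2:                 # duplicate the last node on odd levels
                level.append(level[-1])
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    # Owner side: document and policy portions are the leaves.
    portions = [b"<title/>", b"<body/>", b"<policy role='prof'/>", b"<policy role='sec'/>"]
    trusted_root = merkle_root(portions)       # signed by the owner in the real scheme

    # Subject side: verify what the untrusted publisher returned.
    def response_is_valid(returned_portions):
        return merkle_root(returned_portions) == trusted_root

    print(response_is_valid(portions))                      # True
    print(response_is_valid(portions[:3]))                  # False: incomplete
    print(response_is_valid([b"<forged/>"] + portions[1:])) # False: tampered
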

Encrypted Data Storage for Cloud

Since data in the cloud will be placed anywhere, it is important that the data is encrypted. We are using secure co-processors as part of the cloud infrastructure to enable efficient encrypted storage of sensitive data. One could ask us the question: why not implement your software on hardware provided by current cloud computing systems such as Open Cirrus? We have explored this option. First, Open Cirrus provides limited access based on their economic model (e.g., virtual cash). Furthermore, Open Cirrus does not provide the hardware support we need (e.g., secure co-processors). By embedding a secure co-processor (SCP) into the cloud infrastructure, the system can handle encrypted data efficiently (see Figure 5).

Figure 5. Parts of the proposed instrument

Basically, an SCP is tamper-resistant hardware capable of limited general-purpose computation. For example, the IBM 4758 Cryptographic Coprocessor (IBM) is a single-board computer consisting of a CPU, memory and special-purpose cryptographic hardware contained in a tamper-resistant shell, certified to level 4 under FIPS PUB 140-1. When installed on the server, it is capable of performing local computations that are completely hidden from the server. If tampering is detected, then the secure co-processor clears the internal memory. Since the secure co-processor is tamper-resistant, one could be tempted to run the entire sensitive data storage server on the secure co-processor. However, pushing the entire data storage functionality into a secure co-processor is not feasible for several reasons.

First of all, due to the tamper-resistant shell, secure co-processors usually have limited memory (only a few megabytes of RAM and a few kilobytes of non-volatile memory) and computational power (Smith, 1999). Performance will improve over time, but problems such as heat dissipation and power use (which must be controlled to avoid disclosing processing) will force a gap between general-purpose and secure computing. Another issue is that the software running on the SCP must be totally trusted and verified. This security requirement implies that the software running on the SCP should be kept as simple as possible. So how does this hardware help in storing large sensitive data sets? We can encrypt the sensitive data sets using random private keys and, to alleviate the risk of key disclosure, we can use the tamper-resistant hardware to store some of the encryption/decryption keys (i.e., a master key that encrypts all other keys). Since the keys will not reside in memory unencrypted at any time, an attacker cannot learn the keys by taking a snapshot of the system. Also, any attempt by the attacker to take control of (or tamper with) the co-processor, either through software or physically, will clear the co-processor, thus eliminating a way to decrypt any sensitive information. This framework will facilitate (a) secure data storage and (b) assured information sharing. For example, SCPs can be used for privacy-preserving information integration, which is important for assured information sharing.
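
As a concrete illustration of this key-management pattern (a minimal sketch of our own, using the third-party Python cryptography package; the SCP holding the master key is simulated by an ordinary in-memory object), each sensitive data set is encrypted under its own random key and only the wrapped key is stored alongside the ciphertext:

    # Envelope-encryption sketch: a master key (which would live only inside the
    # tamper-resistant co-processor) wraps per-dataset keys; the cloud stores only
    # ciphertext and wrapped keys. Requires the third-party 'cryptography' package.
    from cryptography.fernet import Fernet

    class SimulatedSCP:
        """Stands in for the secure co-processor holding the master key."""
        def __init__(self):
            self._master = Fernet(Fernet.generate_key())   # never leaves the SCP
        def wrap_key(self, data_key):
            return self._master.encrypt(data_key)
        def unwrap_key(self, wrapped):
            return self._master.decrypt(wrapped)

    scp = SimulatedSCP()

    # Encrypt a sensitive data set under a fresh random key.
    data_key = Fernet.generate_key()
    ciphertext = Fernet(data_key).encrypt(b"patient record 42: ...")
    wrapped_key = scp.wrap_key(data_key)   # only this is stored alongside the data
    del data_key                           # the plaintext key is discarded

    # Later, an authorized query asks the SCP to unwrap the key for decryption.
    plaintext = Fernet(scp.unwrap_key(wrapped_key)).decrypt(ciphertext)
    print(plaintext)
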

We have conducted research on querying encrypted data as well as secure multiparty computation (SMC). With SMC protocols, each party knows its own data but not its partner's data, since the data is encrypted. However, operations can be performed on the encrypted data, and the results of the operations are available for everyone in, say, the coalition to see. One drawback of SMC is the high computation cost. However, we are investigating more efficient ways to develop SMC algorithms and how these mechanisms can be applied to a cloud.

Secure Query Processing with Hadoop

Overview of Hadoop

A major part of our system is HDFS, which is a distributed Java-based file system with the capacity to handle a large number of nodes storing petabytes of data. Ideally, a file size is a multiple of 64 MB. Reliability is achieved by replicating the data across several hosts. The default replication value is 3 (i.e., data is stored on three nodes): two of these nodes reside on the same rack while the other is on a different rack. A cluster of data nodes constructs the file system. The nodes transmit data over HTTP and clients access data using a web browser. Data nodes communicate with each other to regulate, transfer and replicate data.

The HDFS architecture is based on the master-slave approach (Figure 6). The master is called a Namenode and contains metadata. It keeps the directory tree of all files and tracks which data is available from which node across the cluster. This information is stored as an image in memory. Data blocks are stored in Datanodes. The Namenode is a single point of failure, as it contains the metadata; therefore, an optional secondary Namenode can be set up on any machine. The client accesses the Namenode to get the metadata of the required file. After getting the metadata, the client talks directly to the respective Datanodes in order to get data or to perform I/O actions (Hadoop). On top of the file system sits the MapReduce engine, which consists of a JobTracker. Client applications submit MapReduce jobs to this engine. The JobTracker attempts to place the work near the data by pushing the work out to the available TaskTracker nodes in the cluster.

Figure 6. Hadoop Distributed File System (HDFS) architecture
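
The default placement just described (three replicas, with two on one rack and one on another) can be sketched as follows; the node and rack names are made up, and real HDFS applies further constraints such as node load when choosing targets:

    # Simplified sketch of HDFS-style replica placement for a replication factor
    # of 3: first replica on the writer's node, second on another node in the
    # same rack, third on a node in a different rack. Names are hypothetical.
    import random

    CLUSTER = {
        "rack1": ["node1", "node2", "node3"],
        "rack2": ["node4", "node5", "node6"],
    }

    def place_replicas(writer_node):
        writer_rack = next(r for r, nodes in CLUSTER.items() if writer_node in nodes)
        same_rack = [n for n in CLUSTER[writer_rack] if n != writer_node]
        other_racks = [n for r, nodes in CLUSTER.items() if r != writer_rack for n in nodes]
        return [writer_node, random.choice(same_rack), random.choice(other_racks)]

    print(place_replicas("node2"))   # e.g. ['node2', 'node1', 'node5']
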
Inadequacies of Hadoop

Current systems utilizing Hadoop have the following limitations:

(1) No facility to handle encrypted sensitive data: Sensitive data ranging from medical records to credit card transactions needs to be stored using encryption techniques for additional protection. Currently, HDFS does not perform secure and efficient query processing over encrypted data.

(2) Semantic Web Data Management: There is a need for viable solutions to improve the performance and scalability of queries against semantic web data such as RDF (Resource Description Framework). The number of RDF datasets is increasing. The problem of storing billions of RDF triples and the ability to efficiently query them is yet to be solved (Muys, 2006; Teswanich, 2007; Ramanujam, 2009). At present, there is no support to store and retrieve RDF data in HDFS.

(3) No fine-grained access control: HDFS does not provide fine-grained access control. There is some work to provide access control lists for HDFS (Zhang, 2009). For many applications such as assured information sharing, access control lists are not sufficient, and there is a need to support more complex policies.

(4) No strong authentication: A user who can connect to the JobTracker can submit any job with the privileges of the account used to set up the HDFS. Future versions of HDFS will support network authentication protocols like Kerberos for user authentication and encryption of data transfers (Zhang, 2009). However, for some assured information sharing scenarios, we will need public key infrastructures (PKI) to provide digital signature support.

System Design

While the secure co-processors can provide the hardware support to query and store the data, we need to develop a software system to store, query, and mine the data. More and more applications are now using semantic web data such as XML and RDF due to their representation power, especially for web data management. Therefore, we are exploring ways to securely query semantic web data such as RDF data on the cloud. We are using several software tools that are available to help us in the process, including the following:

Jena: Jena is a framework which is widely used for solving SPARQL queries over RDF data (Jena). But the main problem with Jena is scalability: it scales in proportion to the size of main memory and does not support distributed processing. However, we will be using Jena in the initial stages of our preprocessing steps.

Pellet: We use Pellet to reason at various stages. We do real-time query reasoning using Pellet libraries (Pellet) coupled with Hadoop's MapReduce functionalities.

Pig Latin: Pig Latin is a scripting language which runs on top of Hadoop (Gates, 2009). Pig is a platform for analyzing large data sets, and its language, Pig Latin, facilitates sequences of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. It comes with many built-in functions, but we can also create our own user-defined functions to do special-purpose processing. By using this scripting language, we will avoid writing our own map-reduce code; instead, we will rely on Pig Latin's scripting power, which automatically translates scripts into MapReduce code.

Mahout, Hama: These are open source data mining and machine learning packages that already augment Hadoop (Mahout) (Hama) (Moretti, 2008).

Our approach consists of processing SPARQL queries securely over Hadoop. SPARQL is a query language used to query RDF data (W3C, SPARQL). The software part we will develop is a framework to query RDF data distributed over Hadoop (Newman, 2008; McNabb, 2007). There are a number of steps to preprocess and query RDF data (see Figure 7). With this proposed part, researchers can obtain results to optimize query processing of massive amounts of data. Below we discuss the steps involved in the development of this part.

Figure 7. System architecture for SPARQL query optimization

Pre-processing: Generally, RDF data is in XML format (see the Lehigh University Benchmark [LUBM] RDF data). In order to execute a SPARQL query, we propose some data pre-processing steps and store the pre-processed data into HDFS. We have an N-triple Convertor module which converts the RDF/XML format of the data into N-triple format, as this format is more understandable. We will use the Jena framework, as stated earlier, for this conversion purpose. In the Predicate Based File Splitter module, we split all N-triple format files based on the predicates; therefore, the total number of files for a dataset is equal to the number of predicates in the ontology/taxonomy. In the last module of the pre-processing step, we further divide predicate files on the basis of the type of object they contain, so that each predicate file has specific types of objects in it. This is done with the help of the Pellet library. This pre-processed data is stored into Hadoop.
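
A minimal sketch of the Predicate Based File Splitter step is shown below (the triples and the in-memory grouping are illustrative only; the actual module writes each predicate group to a separate file in HDFS):

    # Sketch of predicate-based splitting of N-triples: triples are grouped by
    # predicate, one group per output file in the real system. Illustrative only.
    from collections import defaultdict

    ntriples = """\
    <http://ex.org/univ0> <http://ex.org/ont#hasAlumnus> <http://ex.org/alice> .
    <http://ex.org/alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://ex.org/ont#Person> .
    <http://ex.org/univ0> <http://ex.org/ont#hasAlumnus> <http://ex.org/bob> .
    """

    def split_by_predicate(lines):
        groups = defaultdict(list)
        for line in lines:
            if not line.strip():
                continue
            subj, pred, rest = line.split(maxsplit=2)   # N-triples: subject predicate object .
            groups[pred].append(line)
        return groups

    for predicate, triples in split_by_predicate(ntriples.splitlines()).items():
        print(predicate, "->", len(triples), "triple(s)")
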

Query Execution and Optimization: We are developing a SPARQL query execution and optimization module for Hadoop. As our storage strategy is based on predicate splits, we will first look at the predicates present in the query. Second, rather than looking at all of the input files, we will look at the subset of the input files that match those predicates. Third, SPARQL queries generally have many joins in them, and all of these joins may not be possible to perform in a single Hadoop job. Therefore, we will devise an algorithm that decides the number of jobs required for each kind of query. As part of optimization, we will apply a greedy strategy and cost-based optimization to reduce query processing time. An example of the greedy strategy is to cover the maximum number of possible joins in a single job. For the cost model, the join to be performed first is chosen based on summary statistics (e.g., the selectivity factor of a bounded variable, or the join triple selectivity factor for three triple patterns). For example, consider a query on the LUBM dataset: "List all persons who are alumni of a particular university." In SPARQL:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    SELECT ?X WHERE {
      ?X rdf:type ub:Person .
      <http://www.University0.edu> ub:hasAlumnus ?X
    }

The query optimizer will take this query as input and decide on a subset of input files to look at, based on the predicates that appear in the query. The ontology and the Pellet reasoner will identify three input files (underGraduateDegreeFrom, masterDegreeFrom and doctoralDegreeFrom) related to the predicate "hasAlumnus". Next, from the type file we filter all the records whose objects are a subclass of Person, using the Pellet library. From these three input files (underGraduateDegreeFrom, masterDegreeFrom and doctoralDegreeFrom), the optimizer filters out triples on the basis of <http://www.University0.edu>, as required in the query. Finally, the optimizer determines that a single job is required for this type of query, and the join is then carried out on the variable X in that job.
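
The following simplified sketch (ours) mimics the optimizer's first steps for this query: expanding the query predicates into the matching predicate files (a small hard-coded table stands in for the ontology and the Pellet reasoner) and estimating the number of Hadoop jobs from the join variables:

    # Simplified sketch of predicate-file selection and the job-count decision.
    # The ontology expansion table and the job heuristic are hypothetical stand-ins
    # for the Pellet reasoning and summary statistics used by the real module.

    ONTOLOGY_EXPANSION = {
        # hasAlumnus is inferred from the three degree predicates in the LUBM ontology.
        "ub:hasAlumnus": ["underGraduateDegreeFrom", "masterDegreeFrom", "doctoralDegreeFrom"],
        "rdf:type": ["type"],
    }

    QUERY_PATTERNS = [
        ("?X", "rdf:type", "ub:Person"),
        ("<http://www.University0.edu>", "ub:hasAlumnus", "?X"),
    ]

    def select_input_files(patterns):
        files = set()
        for _, predicate, _ in patterns:
            files.update(ONTOLOGY_EXPANSION.get(predicate, [predicate]))
        return files

    def jobs_needed(patterns):
        # Heuristic: triple patterns sharing join variables can be joined in one
        # MapReduce job; here a single shared variable (?X) means a single job.
        join_vars = {term for p in patterns for term in p if term.startswith("?")}
        return max(1, len(join_vars))

    print(select_input_files(QUERY_PATTERNS))   # the three degree files plus the type file
    print(jobs_needed(QUERY_PATTERNS), "Hadoop job(s)")
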
With respect to secure query processing, we are investigating two approaches. One is rewriting the query in such a way that the policies are enforced in an appropriate manner. The second is query modification, where the policies are used in the "where" clause to modify the query.

Integrate SUN XACML Implementation into HDFS

Current Hadoop implementations enforce a very coarse-grained access control policy that permits or denies a principal access to essentially all system resources as a group, without distinguishing amongst resources. For example, users who are granted access to the Namenode (see Figure 6) may execute any program on any client machine, and all client machines have read and write access to all files stored on all clients. Such coarse-grained security is clearly unacceptable when data, queries, and the system resources that implement them are security-relevant, and when not all users and processes are fully trusted. Current work (Zhang, 2009) addresses this by implementing standard access control lists for Hadoop to constrain access to certain system resources, such as files; however, this approach has the limitation that the enforced security policy is baked into the operating system and therefore cannot be easily changed without modifying the operating system. We are enforcing more flexible and fine-grained access control policies on Hadoop by designing an In-lined Reference Monitor implementation of Sun XACML. XACML (Moses, 2005) is an OASIS standard for expressing a rich language of access control policies in XML. Subjects, objects, relations, and contexts are all generic and extensible in XACML, making it well-suited for a distributed environment where many different sub-policies may interact to form larger, composite, system-level policies. An abstract XACML enforcement mechanism is depicted in Figure 8.

Figure 8. XACML enforcement architecture

Untrusted processes in the framework access security-relevant resources by submitting a request to the resource's Policy Enforcement Point (PEP). The PEP reformulates the request as a policy query and submits it to a Policy Decision Point (PDP). The PDP consults any policies related to the request to answer the query. The PEP either grants or denies the resource request based on the answer it receives.
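
The request flow just described can be sketched as follows; the toy PDP below uses a hard-coded rule list in place of real XACML policy evaluation, and all role names and resource paths are hypothetical:

    # Toy PEP/PDP interaction mirroring the abstract XACML flow above: the PEP
    # turns a resource access into a policy query, the PDP evaluates it against
    # its policies, and the PEP enforces the decision. Policies are hypothetical.
    POLICIES = [
        # (subject role, action, resource prefix) entries that evaluate to Permit
        ("analyst", "read",  "/hdfs/medical/summaries/"),
        ("admin",   "write", "/hdfs/"),
    ]

    class PDP:
        def decide(self, role, action, resource):
            for p_role, p_action, prefix in POLICIES:
                if role == p_role and action == p_action and resource.startswith(prefix):
                    return "Permit"
            return "Deny"   # default-deny

    class PEP:
        def __init__(self, pdp):
            self.pdp = pdp
        def access(self, role, action, resource):
            decision = self.pdp.decide(role, action, resource)   # policy query
            if decision != "Permit":
                raise PermissionError(role + " may not " + action + " " + resource)
            return action + " on " + resource + " granted"

    pep = PEP(PDP())
    print(pep.access("analyst", "read", "/hdfs/medical/summaries/2010.csv"))
    # pep.access("analyst", "read", "/hdfs/medical/raw/records.csv")  # would raise PermissionError
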

While the PEP and PDP components of the enforcement mechanism are traditionally implemented at the level of the operating system or as trusted system libraries, we propose to achieve greater flexibility by implementing them in our system as In-lined Reference Monitors (IRMs). IRMs implement runtime security checks by in-lining those checks directly into the binary code of untrusted processes. This has the advantage that the policy can be enforced without modifying the operating system or system libraries. IRM policies can additionally constrain program operations that might be difficult or impossible to intercept at the operating system level. For example, memory allocations in Java are implemented as Java bytecode instructions that do not call any external program or library. Enforcing a fine-grained memory-bound policy as a traditional reference monitor in Java therefore requires modifying the Java virtual machine or the JIT compiler. In contrast, an IRM can identify these security-relevant instructions and inject appropriate guards directly into the untrusted code to enforce the policy.

Finally, IRMs can efficiently enforce history-based security policies, rather than merely policies that constrain individual security-relevant events. For example, our past work (Jones, 2009) has used IRMs to enforce fairness policies that require untrusted applications to share as much data as they request. This prevents processes from effecting denial-of-service attacks based on freeloading behavior. The code injected into the untrusted binary by the IRM constrains each program operation based on the past history of program operations rather than in isolation. This involves injecting security state variables and counters into the untrusted code.
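
Because the actual IRM operates on Java bytecode, a faithful example would be lengthy; the following Python sketch (ours) only conveys the idea of injected security state enforcing a history-based fairness policy of the kind described in Jones (2009), under which a process may not download more data than it has shared:

    # Sketch of a history-based policy guard of the kind an IRM would in-line into
    # untrusted code: a running count of bytes shared versus bytes downloaded, with
    # the check executed before every security-relevant download operation.
    class FairnessGuard:
        """Injected security state: the share/download history of this process."""
        def __init__(self):
            self.bytes_shared = 0
            self.bytes_downloaded = 0
        def on_share(self, n):
            self.bytes_shared += n
        def check_download(self, n):
            # History-based check: downloading more than has been shared is denied.
            if self.bytes_downloaded + n > self.bytes_shared:
                raise PermissionError("fairness policy violated: share before downloading")
            self.bytes_downloaded += n

    guard = FairnessGuard()        # state the IRM would add to the rewritten program
    guard.on_share(4096)           # calls the IRM inserts around share operations
    guard.check_download(1024)     # allowed: the history shows enough sharing
    try:
        guard.check_download(8192) # would exceed what has been shared
    except PermissionError as err:
        print("blocked:", err)
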
