Big Data in Cloud Computing: Features and Issues


Pedro Caldeira Neves (1,2), Bradley Schmerl (1), Jorge Bernardino (1,2,3) and Javier Cámara (1)

1 Carnegie Mellon University, Institute for Software Research, Pittsburgh, PA 15213, U.S.A.
2 ISEC – Superior Institute of Engineering of Coimbra, Polytechnic of Coimbra, 3030-190 Coimbra, Portugal
3 CISUC – Centre of Informatics and Systems of the University of Coimbra, FCTUC – University of Coimbra, 3030-290 Coimbra, Portugal

{pedrofilipeneves, schmerl}@gmail.com, jorge@isec.pt, jcmoreno@cs.cmu.edu

Keywords: big data, cloud computing, big data issues.

Abstract: The term big data arose with the explosive increase of global data, as a technology that is able to store and process big and varied volumes of data, providing both enterprises and science with deep insights over their clients/experiments. Cloud computing provides a reliable, fault-tolerant, available and scalable environment to harbour big data distributed management systems. Within the context of this paper we present an overview of both technologies and cases of success when integrating big data and cloud frameworks. Although big data solves many of our current problems, it still presents some gaps and issues that raise concern and need improvement. Security, privacy, scalability, data governance policies, data heterogeneity, disaster recovery mechanisms, and other challenges are yet to be addressed. Other concerns are related to cloud computing and its ability to deal with exabytes of information or address exaflop computing efficiently. This paper presents an overview of both cloud and big data technologies, describing the current issues with these technologies.

1 INTRODUCTION

In recent years, there has been an increasing demand to store and process more and more data, in domains such as finance, science, and government. Systems that support big data, and host them using cloud computing, have been developed and used successfully (Hashem et al., 2014).

Whereas big data is responsible for storing and processing data, the cloud provides a reliable, fault-tolerant, available and scalable environment so that big data systems can perform (Hashem et al., 2014). Big data, and in particular big data analytics, is viewed by both business and scientific areas as a way to correlate data, find patterns and predict new trends. Therefore, there is a huge interest in leveraging these two technologies, as they can provide businesses with a competitive advantage, and science with ways to aggregate and summarize data from experiments such as those performed at the Large Hadron Collider (LHC).

To be able to fulfil the current requirements, big data systems must be available, fault tolerant, scalable and elastic.

In this paper we describe both cloud computing and big data systems, focusing on the issues yet to be addressed. We particularly discuss security concerns when hiring a big data vendor: data privacy, data governance, and data heterogeneity; disaster recovery techniques; cloud data uploading methods; and how cloud computing speed and scalability poses a problem regarding exaflop computing.

Despite some issues yet to be improved, we present two examples that show how cloud computing and big data can work well together.

Our contribution to the current state of the art is an overview of the issues that have yet to be addressed or improved in both technologies.

The remainder of this paper is organized as follows: Section 2 provides a general overview of big data and cloud computing; Section 3 discusses and presents two examples that show how big data and cloud computing work well together, and especially how hiring a big data vendor may be a good choice so that organizations can avoid IT worries; Section 4 discusses the several issues to be addressed in cloud computing and big data systems; and Section 5 presents the discussion, conclusions and future work.

2 BIG DATA & CLOUD COMPUTING

The concept of big data became a major force of innovation across both academia and corporations. The paradigm is viewed as an effort to understand and get proper insights from big datasets (big data analytics), providing summarized information over huge data loads. As such, this paradigm is regarded by corporations as a tool to understand their clients, to get closer to them, find patterns and predict trends. Furthermore, big data is viewed by scientists as a means to store and process huge scientific datasets. This concept is a hot topic and is expected to continue to grow in popularity in the coming years.

Although big data is mostly associated with the storage of huge loads of data, it also concerns ways to process and extract knowledge from it (Hashem et al., 2014). The five different aspects used to describe big data (commonly referred to as the five "V"s) are Volume, Variety, Velocity, Value and Veracity (Sakr & Gaber, 2014):

- Volume describes the size of the datasets that a big data system deals with. Processing and storing big volumes of data is rather difficult, since it concerns: scalability, so that the system can grow; availability, which guarantees access to data and ways to perform operations over it; and bandwidth and performance.
- Variety concerns the different types of data from various sources that big data frameworks have to deal with.
- Velocity concerns the different rates at which data streams may get in or out of the system, and provides an abstraction layer so that big data systems can store data independently of the incoming or outgoing rate.
- Value concerns the true value of data (i.e., the potential value of the data regarding the information it contains). Huge amounts of data are worthless unless they provide value.
- Veracity refers to the trustworthiness of the data, addressing data confidentiality, integrity, and availability. Organizations need to ensure that the data, as well as the analyses performed on it, are correct.

Cloud computing is another paradigm, which promises theoretically unlimited on-demand services to its users. The cloud's ability to virtualize resources allows abstracting hardware, requiring little interaction with cloud service providers and enabling users to access terabytes of storage, high processing power, and high availability in a pay-as-you-go model (González-Martínez et al., 2015). Moreover, it transfers costs and responsibilities from the user to the cloud provider, boosting small enterprises for which getting started in the IT business represents a large endeavour, since the initial IT setup takes a big effort as the company has to consider the total cost of ownership (TCO), including hardware expenses, software licenses, IT personnel and infrastructure maintenance. Cloud computing provides an easy way to get resources on a pay-as-you-go model, offering scalability and availability, meaning that companies can easily negotiate resources with the cloud provider as required. Cloud providers usually offer three different basic services: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS):
- IaaS delivers infrastructure, which means storage, processing power, and virtual machines. The cloud provider satisfies the needs of the client by virtualizing resources according to the service level agreements (SLAs);
- PaaS is built atop IaaS and allows users to deploy cloud applications created using the programming and run-time environments supported by the provider. It is at this level that big data DBMSs are implemented;
- SaaS is one of the best-known cloud models and consists of applications running directly in the cloud provider.

These three basic services are closely related: SaaS is developed over PaaS, and ultimately PaaS is built atop IaaS.

From the general cloud services, other services such as Database as a Service (DBaaS) (Oracle, 2012), Big Data as a Service (BDaaS) and Analytics as a Service (AaaS) arose.

Since the cloud virtualizes resources in an on-demand fashion, it is the most suitable and compliant framework for big data processing, which through hardware virtualization creates a high processing power environment for big data.
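To make the pay-as-you-go, on-demand IaaS model above concrete, the sketch below requests virtual machines from a provider programmatically. It is only an illustration under stated assumptions: it happens to use the AWS boto3 SDK, and the machine image, instance type and region are hypothetical placeholders; any IaaS API with similar semantics could play the same role.

```python
# Minimal sketch: acquiring IaaS resources on demand (pay-as-you-go).
# Assumes AWS credentials are already configured; the AMI id and instance
# type below are hypothetical placeholders, not values from the paper.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # hypothetical machine image
    InstanceType="m5.xlarge",          # processing power chosen per workload
    MinCount=1,
    MaxCount=4,                        # scale out by asking for more instances
)

for instance in response["Instances"]:
    print(instance["InstanceId"], instance["State"]["Name"])
```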

3 BIG DATA IN THE CLOUD

Storing and processing big volumes of data requires scalability, fault tolerance and availability. Cloud computing delivers all these through hardware virtualization. Thus, big data and cloud computing are two compatible concepts, as the cloud enables big data to be available, scalable and fault tolerant.

Businesses regard big data as a valuable business opportunity. As such, several new companies, such as Cloudera, Hortonworks, Teradata and many others, have started to focus on delivering Big Data as a Service (BDaaS) or Database as a Service (DBaaS). Companies such as Google, IBM, Amazon and Microsoft also provide ways for consumers to consume big data on demand. Next, we present two examples, Nokia and RedBus, which discuss the successful use of big data within cloud environments.

3.1 Nokia

Nokia was one of the first companies to understand the advantage of big data in cloud environments (Cloudera, 2012). Several years ago, the company used individual DBMSs to accommodate each application requirement. However, realizing the advantages of integrating data into one application, the company decided to migrate to Hadoop-based systems, integrating data within the same domain and leveraging the use of analytics algorithms to get proper insights over its clients. As Hadoop uses commodity hardware, the cost per terabyte of storage was cheaper than for a traditional RDBMS (Cloudera, 2012).

Since Cloudera Distributed Hadoop (CDH) bundles the most popular open source projects in the Apache Hadoop stack into a single, integrated package, with stable and reliable releases, it embodies a great opportunity for implementing Hadoop infrastructures and transferring IT and technical concerns onto the vendor's specialized teams. Nokia regarded Big Data as a Service (BDaaS) as an advantage and trusted Cloudera to deploy a Hadoop environment that copes with its requirements in a short time frame. Hadoop, and in particular CDH, strongly helped Nokia to fulfil its needs (Cloudera, 2012).

3.2 RedBus

RedBus is the largest company in India specialized in online bus ticket and hotel booking. The company wanted to implement a powerful data analysis tool to gain insights over its bus booking service (Kumar, 2006). Its datasets could easily stretch up to 2 terabytes in size. The application would have to be able to analyse booking and inventory data across hundreds of bus operators serving more than 10,000 routes. Furthermore, the company needed to avoid setting up and maintaining a complex in-house infrastructure.

At first, RedBus considered implementing in-house clusters of Hadoop servers to process data. However, they soon realized it would take too much time to set up such a solution and that it would require specialized IT teams to maintain such an infrastructure. The company then regarded Google BigQuery as the perfect match for its needs, allowing it to:

- Know how many times consumers tried to find an available seat but were unable to do so due to bus overload;
- Examine decreases in bookings;
- Quickly identify server problems by analysing data related to server activity.

Moving towards big data brought RedBus business advantages. Google BigQuery armed RedBus with real-time data analysis capabilities at 20% of the cost of maintaining a complex Hadoop infrastructure (Kumar, 2006).
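Analyses such as the ones RedBus ran can be expressed as plain SQL over BigQuery. The sketch below uses the google-cloud-bigquery Python client to count failed seat searches per route; the project, dataset, table and column names are hypothetical illustrations, not RedBus's actual schema, and only show the kind of question such a service answers.

```python
# Minimal sketch: querying booking data in Google BigQuery.
# Dataset, table and column names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT route_id, COUNT(*) AS failed_searches
    FROM `my_project.bookings.seat_searches`
    WHERE seat_found = FALSE
    GROUP BY route_id
    ORDER BY failed_searches DESC
    LIMIT 10
"""

# Routes where consumers most often failed to find an available seat.
for row in client.query(query).result():
    print(row.route_id, row.failed_searches)
```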
As supported by the Nokia and RedBus examples, switching towards big data enables organizations to gain competitive advantage. Additionally, the BDaaS provided by big data vendors allows companies to leave the technical details to those vendors and focus on their core business needs.

4 BIG DATA ISSUES

Although big data solves many current problems regarding high volumes of data, it is a constantly changing area that is always in development and that still poses some issues. In this section we present some of the issues not yet addressed by big data and cloud computing.

As the amount of data grows at a rapid rate, keeping all data is physically cost-ineffective. Therefore, corporations must be able to create policies that define the life cycle and the expiration date of data (data governance). Moreover, they should define who accesses clients' data and for what purpose it is accessed. As data moves to the cloud, security and privacy become a concern that is the subject of broad research.

Big data DBMSs typically deal with lots of data from several sources (variety), and as such heterogeneity is also a problem that is currently under study. Other issues currently being investigated are disaster recovery, how to easily upload data onto the cloud, and exaflop computing.

Within this section we provide an overview of these problems.

4.1 Security

Cloud computing and big data security is a current and critical research topic (Popović & Hocenski, 2015). This problem becomes an issue for corporations when considering uploading data onto the cloud. Questions such as who is the real owner of the data, where the data is, who has access to it and what kind of permissions they have are hard to answer. Corporations that are planning to do business with a cloud provider should be aware of this and ask the following questions:

a) Who is the real owner of the data and who has access to it?

The cloud provider's clients pay for a service and upload their data onto the cloud. However, to which of the two stakeholders does the data really belong? Moreover, can the provider use the client's data? What level of access does it have, and for what purposes can it use the data? Can the cloud provider benefit from that data?

In fact, the IT teams responsible for maintaining the client's data must have access to data clusters. Therefore, it is in the client's best interest to grant restricted access to data, to minimize data access and guarantee that only authorized personnel access its data for a valid reason.

These questions seem easy to answer, but they should be well clarified before hiring a service. Most security issues usually come from inside the organization, so it is reasonable that companies analyse all data access policies before closing a contract with a cloud provider.

b) Where is the data?

Sensitive data that is considered legal in one country may be illegal in another country; therefore, for the sake of the client, there should be an agreement upon the location of the data, as it may be considered illegal in some countries and lead to prosecution.

The answers to these questions are based upon agreements (Service Level Agreements – SLAs); however, these must be carefully checked in order to fully understand the roles of each stakeholder and which policies the SLAs do and do not cover concerning the organization's data. This is typically something that must be well negotiated.

Concerning limiting data access, (Tu et al., 2013) and (Popa et al., 2011) came up with an effective way to encrypt data and run analytical queries over the encrypted data. This way, data access is no longer a problem, since both data and queries are encrypted. Nevertheless, encryption comes with a cost, which often means higher query processing times.
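The systems cited above rely on specialised encryption schemes so that queries can run over encrypted data. The toy sketch below conveys only the basic intuition, not the actual techniques of (Tu et al., 2013) or (Popa et al., 2011): values are stored under a deterministic keyed transformation (an HMAC), so equality queries can be answered without the server ever seeing the plaintext identifiers. Key, field and value names are hypothetical.

```python
# Simplified illustration of querying encrypted data: identifiers are stored
# under a deterministic keyed transformation (HMAC), so an equality query can
# be answered by transforming the query term the same way.
# This is NOT the scheme used by Tu et al. or Popa et al.; it only supports
# equality matching on the sealed field.
import hmac
import hashlib

KEY = b"secret-key-held-by-the-client"   # hypothetical client-side key

def seal(value: str) -> str:
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

# The "server" stores only sealed identifiers.
encrypted_store = [{"client_id": seal("alice"), "spend": 120},
                   {"client_id": seal("bob"), "spend": 80}]

def query_by_client(name: str):
    token = seal(name)                   # the client seals the query term too
    return [r for r in encrypted_store if r["client_id"] == token]

print(query_by_client("alice"))
```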
4.2 Privacy

The harvesting of data and the use of analytical tools to mine information raise several privacy concerns. Ensuring data security and protecting privacy has become extremely difficult as information is spread and replicated around the globe. Analytics often mine users' sensitive information such as their medical records, energy consumption, online activity, supermarket records, etc. This information is exposed to scrutiny, raising concerns about profiling, discrimination, exclusion and loss of control (Tene & Polonetsky, 2012). Traditionally, organizations used various methods of de-identification (anonymization or encryption of data) to distance data from real identities. However, in recent years it was shown that even when data is anonymized, it can still be re-identified and attributed to specific individuals (Tene & Polonetsky, 2012). One way to solve this problem is to treat all data as personally identifiable and subject to a regulatory framework. However, doing so might discourage organizations from using de-identification methods and, therefore, increase the privacy and security risks of accessing data.

Privacy and data protection laws are premised on individual control over information and on principles such as data and purpose minimization and limitation. Nevertheless, it is not clear that minimizing information collection is always a practical approach to privacy. Nowadays, privacy approaches to processing activities seem to be based on user consent and on the data that individuals deliberately provide.

Privacy is undoubtedly an issue that needs further improvement, as systems store huge quantities of personal information every day.
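As a deliberately simplistic illustration of the de-identification methods mentioned above, the sketch below replaces direct identifiers with salted pseudonyms and drops other identifying fields before analysis. The field names and salt are hypothetical, and, as the section notes, records treated this way may still be re-identifiable.

```python
# Minimal de-identification sketch: direct identifiers are replaced with
# salted hashes (pseudonyms) and free-text identifiers are dropped.
# Field names are hypothetical; this does not guarantee anonymity.
import hashlib

SALT = b"per-dataset-secret-salt"        # hypothetical salt kept by the data owner

def pseudonymize(record: dict) -> dict:
    out = dict(record)
    out["user_id"] = hashlib.sha256(SALT + record["user_id"].encode()).hexdigest()
    out.pop("name", None)                # drop direct identifiers
    out.pop("address", None)
    return out

raw = {"user_id": "u-42", "name": "Alice", "address": "Main St. 1", "kwh": 310}
print(pseudonymize(raw))                 # pseudonymous id plus the analytic field
```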

4.3 Heterogeneity

Big data concerns big volumes of data but also different velocities (i.e., data comes at different rates depending on its source output rate and network latency) and great variety. The latter comprehends very large and heterogeneous volumes of data coming from several autonomous sources. Variety is one of the "major aspects of big data characterization" (Majhi & Shial, 2015), which is triggered by the belief that storing all kinds of data may be beneficial to both science and business.

Data comes to big data DBMSs at different velocities and in different formats from various sources. This is because different information collectors prefer their own schemata or protocols for data recording, and the nature of different applications also results in diverse data representations (Wu et al., 2014). Dealing with such a wide variety of data and such different velocity rates is a hard task that big data systems must handle. This task is aggravated by the fact that new types of file are constantly being created without any kind of standardization. Thus, providing a consistent and general way to represent and explore complex and evolving relationships in this data still poses a challenge.

4.4 Data Governance

The belief that storage is cheap, and that its cost is likely to decline further, is true regarding hardware prices. However, a big data DBMS also incurs other expenses, such as infrastructure maintenance, energy, and software licenses (Tallon, 2013). All these expenses combined comprise the total cost of ownership (TCO), which is estimated to be seven times higher than the hardware acquisition costs. Given that the TCO increases in direct proportion to the growth of big data, this growth must be strictly controlled. Recall that "Value" (one of the big data Vs) stands to ensure that only valuable data is stored, since huge amounts of data are useless if they comprise no value.

Data governance addresses this problem by creating policies that define for how long data is viable. The concept consists of practices and organizational policies that describe how data should be managed through its useful economic life cycle. These practices comprise three different categories:

1. Structural practices identify key IT and non-IT decision makers and their respective roles and responsibilities regarding data ownership, value analysis and cost management (Morgan Kaufmann, 2013).
2. Operational practices consist of the way data governance policies are applied. Typically, these policies span a variety of actions such as data migration, data retention, access rights, cost allocation, and backup and recovery (Tallon, 2013).
3. Relational practices formally describe the links between the CIO, business managers and data users in terms of knowledge sharing, value analysis, education, training and strategic IT planning.

Data governance is a general term that applies to organizations with huge datasets, and it defines policies to retain valuable data as well as to manage data access throughout its life cycle. It is an issue to address carefully: if governance policies are not enforced, it is most likely that they will not be followed. However, there are limits to how much value data governance can bring, as beyond a certain point stricter data governance can have counterproductive effects.
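As an illustration of the data life cycle policies described above, the following sketch expires records once the retention period defined for their category has elapsed. The categories and retention periods are hypothetical; real governance policies also cover access rights, cost allocation and backups, as noted in the operational practices.

```python
# Minimal data-governance sketch: expire records whose retention period
# (defined per data category) has elapsed. Categories and periods are
# hypothetical examples, not values from the paper.
from datetime import datetime, timedelta

RETENTION_DAYS = {"clickstream": 90, "transactions": 365 * 7, "logs": 30}

def expired(record: dict, now: datetime) -> bool:
    limit = timedelta(days=RETENTION_DAYS[record["category"]])
    return now - record["created_at"] > limit

records = [
    {"id": 1, "category": "logs", "created_at": datetime(2015, 1, 10)},
    {"id": 2, "category": "transactions", "created_at": datetime(2014, 6, 1)},
]

now = datetime(2015, 9, 1)
keep = [r for r in records if not expired(r, now)]
print([r["id"] for r in keep])           # only records still within retention
```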
4.5 Disaster Recovery

Data is a very valuable business asset, and losing data will certainly result in losing value. In case of emergency or hazardous accidents such as earthquakes, floods and fires, data losses need to be minimal. To fulfil this requirement, in case of any incident, data must be quickly available with minimal downtime and loss. However, although this is a very important issue, the amount of research in this particular area is relatively low (Subashini & Kavitha, 2011), (Wood et al., 2010), (Chang, 2015).

For big corporations it is imperative to define a disaster recovery plan – as part of the data governance plan – that not only relies on backups to restore data but also on a set of procedures that allow quick replacement of the lost servers (Chang, 2015).

From a technical perspective, the work described in (Chang, 2015) presents a good methodology, proposing a "multi-purpose approach, which allows data to be restored to multiple sites with multiple methods", ensuring a recovery percentage of almost 100%. The study also states that, usually, data recovery methods use what the author calls a "single-basket approach", which means there is only one destination from which to secure the restored data.

As the loss of data will potentially result in the loss of money, it is important to be able to respond efficiently to hazardous incidents. Successfully deploying big data DBMSs in the cloud and keeping them always available and fault tolerant may strongly depend on disaster recovery mechanisms.
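To contrast a "single-basket" backup with a multi-destination one, the sketch below copies the same snapshot to several recovery sites, so losing one site does not lose the data. The paths are hypothetical placeholders for off-site storage mounts, and this only illustrates the intuition behind (Chang, 2015), not its actual multi-purpose method.

```python
# Minimal sketch of avoiding a "single-basket" backup: the same snapshot is
# replicated to several destinations. Paths are hypothetical placeholders.
import shutil
from pathlib import Path

DESTINATIONS = [Path("/mnt/site-a/backups"), Path("/mnt/site-b/backups"),
                Path("/mnt/cloud-bucket/backups")]

def replicate(snapshot: Path) -> list:
    copies = []
    for dest in DESTINATIONS:
        dest.mkdir(parents=True, exist_ok=True)
        copies.append(shutil.copy2(snapshot, dest / snapshot.name))
    return copies

# Example use: replicate today's database snapshot to all recovery sites.
# replicate(Path("/var/backups/db-2015-09-01.dump"))
```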

4.6 Other Problems

The current state of the art of cloud computing, big data, and big data platforms in particular, prompts some other concerns. Within this section we discuss data transfer onto the cloud; exaflop computing, which presents a major concern nowadays; and scalability and elasticity issues in cloud computing and big data:

a) Transferring data onto a cloud is a very slow process, and corporations often choose to physically send hard drives to the data centres so that data can be uploaded. However, this is neither the most practical nor the safest solution for uploading data onto the cloud. Through the years there has been an effort to improve and create efficient data uploading algorithms that minimize upload times and provide a secure way to transfer data onto the cloud (Zhang et al., 2013); however, this process still remains a major bottleneck.

b) Exaflop computing (Geller, 2011), (Schilling, 2014) is one of today's problems and the subject of many discussions. Today's supercomputers and clouds can deal with petabyte-scale datasets; however, dealing with exabyte-scale datasets still raises lots of concerns, since high performance and high bandwidth are required to transfer and process such huge volumes of data over the network. Cloud computing may not be the answer, as it is believed to be slower than supercomputers since it is restrained by the existing bandwidth and latency. High performance computers (HPCs) are the most promising solution; however, the annual cost of such a computer is tremendous. Furthermore, there are several problems in designing exaflop HPCs, especially regarding efficient power consumption. Here, solutions tend to be GPU based instead of CPU based. There are also problems related to the high degree of parallelism needed among hundreds of thousands of processing units, and the processing of big data and analytics at this scale poses yet another problem to resolve.

c) Scalability and elasticity in cloud computing, and in particular in big data management systems, is a theme that needs further research, as current systems hardly handle data peaks automatically. Most of the time, scalability is triggered manually rather than automatically, and the state of the art in automatically scalable systems shows that most algorithms are reactive or proactive and frequently explore scalability only from the perspective of better performance. However, a proper scalable system would allow both manual and automatic, reactive and proactive, scalability based on several dimensions such as security, workload rebalancing (i.e., the need to rebalance workload) and redundancy (which would enable fault tolerance and availability). Moreover, current data rebalancing algorithms are based on histogram building and load equalization (Mahesh et al., 2014). The latter ensures an even load distribution across servers. However, building histograms from each server's load is time and resource expensive, and further research is being conducted in this field to improve these algorithms. A minimal reactive policy of this kind is sketched below.
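The sketch that follows shows the kind of minimal reactive policy referred to in item c): servers are added or removed when the average load crosses fixed thresholds. The thresholds and the provisioning calls are hypothetical; a proper system would also act proactively and consider dimensions such as security and redundancy.

```python
# Minimal reactive-scaling sketch: grow or shrink the cluster when the
# average load crosses fixed thresholds. Thresholds and the provisioning
# functions are hypothetical placeholders for a cloud provider's API.
SCALE_OUT_AT = 0.80   # average utilisation above which we add a server
SCALE_IN_AT = 0.30    # average utilisation below which we remove one

def rebalance(server_loads):
    avg = sum(server_loads) / len(server_loads)
    if avg > SCALE_OUT_AT:
        return "add server"       # e.g. provider.add_instance()
    if avg < SCALE_IN_AT and len(server_loads) > 1:
        return "remove server"    # e.g. provider.remove_instance()
    return "no action"

print(rebalance([0.90, 0.85, 0.95]))   # add server
print(rebalance([0.20, 0.10]))         # remove server
```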
4.7 Research Challenges

As discussed in Section 3, cloud and big data technologies work very well together. Even though the partnership between these two technologies has been established, both still pose some challenges.

Table 1 summarizes the issues of big data and cloud computing nowadays. The first column specifies the existing issues, the second describes the existing solutions, and the remaining entries present the advantages and disadvantages of each solution.

Table 1. Big data issues.

Security
  Existing solutions: based on SLAs and data encryption.
  Advantages: data is encrypted.
  Disadvantages: querying encrypted data is time-consuming.

Privacy
  Existing solutions: de-identification; user consent.
  Advantages: provides a reasonable degree of privacy or transfers responsibility to the user.
  Disadvantages: it has been shown that most de-identification mechanisms can be reverse engineered.

Heterogeneity
  Existing solutions: one of the characteristics of big data systems is the ability to deal with different data coming at different velocities.
  Advantages: the major types of data are covered.
  Disadvantages: it is difficult to handle such a variety of data and such different velocities.

Data Governance
  Existing solutions: data governance documents.
  Advantages: specify the way data is handled; specify data access policies; specify roles; specify the data life cycle.
  Disadvantages: the data life cycle is not easy to define; enforcing data governance policies too strictly can lead to counterproductive effects.

Disaster Recovery
  Existing solutions: recovery plans.
  Advantages: specify the data recovery locations and procedures.
  Disadvantages: normally there is only one destination from which to secure data.

Data Uploading
  Existing solutions: send HDDs to the cloud provider; upload data through the Internet.
  Advantages: physically sending the data to the cloud provider is quicker than uploading it, but it is much more insecure.
  Disadvantages: physically sending data to the cloud provider is dangerous, as HDDs can suffer damage during the trip; uploading data through the network is time-consuming and, without encryption, can be insecure.

High Data Processing (Exabyte datasets)
  Existing solutions: cloud computing; HPCs.
  Advantages: cloud computing is not as cost-expensive as HPCs, but HPCs are believed to handle exabyte datasets much better.
  Disadvantages: HPCs are very expensive and their total cost over a year is hard to sustain; on the other hand, it is believed that the cloud cannot cope with the requirements of such huge datasets.

Scalability
  Existing solutions: scalability exists at the three levels of the cloud stack; at the platform level there is horizontal (sharding) and vertical scalability.
  Advantages: scalability allows the system to grow on demand.
  Disadvantages: scalability is mainly manual and very static; most big data systems must be elastic to cope with data changes.

Elasticity
  Existing solutions: there are several elasticity techniques, such as live migration, replication and resizing.
  Advantages: elasticity gives the system the capability of accommodating data peaks.
  Disadvantages: most load variation assessments are made manually instead of being automated.

Concerning the existing problems, we outline some of the possible advances in the next few years:

- Security and privacy can be addressed using data encryption. However, a new generation of systems must ensure that data is accessed quickly and that encryption does not affect processing times too badly;
- Big data variety can be addressed by using data standardization. This, we believe, is the next step to minimize the impact of heterogeneity;
- Data governance and data recovery plans are difficult to manage and implement, but as big data becomes a de facto technology, companies are starting to understand the need for such plans;
- New and secure QoS (quality of service) based data uploading mechanisms may be the answer to ease data uploading onto the cloud;
- Exaflop computing is a major challenge that involves government funding and is in governments' best interest. The best solutions so far use HPCs and GPUs;
- Scalability and elasticity techniques exist and are broadly used by several big data vendors such as Amazon and Microsoft. The major concern lies in developing fully automatic reactive and proactive systems that are capable of dealing with load requirements automatically.

5 CONCLUSIONS

With data increasing on a daily basis, big data systems, and in particular analytic tools, have become a major force of innovation that provides a way to store, process and get information over petabyte datasets. Cloud environments strongly leverage big data solutions by providing fault-tolerant, scalable and available environments to big data systems.

Although big data systems are powerful systems that enable both enterprises and science to get insights over data, there are some concerns that need further investigation. Additional effort must be put into developing security mechanisms and standardizing data types. Another crucial element of big data is scalability, which in commercial systems is mostly manual instead of automatic; further research must be carried out to tackle this problem. In this particular area, we are planning to use adaptation mechanisms in order to develop a solution for implementing elasticity along several dimensions of big data systems running on cloud environments. The goal is to investigate the mechanisms that adaptable software can use to trigger scalability at different levels of the cloud stack, thus accommodating data peaks in an automatic and reactive way.

Within this paper we provide an overview of big data in cloud environments, highlighting its advantages and showing that both technologies work very well together, but also presenting the challenges faced by the two technologies.

ACKNOWLEDGMENTS

This research is supported by Early Bird project funding, CMU Portugal, Applying Stitch to Modern Distributed Key-Value Stores, and was hosted by Carnegie Mellon University under the program for CMU-Portugal undergraduate internships.

REFERENCES

Chang, V., 2015. Towards a big data system disaster recovery in a Private cloud. Ad Hoc Networks, 000, pp. 1–18.

Cloudera, 2012. Case Study Nokia:
