Cloud Databases: A Paradigm Shift In Databases - IJCSI

Transcription

IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012ISSN (Online): 1694-0814www.IJCSI.org77Cloud Databases: A Paradigm Shift in DatabasesIndu Arora1 and Dr. Anu Gupta21Department of Computer Science and Application,MCM DAV College for Women, Chandigarh2Department of Computer Science and Application,Panjab University, ChandigarhAbstractRelational databases ruled the Information Technology (IT)industry for almost 40 years. But last few years have seen seachanges in the way IT is being used and viewed. Stand aloneapplications have been replaced with web-based applications,dedicated servers with multiple distributed servers and dedicatedstorage with network storage. Cloud computing has become areality due to its lesser cost, scalability and pay-as-you-go model.It is one of the biggest changes in IT after the rise of World WideWeb. Cloud databases such as Big Table, Sherpa and SimpleDBare becoming popular. They address the limitations of existingrelational databases related to scalability, ease of use anddynamic provisioning. Cloud databases are mainly used for dataintensive applications such as data warehousing, data mining andbusiness intelligence. These applications are read-intensive,scalable and elastic in nature. Transactional data managementapplications such as banking, airline reservation, online ecommerce and supply chain management applications are writeintensive. Databases supporting such applications require ACID(Atomicity, Consistency, Isolation and Durability) properties, butthese databases are difficult to deploy in the cloud. The goal ofthis paper is to review the state of the art in the cloud databasesand various architectures. It further assesses the challenges todevelop cloud databases that meet the user requirements anddiscusses popularly used Cloud databases.Keywords: Cloud computing; Cloud Databases; DatabaseArchitectures.1. IntroductionInformation Technology (IT) department of anyorganization is responsible for providing reliablecomputing, storage, backup and network facilities at thelowest feasible cost. Huge investment in IT infrastructureworks as a hindrance in its adoption especially for smallscale organizations. Cash-strapped organizations look foralternatives which can reduce their capital investmentsinvolved in purchasing and maintaining IT hardware andsoftware so that they can get maximum benefits of IT.Cloud computing (CC) becomes a natural and ideal choicefor such organizations and customers. Cloud computingtakes benefit of many technologies such as serverconsolidation, huge and faster storage, grid computing,virtualization, N-tier architecture and robust networks. Itdelivers highly scalable and expensive infrastructure withminimal set up and negligible maintenance cost. Itprovides IT-related services such as Software-as-a-Service,Development Platforms-as-a-Service and Infrastructure-asa-Service over the network on-demand anytime fromanywhere on the basis of “pay-as-you-go" model. It is afast growing concept changing the IT related perceptionsof its users. Elasticity, scalability, high availability, priceper-usage and multi-tenancy are the main features ofCloud computing. It reduces the cost of using expensiveresources at the provider’s end due to economies of scale.Quick provisioning and immediate deployment of latestapplications at lesser cost are the benefits which forcepeople to adopt Cloud computing.Cloud computing has brought a paradigm shift not in thetechnology landscape, but also in the database landscape.With more usage of Cloud computing, demand forprovisioning of database services has raised. Provisioningof Cloud databases is known as Database-as-a-Service inCloud terminology. The main objective of the paper is toexplore the trends in Cloud databases and analyze thepotential challenges to develop these databases. Thestructure of paper has been divided into six sections.Second section describes Cloud databases. Third providesan overview of common types of databases. Section 4discusses major challenges to develop cloud databases.Fifth summarizes existing cloud databases followed byconclusions.2. Cloud DatabasesMassive growth in digital data, changing data storagerequirements, better broadband facilities and Cloudcomputing led to the emergence of cloud databases [1].Cloud Storage, Data as a service (DaaS) and Database as aservice (DBaaS) are the different terms used for datamanagement in the Cloud. They differ on the basis of howdata is stored and managed. Cloud storage is virtualstorage that enables users to store documents and objects.Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012ISSN (Online): 1694-0814www.IJCSI.orgShared-nothing Storage architecture involves datapartitioning which splits the data into independent sets.These data sets are physically located on different databaseservers. Each server processes and maintains its piece ofthe database exclusively which makes shared-nothingdatabases easily scalable. Due to inherent scalability,applications designed to work on shared-nothing storagearchitecture are suitable for Cloud. But data partitioningused in this architecture does not work well with cloud. Itis very difficult to virtualize a shared-nothing database asit becomes very complex and difficult to maintain due toShared-disk Database Architecture treats the wholedatabase as a single large piece of database stored on aStorage Area Network (SAN) or Network AttachedStorage (NAS) storage that is shared and accessiblethrough network by all nodes. It requires fewer low-costservers. It is easy to virtualize them as each computeserver is identical. It separates the compute from thestorage as any number of compute instances may work onthe entire data. Middleware is not required to route datarequests to specific servers as each node/client has accessto all of the data. Hence, it is more suitable for On-LineTransaction Processing applications. Oracle RAC, IBMDB2 pureScale, Sybase etc. support this architecture [11].AnalyticalMaintenanceCostUseful NothingSharedDiskACIDTable 1: Comparison of shared-nothing and shared disk storagearchitecturesScalability2.1Shared-nothing Storage Architecture2.2 Shared-disk Database ArchitecturePartitioningCloud database is a database delivered to users on demandthrough the Internet from a cloud database provider'sservers.Cloud databases provide scalability, highavailability, optimized resource allocation and multitenancy. A cloud database can be a traditional databasesuch as MySQL and SQL Server. These databases can beinstalled, configured and maintained on a Cloud server bythe user himself. This option is popularly called the “Doit-Yourself” approach (DIY). Few providers offer readymade database services such as Xeround’s MySQL [4]. In“Do-it-Yourself” approach, the developers manuallyensure reliability and elasticity service. Selection of aDBaaS solution reduces the complexity and cost ofrunning one’s own database. It spares the developer fromthe hassles of tedious management tasks of the database.Cloud databases provide improved availability, scalability,performance and flexibility at lesser price. ConventionalDBMS (Data Base Management System) deals withstructured data which is held in databases along with itsmetadata. While Cloud databases can be used forunstructured, semi-structured data or structured data. Datastored in files of various types where the metadata waseither unavailable or incomplete is called unstructured data.Cloud databases are able to support changing storagerequirements of Internet-savvy users who deal more withunstructured data, user created content such as documentsand photos. Shared-nothing and shared-disk are twowidely-used storage architectures in database systems.data partitioning. It needs a piece of middleware to routedatabase requests to the appropriate server. As moreservers are added, data has to be repartitioned. Datapartitioning should be done very carefully, otherwise datashipping (passing of the information from one machine tothe other machine for processing) and joining will becomedifficult. More data shipping means more latency andnetwork bandwidth bottlenecks. These issues reducedatabase performance badly. Shared-nothing Storagearchitecture is also used mainly for data-intensiveworkloads. IBM and Oracle released their shared-nothingimplementation of DB2 in 1990 and September 2008respectively for scalable analytical applications of datawarehouses. Amazon’s SimpleDB, Hadoop DistributedFile System and Yahoo’s PNUTS also implement sharednothing architecture [5-7].ArchitectureDropbox, iCloud etc. are popular cloud storage services[2]. DaaS allows user to store data at a remote diskavailable through Internet. It is used mainly for backuppurposes and basic data management. Cloud storagecannot work without basic data management services. So,these two terms are used interchangeably. DBaaS is onestep ahead. It offers complete database functionality andallows users to access and store their database at remotedisks anytime from any place through Internet. Amazon’sSimpleDB, Amazon RDS, Google’s BigTable, Yahoo’sSherpa and Microsoft’s SQL Azure Database are thecommonly used databases in the Cloud [3].78Note: N-No, Y- Yes3. A Comparative Study of RelationalDatabases and NoSQL DatabasesIn the earlier stages of computerization, there was moredemand for transaction processing applications. As thedatabase industry matured and people accepted computersas part and parcel of their lives, analytical applicationsbecame the focus of enterprises. Now they wanted to storeCopyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012ISSN (Online): 1694-0814www.IJCSI.orgdata not only for transaction processing, but to analyzeconsumer trends and business needs. Enterprises want touse analytical knowledge to enhance their business value.So, enterprise applications are broadly categorized intotransactional and analytical applications. Relationaldatabases played dominant role in handling transactionaldata. Later on, industry leaders like IBM and Oracle addedanalytical capabilities to their relational databases for datamining applications. In the mean time, number ofdatabases such as Column databases, Object-orienteddatabases etc. came into market [12-13]. But they couldnot overpower the relational databases. Then Internetrevolution and web 2.0 applications started producingmassive sparse and unstructured data. RDBMS are notsuitable for handling massive sparse data sets with looselydefined schemas. The need to store and process such bigdata defined the role of NoSQL databases in the databasetechnology as Cloud databases. RDBMs and NOSQLdatabases are briefly discussed as follows:3.1 Relational DatabasesThe concept of relational databases is forty years old. Itworked best in the era of hardware limits such as smalldisk space, little memory, slow processor speed andlimited networking. It has rigid database architecture basedon tables, columns, indexes, relationships and schema.Data is stored in tables with predefined complexrelationships. Column indexes are used for faster search.Highly skilled Developers and DBAs are required fordatabase design and maintenance. Conventionally, theyare used for transactional databases. They include detailsat the lowest granularity. They contain sensitive andoperational data such as employee data and credit cardnumbers to handle critical business operations. Thesedatabases are not well suited for Cloud environment asthey do not support full content data search and aredifficult to scale beyond a limit [14-15].79They have emerged to address the requirements of datamanagement in the cloud as they follow BASE (BasicallyAvailable, Soft state, eventually consistent) in contrast tothe ACID guarantees. So, they are not suitable for updateintensive transaction applications. They provide highavailability at the cost of consistency [16-17].Table 2: Comparison of RDBMS and NoSQL databasesRDBMSData within a database istreated as a “whole”RDBMS support centrallymanaged architecture.Theyarestaticallyprovisioned.It is difficult to scale them.They provide SQL to querydataACID(Atomicity,Consistency, Isolation .ORACLE, MySQL, SQLServer etc. are popularRDBMS.NoSQL DatabasesEach entity is considered anindependent unit of data andcan be freely moved fromone machine to the otherThey follow ned.They are easily scalable.They use API to query data(not feature rich as SQL).Follow BASE (BasicallyAvailable, Soft state,Eventually consistent); Theuser accesses are guaranteedonly at a single-key level.They support web2.0applications.Amazon SimpleDB,Yahoo’s PNUTS, CouchDBetc. are popular NoSQLDatabases.4. Challenges to Develop Cloud DatabasesCloud DBMSs should support features of Cloudcomputing as well as of traditional databases for wideracceptability, which is a Hercules’s task. The potentialchallenges associated with cloud databases are as follows:3.2 NoSQL databasesNoSQL means ‘Not Only SQL’ or ‘Not Relational’. ANoSQL database is defined as a non-relational, sharednothing, horizontally scalable database without ACIDguarantees. NoSQL implementations are classified furtherinto key/value stores, document stores, object stores, tuplestores, column stores and graph stores. They can store andretrieve unstructured, semi-structured and structured data.They are item-oriented. A domain can be compared to atable and contains items having different schemas. Theitems are identified by keys. All data relevant to aparticular item is stored within that item. It improvesscalability of these databases as complex joins are notrequired to regroup data from multiple tables. They havethe ability to replicate and distribute data over manyservers. They are dynamically provisioned on demand.Fig. 1. Possible issues in makeup of cloud databases.Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012ISSN (Online): 1694-0814www.IJCSI.org804.1 Scalability4.6 Database Security and PrivacyThe main feature of Cloud paradigm is scalability whichimplies that resources can be scaled-up or scaled-downdynamically without causing any interruption in theservice. It puts challenges on developers to developdatabases in such a way that they can support and handleunlimited number of concurrent users and data growth.Enterprises deal with huge volumes of data. Addingadditional servers on demand solve the problem ofscalability, only if the process and workload areparallelizable. Scalability requirement of transactional datais lesser in comparison to analytical data.Data physically stored in a particular country, is subject tolocal rules and regulations of that country. The US PatriotAct allows the government to demand access to the datastored on any computer. Amazon S3 only allows acustomer to choose between US and EU data storageoptions. If data is encrypted using a key not located at thehost, then it is little safer. Risks are involved in storingtransactional data on an untrusted host. Sensitive data isencrypted before being uploaded to the cloud to preventunauthorized access. Any application running in the cloudshould not have the ability to directly decrypt the databefore accessing it. Providing security and privacy todifferent databases on the same hardware is also a bigchallenge.4.2 High availability and Fault ToleranceAvailability of database implies that database is up andrunning 365 X 24 X 7. It becomes necessary to replicatedata across large geographic distances to provide high dataavailability, durability and high levels of fault tolerance.Amazon’s S3 cloud storage service replicates data across“regions” and “availability zones”.4.3Heterogeneous EnvironmentUsers want to access diverse applications from differentlocations and devices such as mobiles, tablets, notepadsand computers. Since user applications and data(structured or unstructured) vary in nature, it becomesdifficult to predefine how users will use the system.4.4 Data Consistency and IntegrityData integrity is the most critical requirement of allbusiness applications and is maintained through databaseconstraints. The lack of data integrity results in unexpectedoutputs. Cloud databases follow BASE (BasicallyAvailable, Soft state, Eventually consistent) in contrast tothe ACID (Atomicity, Consistency, Isolation andDurability) guarantees. So, Cloud databases supporteventual consistency due to replication of data at multipledistributed locations. It becomes difficult to maintain theconsistency of a transaction in a database which changestoo quickly especially in the case of transactional data.Developers need to follow BASE approach cautiously.They should not compromise data integrity in their overenthusiasm to move to cloud databases.4.5 Simplified Query InterfaceCloud Database is distributed. Querying distributeddatabase is a major challenge that cloud developers face. Adistributed query has to access multiple nodes of clouddatabase. There should be a simplified and standardizedquery interface for querying the database.4.7 Data Portability and InteroperabilityVendor lock-in is a key obstacle in the adoption of clouddatabases. Users want the liberty to move from one vendorto another without any hassles. It can be avoided throughportable and interoperable components. Data Portability isthe ability to run components written for one cloudprovider in another cloud provider’s environment.Interoperability is the ability to write a piece of code that isflexible enough to work with multiple cloud providers,regardless of the differences between them. Currently,there are no standard API to store and access clouddatabases. Legacy applications should be able to workwith cloud databases. Cloud databases should also be ableto interface with business intelligence tools alreadyavailable in the market [18-19].5. Industry Practices in Cloud DatabasesCloud databases are designed for low-cost commodityhardware. They scale out easily by distributing thedatabase across multiple hosts/nodes as the load increases.NoSQL databases have become synonym for clouddatabases. Few commonly used cloud databases in theindustry are described below.5.1 Amazon Simple Storage Service (S3) andDatabasesAmazon S3 is Internet based storage service. It storesobjects up to 5GB in size along with 2 KB of Meta data foreach object. Objects are organized by buckets. Eachbucket is owned by an AWS (Amazon Web Services)account. The buckets are identified by a unique, userassigned key. Buckets and objects are created, listed andretrieved using either a REST or SOAP interface. Amazonoffers MySQL, Oracle and Microsoft SQL Server virtualinstances of databases for deployment in its AmazonCopyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012ISSN (Online): 1694-0814www.IJCSI.orgElastic Compute Cloud (EC2) cloud. Even third partymanagement providers like Elastra and Rightscale offerMySQL images. Scaling is not easy with MySQL but itcan be done. EnterpriseDB’s Postgres Plus AdvancedServer, a transactional database also runs in Amazon’scloud. Earlier Storage was tied to the EC2 instance.Termination of instance means loss of data associated withthat instance. With Amazon’s Elastic Block Store (EBS),user can choose to allocate storage volumes that persistreliably and independently from EC2 instances. AmazonRelational Database Service (RDS) is also a web servicethat makes it easy to set up and scale a relational databasein the Cloud. It is designed for developers or businessesthat require the full features and capabilities of a relationaldatabase. It gives access to the capabilities of a MySQL,Oracle or SQL Server database engines running onAmazon RDS database instance [20-21].5.2 Amazon SimpleDBIt is a highly available, scalable and flexible non-relationaldata store. It works closely with Amazon S3 and AmazonEC2 to provide the ability to store, process and query datasets in the cloud. It is NoSQL and name/value pair datastore. It offers a simple interface of Get, Post, Delete andQuery to run queries on structured data. It is comprised ofdomains, items, attributes and values. A domain iscomparable to a table or a worksheet in a spreadsheet e.g.employee table. Domains are further comprised of items(rows) and items are described by attribute-value pairs.Unlike a spreadsheet, it allows cells to contain multiplevalues per entry. Each item can have its own unique set ofassociated attributes(e.g. item “1” might have attributes“Basic” and “tax” whereas item “2” may have attributes“Basic”, “tax” and “Saving”. It provides scalability byallowing user to partition the workload across multipledomains. Initially, user is allocated a maximum of 250domains. User can choose between consistency andeventual consistency. But with complex applications, it isdifficult to maintain data integrity. It allows user toencrypt data before saving it. It does not decode the databut query directly on the strings stored. It automaticallymanages replication, indexing of data and performancetuning [22].5.3 Google App's BigtableIt is a distributed storage system based on GFS (GoogleFile system) for structured data. It implements a replicatedshared-nothing database. It has been successfully deployedin many Google products like Google app engine. It allowsa more complex data store than SimpleDB. It allowsentities and properties comparable to tables and columns.One can create an entity by creating a python object. TheGoogle Datastore API also allows a get, put, delete formatfor accessing data. It also offers a non-SQL language81called GQL () which is not as feature rich as SQL. Selectstatements in GQL can be performed on one table only.GQL does not support the “Join” statement [23, 24].5.4 MapReduceIt is an easy-to-use programming model that supportsparallel architecture. It is very scalable and works in adistributed manner. It is useful for massive data processing,large scale search and data analysis in the cloud. Itprovides an abstraction by defining a “mapper” and a“reducer”. The “mapper” is applied to every inputkey/value pair to generate an arbitrary number ofintermediate key/value pairs. The “reducer” is applied toall values associated with the same intermediate key togenerate output key/value pairs. It has sufficientexpression capability to support many real worldalgorithms and tasks. It can partition the input data,schedule the execution of program across a set ofmachines, handle machine failures and manage the intermachine communication. But it cannot be compared todatabase systems [25].5.5 HadoopIt is a programming framework for implementingMapReduce across large grid of servers. It is distributed innature and has better scalability than relational and columnstore databases. It is more suitable for unstructured data. Itis not for mixed workloads, complex data structures andmultitasking. Hadoop is a Java based open source project.With the support from Yahoo, Hadoop has achieved greatprogress. It has been deployed in a large system with 4,000nodes and is used in many large scale data processingtasks. It enables the addition of Java software Componentsand provides HDFS (Hadoop Distributed File System) andhas been extended to include HBase, a column storedatabase [26].5.6 Windows Azure Cloud StorageThe aim of Windows Azure Storage is to let users andapplications access their data efficiently from anywhere atany time using simple and familiar programming API.They can use scalable storage to store any amount of datafor any length of time on pay per use basis. It supportsstructured as well as unstructured data, NoSQL databasesand queues. It provides three data abstractions: Blobs,Tables and Queues. Blobs provide a simple interface forstoring named files along with metadata for the file. Tablesprovide structured storage. A Table is a set of entities,which contain a set of properties. Queues provide reliablestorage and delivery of messages for an application. Allinformation held in Windows Azure storage is replicatedthree times which allows fault tolerance [27].Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012ISSN (Online): 1694-0814www.IJCSI.org5.7 Microsoft SQL Server Data Services (SDDS)It is a key/value data store, which is also called the cloudextension of Microsoft’s SQL Server. It integrates withMicrosoft’s Sync Framework, which is a .NET library forsynchronizing dissimilar data sources. It provides schemafree data storage, SOAP or REST APIs and a pay-as-yougo payment system. It has three core concepts: Entity,Container and Authority. Entity is a property bag of nameand value pairs. Container is a collection of entities.Authority is collection of containers and acts as a billingunit [28].5.8 SherpaIt was popularly known as PNUTS in earlier publications.Data is organized into tables of records with attributes.Tables can be hashed or ordered. It supports blob data typealong with typical data types. It is a simplified relationaldata model. It supports selection and projection from asingle table and avoids join operation. Data is replicatedasynchronously. It can operate in high availability or highconsistency mode. Hadoop can use Sherpa as a data storeinstead of the native HDFS [29].5.9 DynamoIt is a highly available, scalable and distributed key-valuedata-store used by Amazon’s core services. It useseventual consistency to achieve high level of availabilityi.e. it can write anywhere and update will eventuallypropagate to all replicas asynchronously. There is norecord structure or indexes in Dynamo. It permits onlysingle key updates. It makes extensive use of objectversioning and application-assisted conflict resolution [30].5.10 MegaStoreIt blends the scalability of a NoSQL data-store and theconvenience of a traditional RDBMS to meet the storagerequirements of interactive Internet services such as e-mail,documents, social networking. It uses synchronousreplication to achieve high availability and a consistentview of the data. It provides transactional (ACID)guarantees within an entity group. It is a flexible datamodel with user-defined schema, full-text indexes andqueues [31].5.11 CouchDBCouchDB is a free, open-source, Apache project sinceearly 2008. It is a document-oriented database written inErlang. It belongs to NoSQL generation of databases.Documents (i.e. records) are stored in JSON (JavaScriptObject Notation) format and are accessed through anHTTP interface. It allows "views" to be dynamically82created using JavaScript. These views map the documentdata onto a table-like structure that can be indexed andqueried. It does not support a non-procedural querylanguage. It achieves scalability through asynchronousreplication. It has unique capability to serve as a selfcontained application server and database [32].5.12 MongoDBMongoDB is a GPL (General Public License) open sourcedocument-oriented JSON database system beingdeveloped at 10gen by Geir Magnusson and DwightMerriman. It is designed to be a true object database,rather than a pure key/value store. It stores data in JSONlike documents with dynamic schemas. It provides thespeed and scalability of key-value stores and richfunctionality like indexes and dynamic queries ofrelational databases. It provides horizontal scalability [33].Though NoSQL databases are widely accepted as clouddatabases in the database landscape, they are not a solutionfor all problems. They can work easily with large sparse data,but do not provide transactional integrity, flexible indexing,querying and SQL. They are not able to connect withcommonly used Business Intelligence tools. It is difficult tofind experienced NoSQL programmers, developers andadministrators to install and maintain them. So, Clouddatabases should be used with full awareness of theirlimitations.6. ConclusionsMassive data generated by web-based applications havechanged the whole database scenario. Cloud databasesappear to be a good solution for handling such data.Moreover, all organizations cannot afford to set upexpensive data center infrastructure for managing theirown databases. The growing popularity of Cloud databasesis marking the beginning of new era of databases. Thoughcloud databases are not ACID compliant, they are able tohandle massive workloads of web-based applications,which do not require such guarantees. Different Clouddatabases are available in the market. They share similarconcepts and features such as schema free database, synchronous/asynchronous replication etc. But each has itsunique API, query interface, data model and databasefunctions. These concepts need to be standardized fortheir better growth. Cloud computing and Cloud databasesare set to rule the next decade by overcoming thelimitations they have.References[1] Rajkumar Buyya et al., “Cloud computing and emerging ITplatforms: Vision, hype, and reality for deliveringCopyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.

IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No 3, July 2012ISSN (Online): 3]computing as the 5th utility”, Future Generation ComputerSystems, Vol.

data defined the role of NoSQL databases in the database technology as Cloud databases. RDBMs and NOSQL databases are briefly discussed as follows: 3.1 Relational Databases . The concept of relational databases is forty years old. It worked best in the era of hardware limits such as small disk space, little memory, slow processor speed and