White Paper

HADOOP MIGRATION MADE SIMPLE
A SINGLE APPROACH TO CLOUD, ON-PREMISES AND MULTI-VENDOR MIGRATIONS

By Steve Jones, Capgemini Global VP, Big Data and Analytics

TABLE OF CONTENTS

Executive Summary
Selecting the Right Tools and Processes
The Four Phases of Migration: Strategy, Planning, Execution and Adoption
Post Migration
Conclusion

WANdisco, Inc. follows a policy of continuous development and reserves the right to alter, without prior notice, the specifications and descriptions outlined in this document. No part of this document shall be deemed to be part of any contract or warranty.

WANdisco, Inc. retains the sole proprietary rights to all information contained in this document. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopy, recording, or otherwise, without prior written permission of WANdisco, Inc. or its duly appointed authorized representatives.

WANdisco and the WANdisco logo are trademarks. All other marks are the property of their respective owners.

HADOOP MIGRATION MADE SIMPLE:
A Single Approach to Cloud, On-Premises and Multi-Vendor Migrations

EXECUTIVE SUMMARY

Many firms are facing the challenge of transitioning departmental and niche Big Data programs into the information fabric of their enterprise. This shift often involves revisiting previous decisions regarding vendors and approach. It also requires migration and consolidation to be core competencies. Without them, any strategy will be based on what is currently available and not what's needed for the future.

To be successful, a migration or consolidation needs to be able to overcome key hurdles including:

- Downtime during migration
- Data and security model consistency
- New environment verification before a switchover

There are numerous business and technical benefits to be gained by migrating from one Hadoop distribution to another, whether on-premises or in the cloud. These include:

- Business consistency, which helps drive greater degrees of collaboration
- Consolidated investments across multiple business areas
- Improved functionality and performance offered by a different Hadoop distribution, or an updated version of the same distribution, which effectively becomes a migration if the underlying Hadoop file system format changes between releases
- Lower support costs offered by competing Hadoop distribution vendors
- Consolidation on a single Hadoop distribution or Hadoop-as-a-Service (HaaS) cloud platform
- Economies of scale offered by cloud-based storage and processing, with access to a range of powerful cloud analytics applications and other services

This white paper provides a systematic approach to on-premises and cloud Hadoop migration and shows:

- The tool used for migration is the key to avoiding downtime and business disruption
- To avoid migration downtime, the tool must be transactional and multi-directional, allowing a phased migration that enables old and new clusters to operate in parallel while data moves between them as it changes, until migration is complete
- A comprehensive migration plan is critical regardless of the tool used, to ensure that organizational goals are met

When looking at the shift to Big Data as a business platform for insight, and its impact on current efforts, it is essential to be able to answer the key question: "How do I get from what I have now to what I need in the future?" Active migration is a central part of answering that question.

SELECTING THE RIGHT TOOLS AND PROCESSES

Passive or Active?

When it comes to migration there are two broad choices: passive or active. Passive migration is what Data Warehouse people will be comfortable with, and what Hadoop vendors provide out of the box. It essentially involves taking a large extract at a given point in time, loading it into the new environment, shutting off the old environment, loading any additional data, and then turning on the new environment. This means downtime. It also normally means sticking with one vendor, as their tools are designed to move data from one version of their platform to another, not to migrate away. (A sketch of this flow follows at the end of this section.)

Active migration enables you to continually synchronize between the old and new environments. Both are live and operate in parallel during migration. Data, applications and users move in phases. Complete transition to the new environment doesn't take place until it's proven to be identical and defined acceptance criteria are met. This approach eliminates downtime and enables you to shift vendors rather than stick with only one.

Figure 1: Passive Migration approach from old to new environment requires downtime

Figure 2: Active Migration approach from old to new environment requires no downtime and allows you to shift vendors
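To make the passive flow in Figure 1 concrete, the following is a minimal sketch driven from Python. The namenode addresses and paths are hypothetical, the hadoop CLI is assumed to be on the PATH, and the writer shutdown in step 2 is application-specific, so it appears only as a comment.

```python
# Illustrative sketch of a passive migration, assuming hypothetical clusters.
import subprocess

OLD = "hdfs://old-nn:8020/data"   # hypothetical source cluster path
NEW = "hdfs://new-nn:8020/data"   # hypothetical target cluster path

def run(cmd):
    """Echo and run a shell command, failing loudly as a runbook script would."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Bulk extract: a one-shot DistCp of everything, typically started at night.
run(["hadoop", "distcp", OLD, NEW])

# 2. Shut off writers on the old cluster (application-specific, not shown).

# 3. Delta pass: -update copies only files that differ from the bulk extract.
run(["hadoop", "distcp", "-update", OLD, NEW])

# 4. Verify, then point applications at NEW. Everything between steps 2 and 4
#    is downtime, which is the cost Figure 1 illustrates.
```

Note the structural problem the sketch exposes: the downtime window spans the writer shutdown, the delta load, and verification, and it grows with the size of the delta.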

MIGRATION CHALLENGES: WHY OLD SCHOOL WON'T WORK IN THE NEW WORLD

"In the Box" Hadoop Migration On-Premises

Hadoop migration projects most often rely on DistCp, the unidirectional batch replication utility built into Hadoop. DistCp is at the heart of the backup and recovery solutions offered by the Hadoop distribution vendors. It's the tool they and their systems integrator partners most frequently rely on to deliver migration services, and its limitations are at the root of the downtime and disruption migration projects face.

With DistCp, significant administrator involvement is required for setup, maintenance and monitoring. Replication takes place at pre-scheduled intervals in what is essentially a script-driven batch mode of operation that doesn't guarantee data consistency. Any changes made to source cluster data while the DistCp migration process is running will be missed and must be manually identified and moved to the new target cluster (a sketch of that chore follows this section). In addition, DistCp is ultimately built on MapReduce and competes for the same MapReduce resources production clusters use for other applications, severely impacting their performance. These drawbacks require production clusters to be offline during migration, and they're the same reasons cluster backups using DistCp during normal operation must be done outside of regular business hours. This necessarily introduces the risk of data loss from any network or server outages occurring since the last after-hours backup.

Another migration technique is to physically transport hard drives between old and new clusters. In addition to downtime and limited resource utilization during migration, there are other challenges with this approach:

- If the underlying Hadoop file system format is different between the source and target clusters, custom software development may be required to support complex data transformation requirements. Data loss often results from incorrectly translating, overwriting, or deleting data.
- Even a small Hadoop data node server will have at least 10 physical disks. In a cluster of any size, it's almost inevitable that one or more may be lost or damaged in transit.

Hadoop to Cloud Migration

Hadoop distribution vendors have also added support to their DistCp solutions for moving data to the cloud, but the same challenges faced with on-premises Hadoop migration remain.

For large-scale data migration, some cloud vendors offer an appliance-based approach. Typically a storage appliance is delivered to the customer's data center and data is copied from the customer's servers to the appliance. The appliance is then shipped back to the cloud vendor for transfer to their servers to complete the process, which often takes more than a week. While this may be suitable for archiving cold, less critical data to the cloud, it doesn't address migration of on-premises data that continues to change.

In addition, such an approach doesn't address elastic data center or hybrid cloud use cases for on-demand burst-out processing, in which data has to move in and out of the cloud continuously. It also doesn't meet requirements for offsite disaster recovery with the lowest possible RTO (recovery time objective) to get back up and running after a network or server outage, nor does it enable the lowest possible RPO (recovery point objective) to minimize potential data loss from unplanned downtime. In many industries, both are mandated by regulatory as well as business requirements to be a matter of minutes.
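The manual "find what DistCp missed" chore described above can be seen concretely in the sketch below, which diffs recursive listings of the same path on two clusters. The namenode addresses are hypothetical, and a production check would compare checksums (for example via hdfs dfs -checksum) rather than just relative paths and sizes.

```python
# Hedged sketch: detect files DistCp missed or that changed mid-copy by
# comparing recursive HDFS listings from the old and new clusters.
import subprocess

def listing(namenode, path):
    """Return {relative path: size} parsed from 'hdfs dfs -ls -R' output."""
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", f"{namenode}{path}"],
        capture_output=True, text=True, check=True).stdout
    entries = {}
    for line in out.splitlines():
        parts = line.split()
        if len(parts) >= 8:  # perms, repl, owner, group, size, date, time, path
            rel = parts[-1].split(path, 1)[-1]
            entries[rel] = parts[4]
    return entries

src = listing("hdfs://old-nn:8020", "/data")   # hypothetical source namenode
dst = listing("hdfs://new-nn:8020", "/data")   # hypothetical target namenode

missing = {p for p in src if p not in dst}
changed = {p for p in src if p in dst and src[p] != dst[p]}
print(f"{len(missing)} files never copied, {len(changed)} differ after the copy")
```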

OVERCOMING MIGRATION CHALLENGES

The only way to avoid migration downtime and disruption is to use a tool that allows existing and new clusters to operate in parallel. This kind of migration experience can only be achieved with a true active transactional replication solution capable of moving data as it changes in both the old and new clusters, whether on-premises or in the cloud, with guaranteed consistency and minimal performance overhead.

With an active transactional migration tool, applications can be tested to validate performance and functionality in both the old and new environments while they operate side-by-side. Data, applications, and users move in phases, and the old and new environments share data until the migration process is complete. Problems can be detected when they occur, rather than after a period of downtime when they may be impossible to resolve without restarting the entire migration process, extending downtime even further.

In addition, the tool must be agnostic to the underlying Hadoop distribution and version, the storage it runs on, and in the case of cloud migration, the cloud vendor's object storage. The migration tool should also be capable of handling data movement between any number of clusters if the goal is consolidation onto a single big data platform, whether on-premises or in the cloud. WANdisco LiveData Platform is such a solution.

WANdisco LiveData Platform overcomes migration challenges by:

- Eliminating migration downtime and disruption with patented one-way to N-way active transactional data replication that captures every change, guaranteeing data consistency and enabling old and new clusters to operate in parallel. LiveData Platform delivers this active transactional data replication across clusters deployed on any storage supporting the Hadoop-Compatible File System (HCFS) API, local and NFS-mounted file systems running on NetApp, EMC Isilon, or any Linux-based servers, as well as cloud object storage systems such as Amazon S3. This eliminates many restrictions that would otherwise apply during migration.
- Simplifying consolidation of multiple clusters running on any mix of distributions, versions and storage onto a single platform. Clusters and data in the new post-migration environment can automatically be distributed in any configuration required, both on-premises and in the cloud. This makes it easy to bring new data centers online, or retire existing ones as part of a migration project.
- Allowing administrators to define replication policies that control what data is replicated between clusters and selectively exclude data from migration to specific clusters in the new environment, or move it off to be archived (illustrated in the sketch after this list).
- Providing forward recovery capabilities that allow migration to continue from where it left off in the event of any network or server outages.
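The replication-policy bullet above can be made tangible with a small example. WANdisco's actual administrative interface is not shown in this paper, so the structure below is a purely hypothetical rendering of what such policies express: which paths replicate to which clusters, what is excluded, and what is moved off to archive.

```python
# Hypothetical replication policies; the field names and cluster identifiers
# are illustrative, not WANdisco's real configuration schema.
replication_policies = [
    {
        "name": "warehouse-to-all",
        "source_path": "/data/warehouse",
        "targets": ["onprem-new", "cloud-s3"],   # hypothetical cluster names
        "exclude": ["/data/warehouse/tmp/*"],    # scratch data not worth migrating
    },
    {
        "name": "eu-pii-stays-onprem",
        "source_path": "/data/customers/eu",
        "targets": ["onprem-new"],               # deliberately excluded from the cloud
        "exclude": [],
    },
    {
        "name": "cold-data-archive",
        "source_path": "/data/logs/2015",
        "targets": ["archive-store"],            # moved off rather than migrated
        "exclude": [],
    },
]

for p in replication_policies:
    print(f"{p['source_path']} -> {', '.join(p['targets'])}")
```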

THE FOUR PHASES OF MIGRATION: STRATEGY, PLANNING, EXECUTION AND ADOPTION

Even with the best technologies, a clear strategy supported by a comprehensive migration plan is required to ensure that organizational goals are met. This is the case regardless of whether you're migrating from one on-premises cluster to another, or planning a more complex consolidation project across multiple data centers behind the firewall and in the cloud.

Figure 3: The Four Phases of Migration: Strategy (Research), Planning (Test Plan, Cutover Plan), Execution (Environments, Data Movement, Services Movement), Adoption (Testing & Verification, Cutover, Decommissioning & Operations)

Strategy

The first phase is to define a strategy that outlines:

- Organizational goals and objectives based on the priorities and expectations of your development, operations and end-user organizations, both pre- and post-migration.
- The scope of the migration effort. WANdisco LiveData Platform can support projects that require moving data across any number of clusters running on a variety of distributions, file systems and cloud storage environments simultaneously without disruption. This allows projects with much broader scope than migrating a single active cluster from one Hadoop distribution to another to be completed in a much shorter timeframe.
- A clear description of the expected benefits and acceptance criteria for your migration project.
- A complete list of risks and their impact on the organization (e.g., an estimate of the cost of any downtime).
- Well-defined roles and responsibilities for migration tasks and deliverables.

Planning

The second phase is to define a plan that:

- Clearly defines the order and timing of each task during the execution phase, with a detailed project plan that includes estimates, dependencies, roles and responsibilities (see the sketch after this section).
- Produces a detailed test plan that reflects the acceptance criteria defined with stakeholders during the strategy phase.
- Defines priorities and expectations, both pre- and post-migration. This research will also help gather the information required to define an adequate test plan.
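The first planning bullet, ordering and timing tasks against their dependencies, is at heart a dependency graph. As a toy illustration (the task names are ours and not a prescribed sequence), such a plan can be validated and ordered mechanically:

```python
# Toy rendering of a migration plan as a dependency graph; requires Python 3.9+.
from graphlib import TopologicalSorter

tasks = {
    "establish new environment": set(),
    "install replication tooling": {"establish new environment"},
    "migrate data": {"install replication tooling"},
    "migrate applications": {"migrate data"},
    "testing and verification": {"migrate applications"},
    "cutover": {"testing and verification"},
    "decommission old cluster": {"cutover"},
}

# static_order() raises CycleError if the plan contradicts itself.
for step, task in enumerate(TopologicalSorter(tasks).static_order(), 1):
    print(f"{step}. {task}")
```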

Execution

1. Establish the New Environment

Step one is to establish the new environment, and validate its correct implementation before moving data to it.

With WANdisco LiveData Platform's patented LiveData capabilities, all clusters are fully active, read-write at local network speed everywhere, continually synchronized as changes are made on either cluster, and recover automatically from each other after an outage.

2. Migrate Data

WANdisco LiveData Platform allows data transfer to take place while operations in both the old and new clusters continue as normal. You can test applications and compare results in both the old and new environments in parallel, and validate that data has moved correctly and applications perform and function as expected. WANdisco LiveData Platform can also replicate data selectively to control which directories go where. Data not needed post-migration can be moved off for archiving.

If network or server outages occur during migration, WANdisco LiveData Platform has built-in forward recovery features that enable migration to automatically continue from where it left off without administrators having to do anything.

3. Migrate Applications

Migration must include moving the analytical services and other applications that run on your existing Hadoop platform. To facilitate this, it's crucial that services are moved to 'shadow running', often in a headless mode where they are disconnected from enterprise systems, but where functionality and performance can be tested between the old and new environments. This migration will often require applications to be modified in some way to take advantage of, or remove reliance on, a particular vendor's tool set.

At the end of execution the company has two active Hadoop clusters which should be equivalent in data, functionality and analytical outcomes. At this point, the foundation for adoption of the new environment is in place.

Adoption

1. Testing and Verification

This stage measures the new environment against the set of acceptance criteria defined in your migration plan. Using approaches such as mRapid and LEAP from Capgemini, this functional and analytical equivalence, also known as outcome equivalence, can be automated. These tools and solutions enable the automated testing of reports, Hive and other Hadoop-based data technologies, as well as more complex analytical models such as R.

By automating the process and not relying on visual confirmation, a business is not only able to more rapidly and accurately verify outcome equivalence, but can also do so at a much lower cost base than human-driven approaches. (A sketch of the underlying check follows this section.)

2. Cutover

Cutover can be handled in a few ways depending on the business requirements and plan, but thanks to WANdisco's LiveData capabilities, data and services can be transitioned between environments in a phased approach rather than needing a 'big bang' approach where everything moves in a single bound.

By enabling this approach it becomes possible to performance test and verify the cutover, and in cloud-based environments to actively scale the cluster as new services switch from shadow running to taking enterprise load. At the end of cutover, all services are transitioned and no enterprise or business functions are relying on the old cluster.
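The idea behind automated outcome-equivalence testing can be sketched simply: run the same query against both environments and compare the results. Capgemini's mRapid and LEAP are proprietary, so the following only illustrates the underlying check; the JDBC URLs are hypothetical, the query is a placeholder, and beeline must be installed wherever this runs.

```python
# Minimal outcome-equivalence check between old and new Hive environments.
import subprocess

QUERY = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"

def run_on(jdbc_url):
    """Run QUERY via beeline and return its CSV output."""
    result = subprocess.run(
        ["beeline", "-u", jdbc_url, "--outputformat=csv2", "-e", QUERY],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

old = run_on("jdbc:hive2://old-cluster:10000/default")   # hypothetical host
new = run_on("jdbc:hive2://new-cluster:10000/default")   # hypothetical host

if old == new:
    print("PASS: outcome equivalence holds for this query")
else:
    print("FAIL: results differ between old and new environments")
```

In practice a suite of such checks, one per report or model output, replaces the visual confirmation the text warns against.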

3. Decommissioning and Operations

The final stage is the transition towards standard operations for the new cluster and the decommissioning of the old cluster. Alternatively, WANdisco's LiveData approach can continue to be used to provide additional options and benefits. With LiveData, the old cluster can continue to be used, either as a disaster recovery (DR) environment, or in a hybrid, active-active configuration, where the old and new environments are used in parallel, but for potentially different applications or workloads, while WANdisco LiveData Platform continues to keep the data consistent across the different environments.

Figure 4: No "Big Bang" Cutover

POST MIGRATION

Post-migration, WANdisco LiveData Platform enables:

Continuous availability with guaranteed data consistency

WANdisco LiveData Platform guarantees continuous availability and consistency with patented active-transactional replication for the lowest possible RTO and RPO across any number of clusters any distance apart, whether on-premises or in the cloud. Your data is available when and where you need it. You can lose a node, a cluster, or an entire data center, and know that all of your data is still available for immediate recovery and use. When your servers come back online, WANdisco LiveData Platform automatically resynchronizes your clusters after a planned or unplanned outage as quickly as your bandwidth allows.

100% use of cluster resources

WANdisco LiveData Platform eliminates read-only backup servers by making every cluster fully writable as well as readable, and capable of sharing data and running applications regardless of location, turning the costly overhead of dual environments during migration into productive assets. As a result, otherwise idle hardware and other infrastructure becomes fully productive, making it possible to scale up your Hadoop deployment without any additional infrastructure.

Selective replication on a per-folder basis

WANdisco LiveData Platform allows administrators to define replication policies that control what data is replicated between Hadoop clusters, on-premises file systems and cloud storage. This enables global organizations to replicate only what's required, and keep sensitive data where it belongs to meet business and regulatory requirements.

Minimal data security risks

In addition to working with all of the on-disk and network encryption technologies available for Hadoop, WANdisco LiveData Platform only requires the LiveData Platform servers to be exposed through the firewall for replication between on-premises data centers and to the cloud. This dramatically reduces the attack surface available to hackers.

In contrast, DistCp solutions require every data node in every cluster to be able to talk to every other through the firewall. This creates an untenable level of exposure, as well as an unreasonable burden on network security administrators as cluster size grows. (A back-of-the-envelope comparison follows this section.)

Active-transactional hybrid cloud

The same unique capabilities that support parallel operation of on-premises and cloud environments during migration enable WANdisco LiveData Platform to support true public-private hybrid cloud deployments post-migration. WANdisco LiveData Platform transfers data as it changes between cloud environments and on-premises Hadoop clusters with guaranteed consistency.
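To give a rough sense of how the exposure scales, the sketch below counts node-to-node network paths under simplified assumptions: DistCp-style replication needs datanode-to-datanode reachability between clusters, while a proxy-style replication tier exposes only a fixed set of replication servers. The counts are illustrative arithmetic, not measured figures.

```python
# Back-of-the-envelope firewall exposure comparison under stated assumptions.
for nodes_per_cluster in (10, 100, 500):
    distcp_paths = nodes_per_cluster * nodes_per_cluster  # every node to every node
    proxy_paths = 2 * 2  # e.g. two replication servers on each side (assumed)
    print(f"{nodes_per_cluster:>4} nodes/cluster: "
          f"DistCp ~{distcp_paths:>6} paths vs proxy-style ~{proxy_paths}")
```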

CONCLUSION

Hadoop migration strategies and the tools that support them need to account for a wide variety of requirements. In summary, with a LiveData approach to migration you have the ability to:

- Operate both old and new clusters in parallel, without stopping operation in the old cluster either during or after migration.
- Make data produced in your new production cluster available in the old cluster infrastructure as part of a hybrid cloud strategy.
- Test applications in parallel in the old and new environments to validate functionality and performance.
- Phase your migration of data, applications and users.
- Consolidate multiple clusters in a distributed environment running on a mix of distributions and storage onto a single on-premises or cloud platform, distributed in any manner your organization requires.
- Eliminate the need to restrict your production environment to a single cluster. Both old and new, or a combination of multiple clusters and cloud storage environments, can be operational and work on the same underlying content on an opt-in basis.

With WANdisco's LiveData approach to big data migration to the cloud, there is no application downtime during migration, no risk of data loss, and no data inconsistencies, even when data sets are under active change.

For more information please visit: www.wandisco.com/platform and www.capgemini.com/insights-data

5000 Executive Parkway, Suite 270, San Ramon, CA 94583
www.wandisco.com

Talk to one of our specialists today:
US: 1 877 WANDISCO (926-3472)
EMEA: 44 (0) 114 3039985
APAC: 61 2 8211 0620
All other: 1 925 380 1728

Join us online to access our extensive resource library and view our webinars. Follow us to stay in touch.

Copyright 2020 WANdisco, Inc. All rights reserved. WP-HMMS-200922
