Disaster Recovery Best Practices - Ministry Of Electronics And .

Transcription

Disaster RecoveryBest PracticesVersion 1.0Ministry of Electronics& Information Tech.,Government of IndiaCloud Management Office

DISCLAIMERThis document has been prepared by Cloud Management Office (CMO) under Ministry ofElectronics and Information Technology (MeitY). This document is advisory in nature andaims to provide information in respect of the GI Cloud (MeghRaj) Initiative.Certain commercial entities, technology, or materials may be identified in this document inorder to describe a concept adequately. Such identification is not intended to implyrecommendation or endorsement by MeitY.While every care has been taken to ensure that the contents of this Document are accurate andup to date, the readers are advised to exercise discretion and verify the precise currentprovisions of law and other applicable instructions from the original sources. It representspractices as on the date of issue of this Document, which are subject to change without notice.The readers are responsible for making their own independent assessment of the informationin this document.In no event shall MeitY or its' contractors be liable for any damages whatsoever (including,without limitation, damages for loss of profits, business interruption, loss of information)arising out of the use of or inability to use this Document.Page 2 of 29

Table of Contents1Purpose . 42Background. 53Why use Cloud Disaster Recovery? .74Disaster Recovery Principles . 95Types of Disasters .106Guidelines/Best Practices for adoption of Disaster Recovery. 126.1Identify criticality of Data & Applications.136.2Selection of DR Site Architecture .156.3Selection of Replication Technology .166.4Understanding Bandwidth Requirements .186.5Disaster Recovery as a Service (DRaaS) .196.6Documenting DR Plan .216.6.1Roles and Responsibility. 216.6.2Segregation of Responsibility between CSP, MSP and a User Department . 226.6.3Scope and Dependencies . 226.6.4Service Level Agreement . 236.7Validating DR Readiness.246.8Hybrid Cloud DR Scenarios .256.9Government Laws and Regulations on Disaster recovery Site .25Annexure 1 . 26Terminologies related to DR .26Cost considerations for setup of Disaster Recovery* .28Page 3 of 29

21 Purpose5MeitY has introduced MeghRaj initiative to utilize and harness the benefits of CloudComputing in order to accelerate delivery of e-services in the country. It has led to a significantadoption of Cloud technology across Government Departments. As there is a significantupsurge in digital information throughout the Government eco-system, there is an increasedneed of preparing our digital ecosystem to overcome any disasters, without a considerableimpact on public services. This document will assist User Departments in evaluating andconsidering the best practices suitable for their respective departments in terms of Disasterrecovery and ensuring business continuity.6.16.2Adoption of Disaster recovery setup is important for allDepartments to maintain availability of Government Operationsand resiliency of data/applications.6.36.46.56.66.9Page 4 of 297

22 Background5As there is a constant rise in information systems and electronic data, the rise in vulnerabilityof such data has also increased exponentially. The disruptions can be seen ranging from mild(e.g., short-term power outage, disk drive failure) to severe (e.g., site destruction, fire).Though the vulnerabilities may be minimized or eliminated through management,operational, or technical controls, as part of the departments resiliency effort, however, it isvirtually impossible to completely eliminate all risks.6.1One of the challenges for User Departments is ensuring the operations remain unaffected,even during adverse times. Disasters can strike at any moment, leading to socio-economic andreputational losses.6.2The guideline focuses on describing Disaster recovery planning and detailing out theconsiderations and best practices which should be followed to mitigate the risk of system andservice unavailability by providing effective and efficient solutions to enhance businesscontinuity.6.36.4What is Disaster Recovery?Disaster Recovery (DR) aims at protecting the Department from the effects of significantcatastrophic events. It allows the Departments to quickly resume mission-criticalfunctions after a disaster. Below figure explains various possible disasters that can take place.6.56.6Figure 1: Disaster scenarios6.9Page 5 of 297

2The goal for any Department with DR is to continue operating as close to normal as oachthatencompasses hardware and software, networking equipment, power, connectivity, andtesting that ensures Disaster Recovery is achievable.Disaster Recovery in Cloud56.1Disaster Recovery in Cloud entails storing critical data and applications in Cloud storage andfailing over to a secondary site in case of any disaster. Cloud Computing services are providedon a pay-as-you-go basis and can be accessed from anywhere and at any point of time. Backupand Disaster Recovery in Cloud Computing can be automated, requiring minimum manualinterventions.6.26.36.46.56.66.9Page 6 of 297

23 Why use Cloud Disaster Recovery?A Cloud Disaster Recovery provides numerous key benefits, over other types of disasterrecovery strategies:Easy Scale Up 5 6.16.2Pay as you go6.36.46.5GeographicRedundancy6.6Departments can scale Cloud DR effortlessly, since it is veryeasy to increase the amount of resources that can be backedup in the Cloud by purchasing more cloud infrastructurecapacity.Data centres on the other hand require sufficient servercapacity to ensure high level of operational performance andallow data centre to scale up or scale out, depending on theDepartments requirement. With Cloud based DR, there is no need to invest upfront inhardware, or to pay for more infrastructure than the actualuse at a given time. A Cloud-based Disaster Recovery serviceprovides virtual machine snapshots of physical or virtualservers from the primary data centre. The Department canpay for storing the snapshots, application data in a suspendedstate, and replication of data from primary to the secondary(cloud DR) site for data synchronization. It has to pay for theinfrastructure-as-a-service feature only in case of a disaster,wherein virtual machines (snapshots of primary servers)need to be brought online as a substitute for the primary site. While, a secondary physical DR site means investments in anadditional data centre space, connectivity and servers, it leadsto additional operational costs pertaining to power andcooling, site maintenance, and manpower requirements. Cloud-based Disaster Recovery makes it possible to leveragegeographic redundancy features. This means thatDepartments can spread backed-up resources across multiplegeographic regions in order to maximize their availability,even if part of the cloud that is used, fails. Whereas, on the other hand, it is costly to keep multiple DRsfor same data in traditional DR setups6.9Page 7 of 297

2FasterRecovery5 With Cloud Disaster Recovery services, the DR site can bebrought online within seconds or minutes—as opposed to aphysical DR site. A virtual machine instance can be up andrunning within seconds. Typically, a physical DR site operatesonly during data replication, or in the event of an actualdisaster. The time taken to make a DR site live will be more,in comparison to a Cloud DR. In addition, data loss is directlyrelated to downtime. A Cloud DR site that boots up within afew seconds translates to data loss of just that timeframe.6.16.26.36.46.56.66.9Page 8 of 297

24 Disaster Recovery PrinciplesDistance The distance for a DR site can vary depending on the types ofdisaster — such as earthquakes, floods, terror attacks, etc. TheDepartments should choose a DR location that fits itsbusiness model and regulatory requirementsLatency and performance of applications depends ondistance in Disaster scenarios5Recovery TimeObjective (RTO) 6.1 6.26.3Recovery PointObjective (RPO) RTO refers to the time an application can be down withoutcausing significant damage to the businessApplications should be categorized by priority and potentialbusiness loss in order to focus on applications which are morecritical firstApplications requiring near zero RTO require failoverservicesRPO refer to Departments data loss toleranceDepending on application priority, individual RPOs typicallyrange from 24 hours, to 12, to 8, to 4; down to near-zeromeasured in secondsNear-zero RPOs will require continuous replication4-hour RPOs will need scheduled snapshot replication6.46.56.66.9Page 9 of 297

25 Types of DisastersA disaster can be related to any incident (both intentional and/or non-intentional) that causessevere damage to the operations and data of any Organization.There are three major type of disasters:56.16.2Natural kes, floods, etc.)(Chemical releases,power outages, etc.)(Cybercrime, human error,terror attacks, etc.)There can be various scenarios of disasters for which Departments should be preparedbeforehand. The outages can range from a simple application failure to the disaster of wholeData centre. Below table shows some of the scenarios and he way Departments can deal withsuch outages.6.36.46.56.66.9Page 10 of 297

2Organizations can be categorized based on Disaster recovery planning:56.1No recovery plansSuch Departments fail to restore operationseven during minor outages such a, powersurge or server crashBackup of data exist but thereare no plans for DisastermanagementIn such cases, Departments need to back uptheir data regularly so that they can retrievetheir data on the newly replaced systems incase of failure.There is a backup data planand external site to keep thebacked-up dataSuch Departments cannot tolerate to keeptheir systems down for an extended period.They have an arrangement to restore therequired backed-up data which is kept atexternal site also called as data Off-sitingRemote, redundant sites asbackupDepartments which have multiple datacentres (at least two) that are located far awayfrom each other. These data centres areinterlinked with a strong communicationnetwork that facilitates the quick transfer ofdata in case of any disaster at either of thesecentres.An exact replica of the workingdata systemThis is where the data is backed up almostimmediately per hour, per minute or even persecond. With this method, Departments canrecover from a disaster almost immediately.Even though this method is the mostefficient, it is the most expensive as well.6.26.36.46.56.66.9Page 11 of 297

26 Guidelines/Best Practices for adoption of Disaster RecoveryThe Journey towards Disaster recovery setup provides step wise guidance in identifying theDR strategy suitable for their respective business. Below diagram depicts the steps which canbe followed while planning for DR site.5I. Identify thecriticality of Data &ApplicationsVI. Documenting DRPlan (roles andResponsibility,governance, SLA)VII. Validating DRReadinessII. Selection of DRSiteV. Disaster Recoveryas a Service (DRaaS)VIII. Using CloudEnvironment for DRIII . Selection ofReplicationMethodologyIV. Assessment ofBandwidthRequirementsIX. Government Lawsand Regulations onDRS6.16.26.36.4Figure 2: Journey towards DR6.56.66.9Page 12 of 297

26.1 Identify criticality of Data & ApplicationsBefore implementing Disaster Recovery Site, it is important to classify, and group applicationsbased on criticality. Such grouping of applications will help Departments to distinguish lineof applications from each other in terms of their importance to the Departments, as well astheir relative scope of influence on them.56.16.2MeitY has launched IGCSF Toolkit, as a part of Risk & Security Assessment Decision makingframework, which will help Departments to categorize their critical applications. The impactis divided into three categories, viz.1. Assessment of impact on Departments (Tangible), in case of security breach2. Assessment of impact on Departments (Intangible), in case of security breach3. Assessment of impact on individual, in case of security breachBased on the impacts (high, medium and low), Departments can categorize their respectiveapplication.Categorizing business requirements based on priorities should be finalized. The belowclassifications will detail out the baseline for decision-making matrix.6.3Criticality Level6.46.5Failures of applications in this class can resultin:Mission Critical data/applications Widespread stoppage of applications withsignificant impact on Government operations Public, wide-spread damage to GovernmentreputationEssential data/applicationsCore data/applications6.6Supporting data/application Direct impact on operationsDirect negative user satisfactionCompliance violationNon-public damage to Government reputationIndirect impact on operationsIndirect negative user satisfactionSignificant Government department productivitydegradationModerate Government department productivitydegradationIn addition to determining the criticality of applications, it is also necessary to understand thecriticality of the Departments data6.9The Departments data remains equally critical as the data has evolved fast from mere excel orspreadsheet records to representing communication such as e-mail and important digitaldocuments. However, not all data in an enterprise is mission critical. It is important to classifydata and define the associated metrics for retention, retrieval and archival. Missing this canPage 13 of 297

2increase costs exponentially (storage, backup, management, etc.). Classification helps innarrowing down the actual data that needs to be recovered in the case of a disaster.Low ImpactAll data and systems that does not require immediate restorationfor the Departments to continue its operationsModerate ImpactAll data and systems that are important Departments can operatebut in a diminished stateHigh ImpactAll data and systems without which Departments operation cancome to a halt.56.1User Departments should classify application and data based on criticality, asall the data and applications cannot be mission critical.6.26.3A survey that Forrester conducted in 2017 found that only 18% of organizations use eitherDRaaS (Disaster Recovery as a Service) offerings or public cloud IaaS offerings. On the otherhand, Gartner estimates that the size of the DRaaS market will exceed that of the market formore traditional subscription-based DR services by 2018.6.46.56.66.9Page 14 of 297

26.2 Selection of DR Site ArchitectureBased on the criticality of applications and data, User Departments need to determine the bestsuited Disaster recovery site for their respective operations and perform evaluation of cost forselection of type of DR Site.5User Departments can select between internal and external Disaster recovery sites, based ontheir respective requirements. Below diagrams depicts the major difference between the twosites.Internal Disaster Recovery Site:External Disaster Recovery Site:6.1When to use: Require aggressiveRTO, require control over all aspects ofthe DR process.When to use: When Departmentsrequire cost effective DR Sites.6.2Considerations: Expensive than anexternal site, Internal site needs to bebuilt up completely by the DepartmentConsiderations: An outside providerowns and operates an external DR site.3 types: Hot site, Warm site, Cold site6.3Distance is a key element in disaster recovery. A closer site is easier to manage, but it shouldbe far enough that it's not impacted by the drive up and staff costs.6.4External Disaster Recovery SiteHot Site6.56.6 Used for businesscritical apps Fully functional DC Ready in the event ofdisaster It can be of 2 types:o Active-Active- Bothsites are liveo Active-Passive- Datais replicated inpassive siteWarm SiteCold Site Data is replicated butservers may not beready Takes time to bringup servers to recoverapplication in warmsite Designed to be usedfor no- businesscritical apps Not ready forautomatic failover High risk of data loss May take weeks torecover, as data frombackup have to beloaded into servers MinimalinfrastructureAs per guidelines by MeitY, minimum distance required between DC and DR should not be6.9less than 100 kms.Page 15 of 297

26.3 Selection of Replication TechnologyData Replication is a way to ensure that Departments are prepared for disasters. Replicationcreates copies of data at varying frequencies, depending on the data in question and theindustry of the organization backing it up. In the event of a disaster, the primary systemsfailover to this replicated system.There are majorly two types of data replications:51. Synchronous Replication 6.1 6.2 6.36.42. Asynchronous Replication Copies of data is created in realtime on secondary site and locallyBusiness continuityVery Low RTO RPOMinimizedowntimesandassure a high infrastructuralavailabilityLimits:o The two sites cannot be farfrom each othero Expensive methodology Replication methodologies can also be controller based. Some of the methodologies elow:6.56.66.9Figure 3Page 16 of 297It creates copies of data as perdefined scheduleIt is suitable for Departments thatcan endure longer RTOs.No distance limitsIt allows to protect business evenin case of large-scale disasterswhich may damage both sites (forinstance, an earthquake)

25Departments that rely on mission critical data and cannot compromise on RTOs,can effectively leverage synchronous replication, while Organizations that canendure longer RTOs but need cost effective disaster recovery can useasynchronous replication. Also, Application based replication is the leastpreferred replication option due to dependency on individual Applicationvendor.6.16.26.36.46.56.66.9Page 17 of 297

26.4 Understanding Bandwidth Requirements5Bandwidth and latency are equally critical as other factors while planning Disaster Recovery.Departments which replicate data for potential failover, both locally and remotely, should takebandwidth requirements into account while planning the DR site. The planning phase of acloud-based DR implementation involves not only calculations with regard to keeping the offsite data up-to-date and within SLAs, but also with regard to user traffic when an actualrecovery is needed. It is important to have data reside closer to its respective user departmentsas well as the applications or workloads which are being accessed.There are two major factors which impact bandwidth requirement decision. Figure belowexplains the factors:6.16.26.36.4Figure 46.5Themajor considerations while estimating Bandwidth requirements whileplanning a DR site are:6.6 While transferring data to the Cloud, sufficient bandwidth is required. Hencebased on the application and data capacity and criticality, Departmentsneed to specify the estimated bandwidth requirement.Department needs to specify the requirement of redundant networkconnectivity between DC and DR siteIt is necessary to determine the network bandwidth requirements in Disasterscenarios, making the data accessible to its users after occurrence of adisaster6.9Page 18 of 297

26.5 Disaster Recovery as a Service (DRaaS)Since it is expensive to maintain a dedicated DR site, User Departments can choose tooutsource this cost. Replacing the cost of dedicated site with a predictable expense iscomparatively better option.56.16.26.36.46.56.6Disaster recovery as a Service (DRaaS) enables full replication and backup of all cloud dataand applications while serving as a secondary infrastructure. It actually becomes the newenvironment and allows an organization and users to continue with daily operations while theprimary system undergoes restoration.Reasons to consider Disaster Recovery as a Service (DRaaS) over On-premise DRSite:1. On-demand provisioning: All cloud services offer on-demand self-servicefunctionality. Once the service is initiated by the user, it takes only few minutes to getcommissioned, which is much faster than commissioning the same service on premise.2. Easy Scalability: Cloud services can be scaled exponentially. Adding resources to acloud-based solution takes very less time and effort. On the other hand, if on premise DRis present, then Departments should be sure about the capacity in order to provide anadequate DR coverage.3. Removes maintenance overhead: As Cloud platforms are part of managed service,maintaining and upgrading the underlying infrastructure is the responsibility of the Cloudservice provider. If internal DR environment is present with the Departments, it also needsto be upgraded with the new features and patches, which require overhead of IT resource.4. DRaaS is cost effective: Cloud Services works on pay per use model which is extremelycost effective as Departments can diligently control their spending by consuming and onlypaying for the resources they use. In case of on-premise environment, resources are oftenunderutilized and do not run at full capacity. Enough hardware procurement is also anoverhead cost.5. Multisite: Resources can be replicated to many different sites to ensure continuousbackup in the event of unavailability of one or more sites.6. Array agnostic: DRaaS replicates any environment and is not service provider specific.6.9Page 19 of 297

2Disaster recovery Management Tool - Disaster recovery management tool is a part ofDRaaS solution. It helps an Organization to maintain or quickly resume its mission-criticalfunctions after a disaster. It is used to facilitate preventative planning and execution forcatastrophic events that can significantly damage a computer, server, or network. It allows anorganization to run instances of its applications in the provider's cloud. The obviousadvantage is that the time to return the application to production, assuming networkingissues can be worked out, is greatly reduced because there is no need to restore data acrossthe Internet.56.16.2DRaaS pricing structure: DRaaS is often made up of several pricing components,including: Replicated data storage cost Software licensing costs (for disaster recovery and business continuity software toprovide data replication) Computing infrastructure cost Bandwidth costSome DRaaS providers only charge for storage and software licensing when the service is notactually being used, adding compute infrastructure and bandwidth costs if the service isactivated in the case of a disaster. Others charge for all components in the form of a "serviceavailability fee," regardless of whether or not the service is actually used.6.36.46.5Some of the key features of a disaster recovery software are: Ease of use Monitoring capabilities Automatic backup of critical data and systems Quick disaster recovery with minimal user interaction. Flexible options for recovery Recovery point and recovery time objectives Compatibility with physical servers Easy billing structure Options for the backup target6.6While selecting DRaaS, Departments should consider the following:DRaaS works on pay as you go model, so Organizations should select Serviceproviders which provide different DRaaS service for different classes ofapplications.In case of non-availability of Disaster recovery setup with primary CSP,services can be availed from other empaneled CSP’s.6.9Page 20 of 297

26.6 Documenting DR PlanWhile documenting DR Plan, Departments should take a holistic view and focus on recoveringthe application services and not just servers. The technical recovery plan for each application/service should be documented in a way that all the activities that need to be performed duringrecovery should be defined in a sequential manner. 5 Design for end to end recoveryDefine recovery goalsMake tasks specific: To make the system up and running, all steps should be predefined. Guess work should not be done. Documenting the steps is neededMaintain more than one DR recovery paths6.16.2It should cover all details such as physical and logical architecture, dependencies (inter- andintra-application), interface mapping, authentication, etc. Application dependency matrix,interface diagrams and application to physical/virtual server mapping play an important rolein defining how applications interact with each other to deliver various functionalities.6.6.16.36.4Roles and ResponsibilityRoles and responsibility should be clearly defined while planning for a Disaster Recovery Site.It should contain a governance structure often in the form of a Business ContinuityCommittee that will ensure senior management commitments and define senior managementroles and their respective responsibilities. The team composition should include below: Disaster Recovery Planning (DRP) Coordinator:The DRP Coordinator shall have comprehensive decision-making powers, member from thehigher Authority expected to lead the DR activities.6.5 6.6The Crisis Management Team shall comprise of Management level personnel who shallanalyze the damage at DC, advise the DRP Coordinator for Disaster Declaration, and initiatethe recovery of Operations at the DR Site. Crisis Management Team (CMT):Damage Assessment Team (DAT):The Damage Assessment Team shall comprise of a management and technical expertisemixture of personnel who shall assess & report the damage at DC and take steps to minimizethe extent of the same. Operations Recovery Team (ORT):The Operations Recovery Team shall comprise of a management and technical expertisemixture of personnel who shall undertake the recovery operations for SDC at the designateDR Site.6.9The Business Continuity Committee will be responsible for:Page 21 of 297

25Clarify their roles of all the members of the committeeOversee the creation of a list of appropriate committees, working groups andteams to develop and execute the planProvide strategic direction and communicate essential messagesApprove the results of Business Impact AnalysisReview the critical services and products that have been identifiedApprove the continuity plans and arrangementMonitor quality assurance activitiesResolve conflicting interests and priorities 6.1Roles and responsibilities of the Business continuity Committee should be clearly defined andwell communicated in the Departments.6.26.6.2Segregation of Responsibility between CSP, MSP and a User DepartmentThe segregation of roles and responsibilities between a Department, MSP and CSP can beseen in the below mentioned matrix:6.36.46.56.6Disaster Recovery ManagementProgram ManagementIntegration with Business ContinuityPlan MaintenanceManagement Actions (Escalations,Declaration and Orchestration)Define application interdependenciesDetermine sequence of recoveryRequirements definition (RTO, RPO)Application ValidationSystem RecoveryApplicationsDR TestingDatabaseMiddlewareCompute (servers)NetworkStorage/DataAlternate Site6.6.36.9On-PremisePaaSSaaSMANAGED BY USER DEPARTMENT ANDMSPMANAGED BY MSPMANAGED BY CSPScope and DependenciesDetermining the most important VMs and including them into the recovery scope can helpachieve shorter recovery time objectives. These VMs should be housing business-criticalinformation, applications. Also, dependency links between these VMs, applications, and ITPage 22 of 297IaaS

2systems should be considered. For example, the operation of a particular application can bedependent on information housed on a different VM or vice versa. Dependencies also existbetween employees and the components of the infrastructure. Figuring out and documentingsuch dependencies is necessary so that the Departments can continue their work with minimalinterruptions.6.6.456.16.2Service Level AgreementService Level Agreement (SLA) as already detailed out in “Guidelines for User Departmentson Service Level Agreementfor procuring Cloud Services” on MeghRaj Guidelines User Department Procuring Cloud%20Services Ver1.0.pdf” can be referred to get more clarity on which SLAs to be negotiatedwhile finalizing the Cloud service offerings. There are some key parameters which should betake

6 Guidelines/Best Practices for adoption of Disaster Recovery The Journey towards Disaster recovery setup provides step wise guidance in identifying the DR strategy suitable for their respective business. Below diagram depicts the steps which can be followed while planning for DR site. Figure 2: Journey towards DR 6.5 I. Identify the