Virtual Linux Server Disaster Recovery Planning - SHARE

Transcription

Virtual Linux Server Disaster Recovery Planning
Rick Barlow
Nationwide Insurance
August 10, 2011

Agenda
  Definitions
  Our Environment
  Business Recovery Philosophy at Nationwide
  Planning
  Execution
This information is for sharing only and not an endorsement by Nationwide Insurance

Definitions
  High Availability
    – “With any IT system it is desirable that the system and its components (be they hardware or software) are up and running and fully functional for as long as possible, at their highest availability. The most desirable high availability rate is known as “five 9s”, or 99.999% availability. A good deal of planning for high availability centers around backup and failover processing and data storage and access.”
    – Deal with significant outage within data center
      LPAR failure
      Operating System outage
      Application ABEND
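
To put the “five 9s” figure in concrete terms, the sketch below converts an availability percentage into allowed downtime per year. It is illustrative only: the function name is made up for this example, and it assumes a 365-day year.

```shell
#!/bin/sh
# Allowed downtime per year, in minutes, for a given availability percentage.
# Assumption: a 365-day year (525,600 minutes).
downtime_minutes_per_year() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", (100 - pct) / 100 * 365 * 24 * 60 }'
}

# Compare three, four and five nines.
for pct in 99.9 99.99 99.999; do
    printf '%s%% available -> %s minutes of downtime per year\n' \
        "$pct" "$(downtime_minutes_per_year "$pct")"
done
```

At 99.999% the budget is a little over five minutes a year, which is why planning centers on failover rather than repair.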

High Availability
  [Diagram. Labels: Production; LPAR1, LPAR2, LPAR3, LPAR4, LPAR6, LPAR8; VLAN 1, VLAN 2, VLAN 3, VLAN 5, VLAN n; HTTP Server, WAS Node, DB Server; VM TCPIP; vSwitch; OSA; Cisco Switch w/ Firewall; Front End, Back End, Tools; Internet, intranet]

Definitions
  Disaster Recovery
    – “Disaster recovery in information technology is the ability of an infrastructure to restart operations after a disaster. While many of today's larger computer systems contain built-in programs for disaster recovery, standalone recovery programs often provide enhanced features. Disaster recovery is used both in the context of data loss prevention and data recovery.”
    – Deal with complete outage
      Natural catastrophe
      Data center
      Major hardware failure

Our Environment
  Four z196 installed in 2011
  Current configuration:
    – Production boxes
      36 IFLs
      404GB memory
      6 z/VM LPARs (plus system programmer configuration test)
      Tier 4 data center
        – Fully redundant power, telecom, generators, etc.
    – Development boxes
      21 IFL engines on development box
      663GB memory
      6 z/VM LPARs (plus system programmer test)

Planning
  [...]ation
  Teamwork

Design
  What
    – Identify what needs to be recovered
      Everything or subset?
      Priorities – recovery order
  When
    – Need to know recovery objectives
  Where
    – Identify where recovery will occur
      Second site
      Vendor site
  How
    – Identify how to transfer programs and data
    – Identify how to perform recovery

Challenges
  Production configuration changes may require DR configuration changes
    – Processor model changes (not necessarily)
    – Capacity
    – Network configuration
    – Transition
  Consistent distribution and synchronization of boot and DR scripts
  Commitment to regular testing
    DR Infrastructure
    Application DR

Priorities
  Prioritizing your application recovery must be done by the people in your organization who understand the business processes.
  Business requirements that drive recovery time-frame
    – regulated
    – financial / investments
    – responsive to customers

Setup
  Asynchronous replication with SRDF between production and recovery sites
  Replicated volumes at recovery site
    – z/VM: different unit address between prod and dev/test/DR
    – SAN: target WWPN and LUN differ between production and recovery
      Changes for recovery are automated in a Linux script run at boot
  Manual processes
    – Initiate clone copies
    – Vary ECKD DR volumes online
    – Start Linux servers
    – Update DNS to reflect DR IP addresses for all servers
    – Middleware and application hard-coded parameters (e.g. IP addresses)

Setup
  The DR process would be the same for the following failures
    – System z failure
    – DASD / SAN storage frame failure
  Long distance (inter-data center) fiber failure
    – No DR required (redundant links installed)

Setup
  [Diagram. Labels: IBM z196; FICON (x8); FCP (x12); EMC V-Max storage; ECKD; Open Systems SAN; Fibre; Replication]

Normal [...]
  [Diagram. Labels: Replication; Fiber; SAN]

Failure Happens
  If failure occurs: manually stop replication; initiate DR clones; bring ECKD volumes online at DR site; start Linux servers.
  [Diagram. Labels: Production; DR; DMX; Fiber; SAN]

Recovery Begins
  Servers identify DR configuration, change IP address, change SAN parameters, register new IP with DNS, start replication south to north.
  [Diagram. Labels: Production; DR; V-Max; Fiber; SAN]
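
The slides do not say how a server registers its new IP with DNS. As one possible approach, the sketch below generates input for the BIND `nsupdate` utility that replaces a host's A record with its DR address. The use of dynamic DNS, the function name, the `example.com` domain and the TTL are all assumptions for illustration; the host name and DR address are taken from the PARM file example elsewhere in this deck.

```shell
#!/bin/sh
# Print nsupdate commands that replace a host's A record.
# $1 = fully qualified host name, $2 = new (DR) IP address, $3 = TTL in seconds
dns_update_commands() {
    printf 'update delete %s A\n' "$1"
    printf 'update add %s %s A %s\n' "$1" "$3" "$2"
    printf 'send\n'
}

# Hypothetical domain; nothing is sent to a DNS server here.
dns_update_commands pzvmws001.example.com 10.221.1.1 300
```

Printing the commands rather than piping them straight into `nsupdate` lets the change be reviewed, or logged, before it is applied.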

Automation
  Avoid manual processes
    – Dependence on key individuals
    – Prone to mistakes
    – Slow
  Automated processes
    – Require only basic knowledge of environment and technologies in use
    – Accuracy
    – Repeatable
    – Faster
    – Does not mean build it once then ignore; requires regular review and updates

Automation
  Automation begins at provisioning
    – DR configuration stored with production configuration
    – CMS NAMES file
      Contains all information about provisioned server
      Copy stored on DR disk also
      Also used to generate report of server definitions for easy lookup
    – Linux PARM file stored on CMS disk
      Stored on disk accessible at boot time
      Copy stored on DR disk also
    – Define everything needed to provision server and at boot time

Automation
  Extract from LINUX NAMES file for one guest:
    [...]d web server
    [...].1.1
    :vswitch.PRODVSW1
    :vlan.2102
    :ip_nb.10.2.1.1
    :vsw_nb.NETBKUP1
    :vlan_nb.3940
    :ip_dr.10.221.1.1
    :vsw_dr.PRODVSW1
    :vlan_dr.2102
    :ip_drbu.10.222.1.1
    :vsw_drbu.NETBKUP1
    :vlan_drbu.3940
    :oth_ip.10.1.1.5 10.1.1.15
    :dr_oth_ip.10.221.1.5 [...]
    [...]torage_os.7.1G
    :bootdev.251

Automation
  Extract from LINUX NAMES file for one guest (cont’d):
    :storage[...]
      [...]0000000000:8.43
      R1:0100:5006048AD52D2588:005A000000000000:8.43
      R1:0200:5006048AD52D2587:0059000000000000:8.43
      [...]
    [...]rage_san_dr.16.86G
    :sanluns_dr.
      R2:0100:5006048AD52E4F87:006A000000000000:8.43
      R2:0100:5006048AD52E4F87:006B000000000000:8.43
      R2:0200:5006048AD52E4F88:006A000000000000:8.43
      [...]
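
A lookup of a single tag in a NAMES-style record could be sketched as follows. This is not the tooling used at Nationwide, which the slides do not show. The function name is hypothetical, and the tag names are written with underscores as an assumption, since the transcript shows them with spaces. The sample values come from the extract above.

```shell
#!/bin/sh
# Print the value of one tag from a NAMES-style record read on stdin.
# A tag is a word that starts with ":" and whose name ends at the first ".";
# its value runs until the next tag, so it may span several words.
names_tag() {
    awk -v want="$1" '
    {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^:[A-Za-z0-9_]+\./) {
                dot = index($i, ".")
                tag = substr($i, 2, dot - 2)
                val[tag] = substr($i, dot + 1)
            } else if (tag != "") {
                val[tag] = (val[tag] == "" ? $i : val[tag] " " $i)
            }
        }
    }
    END { if (want in val) print val[want]; else exit 1 }'
}

# Sample record built from values on the slide (underscored tag names assumed).
sample_record=':vswitch.PRODVSW1 :vlan.2102
:ip_dr.10.221.1.1 :vsw_dr.PRODVSW1 :vlan_dr.2102
:oth_ip.10.1.1.5 10.1.1.15 :bootdev.251'

printf '%s\n' "$sample_record" | names_tag ip_dr
printf '%s\n' "$sample_record" | names_tag oth_ip
```

Splitting on words rather than on every colon keeps values such as the SAN LUN entries, which contain colons themselves, in one piece.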

Automation
  PARM file
    HOST=pzvmws001
    ADMIN=10.1.1.1
    BCKUP=10.2.1.1
    DRADMIN=10.221.1.1
    DRBCKUP=10.222.1.1
    ENV=PROD
    VIP=10.1.1.5,10.1.1.15
    BOOTDEV=251
    DRVIP=10.221.1.5,10.221.1.15
    SAN_1=[...]8AD52D2587:005A000000000000
    SAN_2=[...]8AD52D2587:0059000000000000
    SAN_3=[...]8AD52E4F88:006A000000000000
    SAN_4=[...]8AD52E4F88:006B000000000000

Automation
  Alternate Start-up Scripts
    – Identify production or DR mode
      VMCP – interact with CP
      CMSFS – read CMS files
    – Set parameters for environment
      Hostname to /etc/HOSTNAME
      IP addresses to /etc/sysconfig/network/ifcfg-qeth-bus-ccw-0.0.xxxx
      SAN LUN information
      Color prompt by environment
        – Prod: Red
        – DR: Yellow
        – Tools: Green

Automation
  Extract from boot.config
    # Setup variables
    # Bring the CMS minidisk at virtual address 0191 online
    echo "1" > /sys/bus/ccw/devices/0.0.0191/online
    sleep 5
    # modprobe required just in case
    modprobe cpint
    # Device name of the 0191 disk (field 7 of /proc/dasd/devices)
    PARMDEV=$(grep 191 /proc/dasd/devices | awk '{print $7}')
    # Slide annotation (garbled in transcription): NZVWS001NAT VN1N
    QUSERID=$(hcp query userid)
    GUEST=$(echo $QUSERID | cut -d" " -f 1)
    LOCO=$(echo $QUSERID | cut -c14)
    LPAR=$(echo $QUSERID | cut -c13-15)
    BOX=$(echo $QUSERID | cut -c1)
    # Copy this guest's PARM file from the CMS disk, then source it
    cmsfscat -d /dev/$PARMDEV -a ${GUEST}.PARMFILE > /tmp/sourceinfo
    . /tmp/sourceinfo
    echo "0" > /sys/bus/ccw/devices/0.0.0191/online
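
The `cut` positions in the extract are easier to follow against a concrete string. The sketch below applies them to a hypothetical `QUERY USERID` response of the form "userid AT system"; the sample string is an assumption pieced together from the slide annotation, and on a live guest it would come from `hcp` or `vmcp`.

```shell
#!/bin/sh
# Hypothetical sample response; not captured from a real system.
QUSERID="NZVWS001 AT VN1N"

# Quote the variable so padding between fields is preserved.
# The column positions below assume an 8-character user ID.
GUEST=$(printf '%s\n' "$QUSERID" | cut -d" " -f1)   # user ID of this guest
BOX=$(printf '%s\n' "$QUSERID" | cut -c1)           # first character of the user ID
LPAR=$(printf '%s\n' "$QUSERID" | cut -c13-15)      # first three characters of the system name
LOCO=$(printf '%s\n' "$QUSERID" | cut -c14)         # second character of the system name

printf 'GUEST=%s BOX=%s LPAR=%s LOCO=%s\n' "$GUEST" "$BOX" "$LPAR" "$LOCO"
```

Because the fields are picked by column, a user ID shorter than eight characters would shift the results if the padding were collapsed, for example by an unquoted `echo $QUSERID`.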

Automation
  Result of cmsfscat
    cat /tmp/sourceinfo
    HOST=pzvmws001
    ADMIN=10.1.1.1
    BCKUP=10.2.1.1
    DRADMIN=10.221.1.1
    DRBCKUP=10.222.1.1
    ENV=PROD
    VIP=10.1.1.5,10.1.1.15
    BOOTDEV=251
    DRVIP=10.221.1.5,10.221.1.15
    SAN_1=[...]8AD52D2587:005A000000000000
    SAN_2=[...]8AD52D2587:0059000000000000
    SAN_3=[...]8AD52E4F88:006A000000000000
    SAN_4=[...]8AD52E4F88:006B000000000000
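
Once the PARM file has been sourced, the start-up script has both the production and the DR values available and only needs to pick one set. A minimal sketch of that selection follows. It assumes the file is in shell KEY=VALUE form; the function and the `ACTIVE_*` variable names are invented for this example.

```shell
#!/bin/sh
# Values as they would be after sourcing the PARM file (KEY=VALUE form assumed).
HOST=pzvmws001
ADMIN=10.1.1.1
BCKUP=10.2.1.1
DRADMIN=10.221.1.1
DRBCKUP=10.222.1.1

# Choose the admin and backup addresses for the current mode.
# $1 = DR selects the DR addresses; anything else selects production.
select_addresses() {
    case "$1" in
        DR) ACTIVE_ADMIN=$DRADMIN; ACTIVE_BCKUP=$DRBCKUP ;;
        *)  ACTIVE_ADMIN=$ADMIN;   ACTIVE_BCKUP=$BCKUP ;;
    esac
}

select_addresses DR
printf '%s in DR mode: admin=%s backup=%s\n' "$HOST" "$ACTIVE_ADMIN" "$ACTIVE_BCKUP"
```

Keeping both sets in one file means the replicated copy at the recovery site already holds everything the guest needs to reconfigure itself.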

Automation
  More extract from boot.config
    # Pick the prompt color; a PROD guest whose LOCO and BOX differ is in DR mode
    case "$ENV" in
    PROD)
        if [ "$LOCO" = "$BOX" ]
        then
            CLR="41"; #Red
        else
            CLR="43"; #Yellow/Gold
            ENV="DR";
        fi
        ;;
    DEV|JT|TOOLS|TOOL)
        CLR="42"; #Green
        ;;
    ST)
        CLR="44"; #Blue
        ;;
    PT)
        CLR="45"; #Purple
        ;;
    UAT|IT)
        CLR="46"; #Turq
        ;;
    *)
        CLR="42"; #Green
        ENV="UNK";
        ;;
    esac
  Examples:
    barlowr@szvmjt002:JT:barlowr
    barlowr@nzvmws001:PROD:barlowr
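
The extract stops before the prompt itself is set. The values 41 to 46 are ANSI background color codes, so one way to use `CLR` is sketched below. The `user@host:ENV:directory` layout is inferred from the example prompts, the function name is hypothetical, and the `\u`, `\h`, `\W` and `\[ \]` escapes are specific to bash.

```shell
#!/bin/sh
# Build a bash PS1 string with a colored background.
# $1 = ANSI background color code (e.g. 41 for red), $2 = environment label
build_ps1() {
    printf '\\[\\033[%sm\\]\\u@\\h:%s:\\W\\[\\033[0m\\] ' "$1" "$2"
}

# Show the prompt strings for a production and a DR guest.
build_ps1 41 PROD; echo
build_ps1 43 DR; echo
```

In a bash profile this could be applied with `PS1=$(build_ps1 "$CLR" "$ENV")`, so that a DR guest is visibly different from production at every prompt.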

Documentation
  Document everything
    – Declaration criteria
    – Contact information
      Operating System
      Middleware
      Application
      Network
      Security
    – Lists of servers
    – Recovery process
    – Verification process
    – Fail-back process

Documentation
  DR Procedure:
    – Confirm DISASTER declaration
    – Begin shutdown of all test/development guests to ensure sufficient capacity.
    – Bring up production DR guests identified by business units for each application environment.
    – Make appropriate emergency DNS changes to point users to DR environment per definitions for each application environment.

Documentation
  Return Procedure:
    – Confirm DISASTER OVER declaration
    – Reverse disk replication; confirm synchronization
    – Follow instructions for confirmation of original production environment for each application.
    – Bring down DR guests identified by business units for each application environment.
    – Make appropriate DNS changes to point users to non-DR environment per definitions for each application environment.
    – Resume normal disk replication.

Teamwork
  Recovery coordinator
  z/VM System Programmers
  Linux System Administrators
  Middleware
    – WAS Administrators
    – Database Administrators
    – MQ Administrators
  Application Teams
    – Testing methodology
    – Expected results
  Avoid processes that are dependent on subject matter experts (SME) when a disaster happens

Execution
  Test
  Document results
  Compare to plan
  Repeat

Execution
  Where
    – to recover the systems
      Your own second site
      A recovery vendor
    – do the people go
      Identify what personnel need to travel to recovery site
        – Document travel procedures
      Identify alternate (local) office space
        – Some office locations may be able to access recovery site if connectivity is available

Execution
  Testing
    – Test as often as feasible
      Frequency may depend on having your own site or contracting with a vendor
    – Tests should be as close as possible to real recovery conditions
    – Operating systems are easy
    – Some subsystems are not so easy (e.g. large database)
    – Multi-platform applications can be more complex
    – Automate as much as possible to avoid manual effort

Document Results, Compare to Plan
  Detailed plans for all test scenarios
  Carefully track tests
  Document action items and follow up for improvements
  Build on successes

Repeat
  Do it again
  Do it regularly
  Corporate emphasis may be required to encourage all applications to test

Contact Information
  “And I thought we were busy before Linux showed up!”
  Rick Barlow
  Senior z/VM Systems Programmer
  Phone: (614) 249-5213
  Internet: Richard.Barlow@nationwide.com
