Oracle Restart And FSFO In Cloud

Transcription

Oracle Restart and FSFO in Cloud

Agenda
I. Background
II. Core Technology
    - Hypervisor
    - Hypervisor HA
    - Oracle Restart
    - Oracle ADG
    - Oracle FSFO
III. Restart and FSFO in Cloud
IV. Issues
V. Non Stop Cloud Active Data Center

How to build Oracle in the Cloud

Background
Need for a high availability architecture in the cloud
- Examine: new users' needs (retain Oracle DB, high availability, low cost)
- Review: cloud-related technology
- Standardize: cloud architecture
- Spread: architecture

Core Technology: Hypervisor
Hypervisor: virtualization solution that manages multiple operating systems
[Diagram: guest OSes running on a hypervisor micro kernel]
- Type 1 (bare-metal): runs on the bare-metal host computer
- Type 2 (hosted): runs on the host OS
- Bare-metal hypervisors
  . Unix: IBM PowerVM (Micro Partitions/VIO), Oracle VM Server for SPARC (LDOM)
  . x86: VMware ESX/ESXi, MS Hyper-V, Citrix XenServer, Oracle VM Server for x86
- Hosted hypervisors
  . Unix: HP Integrity VM
  . x86: VMware Server/Workstation/Fusion, MS Virtual Server/PC, Oracle VirtualBox, KVM

Core Technology: Hypervisor HA
[Diagram: a pool of three hypervisors hosting a DB VM]
Features
- Failover
  . If one of the hosts fails, its VMs restart automatically on other hosts in the same pool
- Live Migration
  . Move a VM from one host to another with no downtime
- Hypervisor HA products
  . VMware HA
  . XenServer HA
  . Oracle VM HA

Core Technology: Oracle Restart (HA solution for standalone Oracle DB)
[Diagram: OHAS, oraagent, Restart, Oracle ASM, Disk Group, Listener]
Features
- Oracle Restart runs periodic check operations to monitor the health of Oracle components such as Database, Listener, ASM, ASM disk group, and Service
- If the check operation fails for a component, the component is restarted
- Oracle components can be restarted automatically whenever the database host computer restarts

Core Technology: Active Data Guard
[Diagram: Primary DB (read/write) sends updates, Sync/Async, to Standby DB (read-only), where Oracle applies them; each side holds control files, online logs, archive logs, and data files (SYSTEM, USER, TEMP, UNDO)]
Features
- Protection from database file corruptions
- Active Data Guard enables a physical standby database to remain open read-only while actively applying updates received from the primary database
- When Oracle detects corrupted blocks at the primary database, it repairs them online by copying the good version from an active standby database (and vice versa)

Core Technology: Fast-Start Failover (Oracle Data Guard HA solution, DG HA)
[Diagram: Data Guard Observer watching Site A and Site B; the primary and standby roles swap after failover; SYNC/ASYNC transport]
Features
- If the Observer detects a primary DB failure, it automatically fails over to the nominated standby database
- Failover is triggered under the following scenarios:
  . Instance failures
  . Shutdown abort
  . Offline datafile due to errors
  . Dictionary table corruptions
- Once the primary is accessible again, the Observer reconnects and re-creates it as a new standby DB using Flashback Database technology
- Integrated with GI (RAC or Restart)
  . Failed primary is automatically reinstated as a standby database
  . Role-based services start automatically

Restart and FSFO in Cloud: Cloud Architecture before Restart and FSFO
[Diagram: XenServer pool of three hypervisors hosting a DB VM, protected by XenServer HA]
Weak points
- No support for OS-level HA solutions
- Only hypervisor HA is applied
  . Does not detect DB crashes
  . After a DB crash, an OS reboot or hypervisor failover is needed, and the DBA must start the DB manually
- No DR solution is applied in case of:
  . Hypervisor pool down
  . Storage or database file failure

Restart and FSFO in Cloud: Architectural Improvement using Restart and FSFO
[Diagram: two hypervisor pools, each hosting a Restart-managed DB, linked by FSFO with SYNC transport]
- XenServer HA: hypervisor down
- Oracle Restart: DB crash (within 30 sec)
- Oracle FSFO: hypervisor down, hypervisor pool down, DB file corruption, DB crash lasting over 30 sec, OS reboot

Restart and FSFO in Cloud
[Chart: Failover Duration Time (min) for DB Down, OS Reboot, and Server Down, comparing Server Cloud (without Restart/FSFO), Restart, and Restart/FSFO]
Restart can reduce the time to detect and start the DB (our assumption: 20 min)
Restart and FSFO reduce failover time!
Item                 | Server Cloud (As-Was)       | Only Restart                | Restart + FSFO
Block Corruption     | Manual recovery (over 3 hr) | Manual recovery (over 3 hr) | Auto repair
DB File Corruption   | Manual recovery (over 3 hr) | Manual recovery (over 3 hr) | Failover to standby (in 1 min)
Storage Down         | Manual recovery (over 3 hr) | Manual recovery (over 3 hr) | Failover to standby (in 1 min)
Hypervisor Pool Down | Manual recovery (over 3 hr) | Manual recovery (over 3 hr) | Failover to standby (in 1 min)
FSFO reduces recovery time significantly!

Restart and FSFO in Cloud: The Result of Availability Test
Category   | Item                   | Recovery Time | HA Solution    | Description
OS         | DB Server Reboot       | 46 sec        | FSFO           | Executed failover to standby and standby reinstatement automatically
OS         | Observer Server Reboot | 0 sec         | MonObserver.sh | Observer restarted automatically after rebooting
OS         | DB LAN Card Fail       | 44 sec        | FSFO           | Executed failover to standby and standby reinstatement automatically
DB         | DB Instance Crash      | 26 sec        | Restart        | DB instance was restarted automatically
DB         | DB Listener Crash      | 0 sec         | Restart        | Listener was restarted automatically
DB         | GI Stop                | 39 sec        | FSFO           | Executed failover to standby, but the standby must be reinstated manually
DB         | Datafile Write Fail    | 32 sec        | FSFO           | Executed failover to standby, but the standby must be reinstated manually
Observer   | Observer Fail          | 0 sec         | MonObserver.sh | Observer restarted automatically
DG Broker  | Manual Switch Over     | 15 sec        | DG Broker      | Executed switchover by DG Broker
DG Broker  | Manual Fail Over       | 15 sec        | DG Broker      | Executed failover and automatic standby reinstatement
Hypervisor | Live Migration         | 0 sec         | XenServer      | Migrated to another hypervisor online
MonObserver.sh: Observer restart script, registered as a cron job
Maximize availability using Restart and FSFO

Issues: Observer Monitoring and Restart Script (MonObserver.sh)
Only EM supports Observer HA
    #!/bin/bash
    # ORACLE_HOME, PW, TNSALIAS, DBNAME and ObHome must be set before this point.
    RESTART_NORMALDOWN="N"   # set to "Y" to also restart an observer that was stopped normally
    # Save the broker's view of the database, then count the observer error codes in it
    ${ORACLE_HOME}/bin/dgmgrl -silent sys/${PW}@${TNSALIAS} "show database '${DBNAME}'" > ${ObHome}/ChkObserver.log
    ObserverDown=`grep -c "ORA-16819" ${ObHome}/ChkObserver.log | sed 's/ //g'`
    ObserverCrash=`grep -c "ORA-16820" ${ObHome}/ChkObserver.log | sed 's/ //g'`
    if [ "${ObserverCrash}" -ne "0" ]
    then
        echo "Restarting Crashed Observer at `date`" >> ${ObHome}/MonObserver.log
        ${ORACLE_HOME}/bin/dgmgrl -silent -logfile ${ObHome}/StopObserver.log sys/${PW}@${TNSALIAS} "stop observer;"
        ${ORACLE_HOME}/bin/dgmgrl -silent -logfile ${ObHome}/Observer.log sys/${PW}@${TNSALIAS} "start observer FILE='${ObHome}/Observer.dat'" &
    fi
    if [[ "${ObserverDown}" -ne "0" && "${RESTART_NORMALDOWN}" = "Y" ]]
    then
        echo "Starting Shutdown Observer at `date`" >> ${ObHome}/MonObserver.log
        ${ORACLE_HOME}/bin/dgmgrl -silent -logfile ${ObHome}/StopObserver.log sys/${PW}@${TNSALIAS} "stop observer;"
        ${ORACLE_HOME}/bin/dgmgrl -silent -logfile ${ObHome}/Observer.log sys/${PW}@${TNSALIAS} "start observer FILE='${ObHome}/Observer.dat'" &
    fi
ORA-16819: The observer process was shut down normally
ORA-16820: The observer process was terminated unexpectedly
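The detection step of MonObserver.sh can be isolated into a small function, which makes it easier to test without a running broker. The following is a minimal sketch, not the author's script: `observer_state` is a hypothetical helper name, and the meaning given to each ORA code follows the slide's own description.

```shell
#!/bin/bash
# Sketch: classify the observer state from a saved "show database" log.
# observer_state is a hypothetical helper, not part of the original script.
# Prints one of: crashed | down | ok | unknown
observer_state() {
    local log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "unknown"          # no readable log; do not guess a state
        return 1
    fi
    if grep -q "ORA-16820" "$log_file"; then
        echo "crashed"          # per the slide: observer terminated unexpectedly
    elif grep -q "ORA-16819" "$log_file"; then
        echo "down"             # per the slide: observer shut down normally
    else
        echo "ok"
    fi
}
```

A cron-driven wrapper could call `observer_state "$ObHome/ChkObserver.log"` and restart the observer only when the result is `crashed`, or also on `down` when a normal shutdown should be undone.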

Issues: Add Listener (LISTENER_DG) for private NW
Oracle Restart does not support network resources
[Diagram: the Observer reaches the Primary DB and Standby DB over the public NW through LISTENER; the two databases sync over the private NW through LISTENER_DG; each host has its own listener.ora]
1. Modify listener.ora
    SID_LIST_LISTENER =
      (SID_LIST =
        (SID_DESC =
          (SID_NAME = TAEKDB)
          (GLOBAL_DBNAME = TAEKDB_DGMGRL)
          (ORACLE_HOME = /oracle/app/oracle11/product/11.2.0/dbhome_1)))
    LISTENER_DG =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (ADDRESS = (PROTOCOL = TCP)(HOST = <Private IP>)(PORT = 1531))))
    SID_LIST_LISTENER_DG =
      (SID_LIST =
        (SID_DESC =
          (SID_NAME = TAEKDB)
          (GLOBAL_DBNAME = TAEKDB_DGB)
          (ORACLE_HOME = /oracle/app/oracle11/product/11.2.0/dbhome_1)))
2. Add LISTENER_DG
    srvctl add listener -l LISTENER_DG -p 1531
    srvctl start listener -l LISTENER_DG
    crsctl status res -t
    crsctl stat res ora.LISTENER_DG.lsnr -p
    crsctl modify resource ora.LISTENER_DG.lsnr -attr "ENDPOINTS="
        (for private endpoint only; MetaLink ID: 1544433.1)
    srvctl stop listener -l LISTENER_DG
    srvctl start listener -l LISTENER_DG

Issues: Public NIC Failure
Even when the primary public NIC fails, the private LAN continues to be available
-> Failover to standby is not executed, even when a service-down scenario is encountered
[Diagram: the Observer loses the public connection to the primary LISTENER while the private sync link between the LISTENER_DG listeners of the Primary DB and Standby DB stays up]
ObserverOverride (new feature in 11.2.0.4)
    Properties:
      FastStartFailoverThreshold     = '30'
      OperationTimeout               = '30'
      FastStartFailoverLagLimit      = '30'
      CommunicationTimeout           = '180'
      ObserverReconnect              = '10'
      FastStartFailoverAutoReinstate = 'TRUE'
      FastStartFailoverPmyShutdown   = 'TRUE'
      BystandersFollowRoleChange     = 'ALL'
      ObserverOverride               = 'TRUE'
      ExternalDestination1           = ''
      ExternalDestination2           = ''
      PrimaryLostWriteAction         = 'CONTINUE'
The ObserverOverride configuration property, when set to TRUE, allows an automatic failover to occur when the observer has lost connectivity to the primary, even if the standby has a healthy connection to the primary.
Application of Monit, an open source HA solution (11.2.0.3)
    cat /etc/monit.conf
    check host myserver with address xx.xxx.xxx.xxx    (public IP)
      # start program = ""
      stop program = "...crsctl stop has"
      if failed icmp type echo count 10 with timeout 3 seconds then stop
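The properties listed on this slide are set through DGMGRL's `EDIT CONFIGURATION SET PROPERTY` command. The sketch below only prints the commands so they can be reviewed first; `fsfo_property_cmds` is a hypothetical helper name, and its defaults simply mirror the slide's values for FastStartFailoverThreshold and ObserverOverride.

```shell
#!/bin/bash
# Sketch: print DGMGRL commands that set two FSFO configuration properties.
# fsfo_property_cmds is a hypothetical helper; defaults mirror the slide.
fsfo_property_cmds() {
    local threshold="${1:-30}"      # FastStartFailoverThreshold, in seconds
    local override="${2:-TRUE}"     # ObserverOverride, TRUE or FALSE
    printf 'EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = %s;\n' "$threshold"
    printf 'EDIT CONFIGURATION SET PROPERTY ObserverOverride = %s;\n' "$override"
    printf 'SHOW FAST_START FAILOVER;\n'
}

# Print the commands for review before sending them to the broker.
fsfo_property_cmds 30 TRUE
```

Assuming `dgmgrl` reads commands from standard input in your environment, the output could be piped into a session opened with the same connect string the monitoring script uses. ObserverOverride is only accepted from 11.2.0.4 onward, as the slide notes.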

Issues: TCP Connection Timeout
Oracle does not support a VIP between the primary DB and the standby DB
- Linux TCP connection timeout is 21 sec: tcp_syn_retries (default 5 (3, 6, 12, 21 sec))
[Diagram: a WAS connection times out against the primary IP, which is down, before reaching the standby IP; Primary DB and Standby DB are in sync]
tnsnames.ora
    ORCLRW =
      (DESCRIPTION_LIST =
        (LOAD_BALANCE = OFF)(FAILOVER = ON)
        (DESCRIPTION =
          (ENABLE = BROKEN)(CONNECT_TIMEOUT = 5)(TRANSPORT_CONNECT_TIMEOUT = 3)(RETRY_COUNT = 1)
          (ADDRESS = (PROTOCOL = TCP)(HOST = <Primary Public IP>)(PORT = 1521))
          (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = orclrw)))
        (DESCRIPTION =
          (ENABLE = BROKEN)(CONNECT_TIMEOUT = 5)(TRANSPORT_CONNECT_TIMEOUT = 3)(RETRY_COUNT = 1)
          (ADDRESS = (PROTOCOL = TCP)(HOST = <Standby Public IP>)(PORT = 1521))
          (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = orclrw))))
The CONNECT_TIMEOUT parameter is equivalent to the sqlnet.ora parameter SQLNET.OUTBOUND_CONNECT_TIMEOUT and overrides it.
The TRANSPORT_CONNECT_TIMEOUT parameter is equivalent to the sqlnet.ora parameter TCP.CONNECT_TIMEOUT and overrides it.

Issues: TCP Keep Alive
Oracle disables the Keep Alive setting on the client
- Linux TCP Keep Alive parameters
  . tcp_keepalive_time (default 7200, 2 hr)
  . tcp_keepalive_intvl (default 75, 75 sec)
  . tcp_keepalive_probes (default 9)
[Diagram: last activity -> tcp_keepalive_time -> probe starts -> tcp_keepalive_intvl * tcp_keepalive_probes -> connection dropped]
1. Modify TCP Keep Alive parameters in /etc/sysctl.conf
    net.ipv4.tcp_keepalive_time = 30
    net.ipv4.tcp_keepalive_intvl = 5
    net.ipv4.tcp_keepalive_probes = 3
2. Enable TCP Keep Alive (ENABLE=BROKEN) in $ORACLE_HOME/network/admin/tnsnames.ora
    ORCLRW =
      (DESCRIPTION_LIST =
        (LOAD_BALANCE = OFF)(FAILOVER = ON)
        (DESCRIPTION =
          (ENABLE = BROKEN)(CONNECT_TIMEOUT = 5)(TRANSPORT_CONNECT_TIMEOUT = 3)(RETRY_COUNT = 1)
          ...)
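The timeline on this slide implies a simple worst-case detection time for a dead idle connection: tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes. The sketch below evaluates that sum for the Linux defaults and for the tuned values from the slide; `keepalive_detect_secs` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: worst-case seconds before a dead idle peer is detected by TCP keepalive.
# keepalive_detect_secs is a hypothetical helper name.
keepalive_detect_secs() {
    local idle="$1"      # tcp_keepalive_time: idle seconds before the first probe
    local intvl="$2"     # tcp_keepalive_intvl: seconds between probes
    local probes="$3"    # tcp_keepalive_probes: unanswered probes before the drop
    echo $(( idle + intvl * probes ))
}

keepalive_detect_secs 7200 75 9   # Linux defaults: prints 7875 (over 2 hours)
keepalive_detect_secs 30 5 3      # values from the slide: prints 45
```

With the slide's settings, a client holding an idle connection to a failed primary notices the loss in about 45 seconds instead of more than two hours, provided ENABLE=BROKEN is set so that the Oracle client turns keepalive on.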

Non Stop Cloud Active Data Center
[Diagram: zones (A, B, C) made up of hypervisor pools, including a Finance pool, and pods; DB1, DB2, and DB3 run as RAC or Restart-managed databases on the hypervisors and are linked across zones by FSFO (SYNC/ASYNC) or ADG (manual)]

We can build Oracle in the Cloud
[Diagram: two Restart-managed databases linked by FSFO]
