The Oracle DBA's Guide To Disaster Recovery

Transcription

The Oracle DBA’s Guide toDisaster RecoveryDbvisit Software Limited.All Rights ReservedMay 2020

ContentsExecutive Summary.3Introduction to Disaster Recovery.3Disaster Recovery Methods.4Backups.4Block Copy.6Logical Replication.7Physical Replication.8Scripting.9Summary of DR Methods.9On-Premises, The Cloud, and Hybrid nclusion.13Bibliography.142

Executive SummaryThis eBook examines the Disaster Recovery methods available to Oracle Database Administrators. The Disaster Recoverymethods we discuss are no Disaster Recovery, backups, hardware/block copying, logical replication, physical replication, andscripting.It then looks at the options for delivering Disaster Recovery with Oracle SE in the cloud, and the configurations available.It discusses Dbvisit Standby as an example of a cost-effective and proven product delivering DR for Oracle Standard Editiondatabase.Introduction to Disaster RecoveryThe very word disaster is one that conjures up vivid images for many people. Some might think of Godzilla destroying acity, others may see it as a wedding cake not delivered on the big day or a delivery that wasn’t made overnight. For OracleDBAs, disaster often means something bad happened to their database.Different types of disasters can happen to DBAs and their database. It could be a natural disaster such as a flood orhurricane. It could be a regional disaster where a whole city loses power or it could just be that the data center or buildingloses power. The disaster could be something that affects all of the servers and databases such as the scenarios listedabove or there could be something specific to a server like a disk failure or hardware corruption.It might even be a human error – yes, we are all just human, we make mistakes, it happens. It could be a malicious act bya hacker, an honest mistake by a junior DBA or logical corruption from a bad batch job. As you can see there are manytypes of disasters that can happen.An Oracle DBA needs to be prepared for all different types of disasters. Pivotal IT reports that in the past two years, over50 percent of businesses experienced an unforeseen interruption, and the vast majority (81%) of these interruptionscaused the business to be closed one or more days. (Sinderman, 2016) Many of the different preparations that are neededfor one type of disaster can be used for other scenarios. Being prepared means thinking of the different events that canoccur and being ready for them. While an Oracle DBA will have specific thoughts and needs, the whole business shouldhave a Business Continuity Plan that shows how the business can function in the wake of different types of disasters.Not all disasters are huge disasters but to a business, any type of downtime could be considered a disaster. While ITorganizations expected to deliver uptime of 100% and with the nature of having IT systems available 24/7 365 days ayear, the pressure is to avoid downtime at all costs and keep the systems available.It is important that the Oracle DBAs as part of the IT department, talk with the business units to learn about the SLA(Service Level Agreement) for each of the applications and systems that they support. What are the expectations foruptime? What are the consequences of downtime? Does the IT department know what is needed in terms of where theapplication fits in the scheme of the overall business? Have the business units conveyed how much downtime or data losswould mean in terms of revenue or productivity?Conversely, have the Oracle DBAs articulated the specifics of how to meet the SLAs? It is one thing to say you need tobe up and running in 5 minutes, but is the business unit willing to pay for that? If zero data loss is truly the goal, are thefull costs of having a full redundant highly available system willing to be borne? Oftentimes there needs to be a completeopen and frank discussion so that appropriate plans can be put in place that are cost effective, and can meet the needsof the specific application. Pivotal IT has found that companies are still failing to put strategic contingency plans in place.Analysts from Hughes Marketing Group have found that over the course of a month, 90 percent of small companies(fewer than 100 employees) spend less than 8 hours planning or managing their business continuity plans. (Sinderman, 10Backup and Disaster Recovery Statistics You Must Know, 2016)Some systems may truly need to be up and running in 10 minutes with a maximum of 10 minutes’ worth of data loss.Other systems may have to be back up and running in 24 hours and 1 hour of data loss is acceptable. This will vary byapplication and by business unit.Once the parameters are set for the particular SLAs, an Oracle DBA can dig in to the many different Disaster Recoveryoptions that are out there. While many different options exist, and while many options may work in different scenarios,3

it is important to examine how they work, what they mean in terms of maintenance, ease of use, cost and effectiveness.There are six different types of Disaster Recovery options that we will cover. The six different methods are: Scripting Hardware/block copying Physical Replication (Oracle Data Guard/ Dbvisit Standby) Logical Replication Backups No Disaster RecoveryBefore delving into the different methods of Disaster Recovery for the Oracle database, a DBA should ensure that theOracle database is in Archive log mode. Archive log mode ensures that an Oracle database keeps copies of the Oracleredo logs in the form of archive logs – and a prerequisite if you are looking at performing online backups. Those archivelogs along with a proper backup of the Oracle database mean that a history of transactions are available. Surprisinglyhaving archive log mode enabled is not the default mode for Oracle databases. It is however very easy to enable withsimple commands and is a basic requirement for any database in production.Disaster Recovery MethodsNo Disaster RecoveryThe first choice might be the scariest choice of them all. That option is to do nothing. Meaning no Disaster Recovery planat all. While on the face of it that seems a bit far fetched, in reality it is all too common. There are several reasons whythis might be the case.The first reason is because often Disaster Recovery is thought of after the implementation phase. By that stage, the teammay have run out of budget or they might be on a strict timeline and think that the addition of a Disaster Recovery planwill slow down the implementation of the project.Another reason is many DBAs think that having a proper plan is too complex. While this might be a horrible reason, alltoo often it actually occurs. Meanwhile, DataCore found that more than half of companies (54%) experienced a downtimeevent that lasted more than 8 hours – one full work day – in the past five years. (DataCore, 2018)Cost is a third reason. Oracle DBAs will sometimes properly scope things out and present a budget and the business unitsreply that they don’t want to pay for Disaster Recovery as it is too costly. It can be compared to gambling, are these peoplewilling to take the risk that nothing bad will happen to their databases?A fourth reason is that sometimes a database starts off as not being a mission critical database. It might be a small‘unimportant’ application that is not seen as critical to the business. Over time, that application might become moreand more important. Yet back before it became important, it was not in the business continuity plan. Once this detail isrealized, a plan should be put in place as quickly as possible.While these might not seem to be big reasons, all of these examples exist in today’s IT departments throughout the world.BackupsAlthough it may be a surprise that some Oracle database users have no Disaster Recovery protection, it should come asno surprise that some customers use backups as their only protection mechanism. There are three main ways in whichcustomers may back up their database: operating system backups, backups using Oracle tools such as Recovery Manager(RMAN) and third party solutions.Operating system backups used to be the go to solution for all Oracle backups. There were two methods that would beemployed. Hot and Cold backups. Cold backups, also called offline backups, are quite simple where you shut the databasedown and using an operating system command and copy all of the relevant datafiles. Care should be taken to also make sureyou have all of the archive logs backed up as well. Many Oracle DBAs will backup their databases every day or every fewdays. The drawback of the Cold backup is downtime – which is not ideal in the time we live in; and it may take quite sometime to do a backup.4

If the backup takes 8 hours, that means the DBA will have to ensure that the backups are done by the time work starts in themorning and that the database is back online and operational. This becomes harder as databases often have to be online forlonger periods in these days where most want 24/7 uptime.As planned downtime shrinks, Oracle DBAs could turn to Hot backups. Hot backups are also referred to as online backups ordynamic backups. Hot backups allow backups to be taken while the database is up and running. This is a huge improvementover Cold backups. Not taking the database offline allows end users to have continual access while these backups take place.Because you aren’t taking the databases offline, DBAs tend to take more frequent backups.DBAs run commands to put the database (or tablespaces) in Hot backup mode, back up the datafiles using operating systemcommands, and when that finishes they take the database (or tablespaces) out of backup mode.Recovery Manager (RMAN) takes all of the ideas around the Hot backup and vastly simplifies them. Oracle has been improvingRMAN in every release since its release with Oracle version 8i. RMAN takes much of the guess work out of backups, Oraclehandles backing up the datafiles, checking for corruption, correlates all of the backups with other backups and they caneven purge older backups that are no longer needed. Oracle can do the backups with multiple streams for faster backupstoo. RMAN has a variety of options to cover a myriad of needs for the Oracle DBA. RMAN commands are easy to learn andshould be used by all Oracle DBAs.A third option to backup databases is using a third party tool. There are many third party tools on the market. Some of themuse similar methods to RMAN, making it easier to do backups with simple commands and easy to use GUIs. The downside isthat whereas RMAN comes at no additional charge to the Oracle database, third party tools cost money and some of themcan be quite costly.Backing up your database is extremely important and should be done so on a regular basis. However, it is important to notethat backups are not the same as true Disaster Recovery. When taking into to consideration the SLAs of the business unit,can a backup program meet the required Recovery Time Objective (RTO) and Recovery Point Objective (RPO)?If a backup is taken every night at midnight and stored offsite, and a disaster occurs at 4pm when the data center is wipedout - the potential of 16 hours of data loss is there. Can that system afford 16 hours of data loss? This is where Oracle DBAsoften think that more frequent backups are the answer. However, if you take more backups, you may start impacting theproduction system as the backups may take more resources during the backup procedure.In that same scenario, imagine that the backup is available, but do you have a second (backup) server with Oracle installed onwhere you can restore the database backup onto? How long does it take to provision a new server, install Oracle and thenrestore the backup? This will impact the RTO and it is a main factor when deciding if backups are a ‘good enough’ DisasterRecovery (DR) solution. Backups are certainly a part of a DR plan but all of the associated issues must be thought through.5

Block CopyA third method of Disaster Recovery (DR) solutions is that of block copying. Block copying comes in two main flavors; virtualmachine copying or SAN replication. Block level copying is a very simple concept in that all of the blocks on the primaryserver storage are copied byte for byte over to the secondary Disaster Recovery server. The great thing is that it is an exactreplica of the source side.A Storage Area Network (SAN) is a dedicated network connecting storage on multiple servers. SAN replication ensures thatany block on the source server is copied over to the target server. The systems are typically fast and have become evenfaster over time.Virtual Machines (VMs) have grown very popular recently. The growth of virtual machines has exploded with the increase incloud computing. A virtual machine is a software enabled server. Multiple environments and operating systems are run withspecific configurations that are controlled by something called a hypervisor. Hypervisors allow the physical server to be astep removed. Hypervisors allow multiple VMs to run on the server and make efficient use of memory, network, storage andCPU from the physical server that is shared among the virtual servers.There are a few major reasons that block copying could be a perfect solution for Disaster Recovery. If you already own a SANsetup, adding another database and application into the mix would not be hard to do. The cost of the SAN is already thereand it is simple to just add another application.The same holds true if you are adding another VM to an already existing setup.Another key feature is that SAN or VM replication can also replicate the application server as well as the database. Whileapplications don’t typically change all that much, having them replicated can often give extra peace of mind.If an organization has not implemented VMs or SANs, then costs can be quite considerable and could be a major factor whenthinking of installing this solution. Some of these solutions can be quite expensive.Another major drawback to block copying is the amount of data that is transferred to the target side. When a user updatesdata in the Oracle database, changes are made to actual datafiles, to undo, and redo logs.6

Eventually those redo log changes are copied into an archive log. That same data resides in four different places. Block copyis often considered ‘zombie like’. It doesn’t know the difference and copies all of the changes that were made. This may resultin data being transferred 3-4 times from the primary site to the standby site.These block copying methods are often an ‘all or nothing’ solution. The danger lies in the fact that if there was corruption ofthe data on the primary database, that would transfer over to the standby database. This applies to logical corruption of thedata as well as physical corruption from hardware failures. However, some vendors do check for physical corruption whenthe blocks are applied on the target.Logical ReplicationLogical replication is another method which many DBAs now look to for Disaster Recovery. Logical replication started withOracle Streams in Oracle 9i. Oracle Streams has been deprecated and with Oracle’s purchase of GoldenGate in 2009, logicalreplication is continually enhanced in the Oracle suite of products. There are also several third party logical replication tools.Logical replication tools mine the Oracle redo logs for the changed data. Those changes are then put into physical files,shipped to a target server, transformed into SQL statements and then applied to the target database. That target database isin read/write mode and can typically be accessed by users. If being used for proper DR, many DBAs will restrict those targetdatabase users to have read only access so that the target database stays ‘pure’.Because the target database is continually updated automatically there is very little lag between the source and targetmachines. The fact that the Oracle database is already up and running on the target side, combined with the fact that thereis very little lag between the source and target, make logical replication an attractive solution to those who are looking forDR solutions.There are a few reasons that many people like the logical replication route. One is that it allows the standby DR site to beused for reporting in real-time. For those that require reporting in real-time, logical replication allows this - in fact, that is oneof the very reasons that logical replication was developed.Another benefit is that logical replication will allow you to perform ‘rolling upgrades’ that would allow the standby site andprimary site to be different versions of the Oracle database. In fact, the two sites could even be on different versions of theoperating system or even on different operating systems all together.7

True DR would dictate that the primary and standby sites are exactly the same. In a disaster, when a Failover occurs, havingthe target being an exact replica of the source will give a level of comfort to the DBA dealing with the disaster. A third partyDR solution such as Dbvisit Standby can provide this for your company’s critical data.There are two major downsides to logical replication for Disaster Recovery. The first is cost. Many of the logical replicationtools can be particularly expensive. The second major reason is that logical replication tools by their very nature don’treplicate everything. They only replicate the changes to tables (or schemas) that you have told them to replicate.Additionally, many of the logical replication tools have certain datatype restrictions that they are unable to replicate. If youare using any of those datatypes, you may not be able to use logical replication for DR.Logical Replication can also be more complex to set up. There are many different pieces and it can be more involved versussome of the other methods such as a third party DR solution, for example, Dbvisit Standby.Physical ReplicationWhile logical replication looks into the redo logs, extracts the data changes, converts the changes into SQL statements andthen runs the SQL statements, physical replication ships binary archive log files from the primary database to be applied tothe standby database. The standby database is typically in recovery mode and waiting for the new archive logs.The source and target databases are exactly the same and remain the same with the shipping of physical files. *Note:there are some examples where physical replication can go cross platform (Microsoft Windows to Linux) but those areoften used for data migration, and usually this is not a perfect DR solution as the standby database does not perfectlymimic the primary database.If you are using Oracle Enterprise Edition, this typically means that you are using the built in feature for your Oracledatabase, Oracle Data Guard. This Enterprise Edition only feature is the standard for Disaster Recovery solutions. Oraclerecommends that you use this feature as it is built and developed by Oracle and it is Oracle database aware. This featurecan be accessed via the command line or via Oracle Enterprise Manager.If you are using Oracle Standard Edition, you won’t have access to Oracle Data Guard. If you still want to use a physicalreplication method you will need to purchase a third party tool like Dbvisit Standby (link to: https://dbvisit.com/oursolutions/dbvisit-standby). This product on a high level has the same concept as Oracle Data Guard - shipping archive logsfrom the primary database to the standby database.The one real criticism of physical replication is that the DR site should be exactly the same as the primary site. This meansthat the operating system and the Oracle database version have to be the same on both sides before bringing in a productlike Dbvisit Standby to take care of your critical data. However, as mentioned above this is not a true criticism, as bestpractice is to have the primary and standby sites with identical setups in any case.8

Designed by DBAs for DBAs, Dbvisit Standby gives you the confidence that you need to ensure business continuityand protect your existing infrastructure, whether on-premises, a hybrid environment or in the cloud, from unexpectedoutages that could put your critical data at risk.ScriptingScripting can be considered a subset of Physical Replication. Scripting is basically the idea that a person will write a scriptto simulate the commands that are software driven in some of the other methods. Some scripts will try to duplicate filereplication by having a script copy files from one server to another. Depending on the level of sophistication of the server,it might have very old files in storage and not many DBAs rely on this method.Most of the scripts duplicate the physical replication methods. There is not much point to duplicating the actions of DataGuard for Oracle Enterprise Edition databases as Data Guard is a ‘built in’ capability.Scripting is much more common when users have Oracle Standard Edition databases. DBAs will create scripts to copy andship the archive logs from a source server to a target server. From there, another script would take over and have thosearchive logs applied to the target server. Scripts are working well and many organizations use them.However, there are many dangers of relying on scripting. The first reason that relying on a script is dangerous is becausethe scripts are often not fully tested. It is one thing to write a script and another to do detailed tests to make sure it works.Has the script been fully tested in many different disaster scenarios?Another reason that scripts are dangerous is often the scripts were written by one person but another person will be runningthem. As time goes on, the author may end up leaving and the DBA is left with a script that is often not well documented andthe DBA is not very familiar with the script. Does the DBA want to trust that script in the event of a disaster? Often duringthe disaster, DBAs don’t want another unknown that they have to worry about. Fears and anxiety are high during disastersand scripting can add to that.Thirdly, when Oracle releases patches or new versions, the scripts are not typically tested against new versions and features.While these features may not affect the scripts, it is usually not worth the risk.A fourth reason that scripts might not be a good idea is because of the cost involved in maintenance. Making sure thatscripts are kept up to date with new database features and versions, along with making sure the scripts are working withany changes made to the operating system can take up a large portion of a DBA’s time. That time spent maintainingscripts could make scripting a cost ineffective solution. It may be cheaper to buy a solution.The nologging operation in Oracle will omit changed data from entering the Oracle archive logs. How will the scriptsrecover from such an operation? The usual answer is that the standby database will have to be recreated which can be atime intensive operation. Scripts, of course, can’t usually alert the DBA that a no logging operation has occurred, and thenby the time you find out (if you find out) it will be incredibly hard to recover the scripts.DataCore has found that site outages typically cost businesses at least 20k for every day of downtime, with more than aquarter of organizations reporting that one day of downtime would cost over 100k. (DataCore, 2018)The final answer of why scripts might not be the proper DR solution is that there is often no one to call for help if yourun into an issue. Whereas if you buy a third party DR solution, there is typically a help desk that has experts in DisasterRecovery that can help you with any issues that a DBA may run into.Summary of DR MethodsThere is no right or wrong answer when it comes to a proper Disaster Recovery solution. Planning a solution that fits theenvironment, the budget, and meets the SLAs of the business unit are what is important. If the application and databasewent into production without a DR plan, it is not too late and you can implement a system after the fact. Many organizationswill also opt to phase in solutions. They may start with just backups and then in the next budget cycle move to upgrading toa more robust solution as time and money warrant.Physical replication incurs the lowest overhead and ensures that the standby database (or databases) is kept consistent withthe primary database. There is no restriction on the types of data that can be replicated, and the complexity of maintainingconsistency and completeness is handled entirely by the database recovery mechanism. Dbvisit Standby provides Disaster9

Recovery and business continuity for Oracle databases by creating a copy of the original database and keeping it up to dateusing physical database replication.On-Premises, The Cloud, and Hybrid ConfigurationsDatabases don’t reside on isolated servers these days. While databases on-premises are still very much in use, the worldhas also exploded with databases on virtual machines as well as the ever popular cloud options.International Data Corp. estimates that companies lose an average of 84,000 for every hour of downtime. According toStrategic Research, the cost of downtime is estimated at close to 90,000 per hour. (ioSafe, 2019) There is at least onegood reason why you should be looking into a Disaster Recovery solution for your software, and we’ll take you throughthree different configurations you can look into.On-PremisesFor the past forty years, Oracle databases have been on different types of servers all over the world. Oracle customershave placed their databases on huge clustered machines but also tiny handheld devices.Historically, Oracle databases were installed on servers that resided in dedicated data centers either onsite or in a specialhosting facility if the organization did not have proper space and expertise to house the servers.DBAs could have one very large Oracle database on its own dedicated server or they could have dozens of Oracledatabases co located on one server. Much depends on how critical the database is to the organization as well as factorssuch as budget, skills and availability of servers.Oracle came out with Exadata allowing customers to have purpose built appliances with the Oracle database designedto exist on a specific hardware platform. Customers could scale up the size of their databases as well as consolidatemany databases onto one machine. Small and Midsized customers were not left out as Oracle soon released the OracleDatabase Appliance for those customers who did not need the power and size of an Exadata, but still wanted a purposebuilt machine.With the release of Oracle Database 12c, Multitenant databases were introduced. This allowed even more “databases”to be created in a single multitenant container, allowing shared memory and background processes, easier consolidatedpatching and upgrading, and the ability to manage and maintain many databases as if they were a single database.All of these databases on-premises still need to have a thoughtful Disaster Recovery program initiated for each database.After talking to the business units to discuss RTO and RPO goals for each database, a pr

The Disaster Recovery methods we discuss are no Disaster Recovery, backups, hardware/block copying, logical replication, physical replication, and scripting. It then looks at the options for delivering Disaster Recovery with Oracle SE in the cloud, and the configurations available. It discusses Dbvisit Standby as an example of a cost-effective .