Chaos Engineering: Finding Failures Before They Become Outages - Cloudinary


Chaos Engineering: Finding Failures Before They Become Outages

Chaos Engineering: Breaking Your Systems for Fun and Profit

Diane Glazman will never fly British Airways (BA) again. Glazman and her husband were among the 75,000 people affected by the three-day BA system failure in the summer of 2017. On their way from San Francisco to their son’s college graduation in Edinburgh, they were stranded in London, without their luggage, at the beginning of what was to be a three-week dream tour of Scotland. “Listening to the excuses was frustrating because nothing explained why BA was so unprepared for such a catastrophic failure,” says Glazman.

BA lost an estimated $135 million due to that outage. The culprit turned out to be a faulty uninterruptible power supply (UPS), the corporate cousin of the $10 gadget you can find in your corner Radio Shack. And that loss figure doesn’t count the forever-gone trust of customers like Glazman, who will look elsewhere for transatlantic flights the next time she travels.

BA, of course, isn’t alone in having suffered financially from system downtime. There were also United Airlines (200 flights delayed for 2.5 hours, thousands of passengers stranded or missing connections), Starbucks (couldn’t accept any payments but cash in affected stores), Facebook (millions of users offline and tens of millions of ads not served during 2.5 hours of downtime), and WhatsApp (600 million users affected, 5 billion messages lost). And when Amazon S3 went down in February 2017, it collectively cost Amazon's customers $150 million.

(Figure: companies that suffered major outages in 2017)

In fact, 2017 was a banner year for system outages, and for their cost.

The 2017 ITIC Cost of Downtime survey finds that 98% of organizations say a single hour of downtime costs more than $100,000. More than eight in 10 companies indicated that 60 minutes of downtime costs their business more than $300,000. And a record one-third of enterprises report that one hour of downtime costs their firms $1 million to more than $5 million (see Figure 1). The average cost of a single hour of unplanned downtime has risen by 25% to 30% since 2008, when ITIC first began tracking these figures.

Cost of 1 hour of downtime | % of companies reporting
More than $100,000 | 98%
More than $300,000 | 80%
$1M to $5M | 33%
Figure 1: Cost of 60 minutes of downtime

So how can organizations cut the risk of downtime? The answer: break your systems on purpose. Find out their weaknesses and fix them before they break when least expected.

It’s called chaos engineering, and it’s being adopted by leading financial institutions, internet companies, and manufacturing firms throughout the world. Such businesses understand that the trillions of dollars lost annually to downtime are not acceptable to their customers, their stockholders, or their employees.

A more complex, distributed world

In the traditional corporate computer environment of 30 years ago, software ran in a highly controlled environment that had few moving parts or variables. But in the new business world that depends on the internet, globally connected systems, a mix of cloud and bare-metal infrastructure, and more moving parts than you can count, your software depends on a lot of infrastructure and services that are outside your control to run smoothly.

(Figure: companies moving to the cloud. Already moved: 70%. Planning to move: 16%. In the cloud by the end of 2017: 86%.)

First and foremost, there’s the rapid shift to the cloud. A full 70% of companies have already moved at least one application to the cloud, according to IDC, with 16% more planning to do so by the end of 2017.

Then there’s the rise of microservices. Most enterprises are developing their software today on a microservice architecture. Applications are built as small and independent but interconnected modular services. Each service runs a unique process meant to meet a particular business goal. For example, one microservice might track inventory levels of products. Another might handle serving personalized recommendations to customers.

Then there are all the web servers, databases, load balancers, routers, and more that must work together to form a coherent whole. This is not easy.

The good news is that as this modular, distributed infrastructure continues to evolve, businesses can do things with software that simply weren’t possible before. By shifting to what are also called “loosely coupled services,” which can be developed and released independently of each other by similarly organized teams who are empowered to make changes, businesses can radically reduce their time to market.

But these new capabilities come at a price.

Today, businesses face a serious “complexity gap.” It’s difficult for even the most technically astute individuals to understand how all the different bits and pieces work together. The ability of your IT professionals to manage the ever-evolving sophistication of computer infrastructure falls short of what’s needed. With so many possible points of failure in your systems, your business is extremely vulnerable.

This is happening at a time when people, businesspeople as well as consumers, are increasingly dependent on the internet and on the services that corporations deliver over it. The number of internet users worldwide in 2017 is 3.58 billion, up from 3.39 billion in 2016.

Even fewer people today can operate without a phone. In 2017 the number of mobile phones reached 4.77 billion, and it is expected to pass the five billion mark by 2019. And all these users, both consumers and businesses, are demanding: DoubleClick found in 2016 that 53% of mobile site visits are abandoned if a web page takes longer than three seconds to load.

As a result, the stakes have never been higher for companies to maintain the uptime of their systems.

Chaos engineering, a primer

Imagine that it’s 1796, and you have been selected to be injected with Edward Jenner’s brand-new smallpox vaccine. You are told that he is going to put the actual virus into your bloodstream. You are also told that you won’t get sick. Instead, it will make it impossible for you to get sick with this particular virus, because it will make your system stronger. You might recoil, thinking that the risk is too great. Yet you allow yourself to be vaccinated. And, of course, you are much better off than those who refuse the treatment.

This is exactly what chaos engineering does. All computers have limits and possible points of failure. By injecting a system with something that has the potential to disrupt it, you can identify where the system may be weak, and can take steps to make it more resilient.

That covers the systems under your purview. Problems that occur with your cloud service providers are out of your control, and you can’t resolve outages by adding extra boxes or power supplies.

Think of all your data living in the cloud, in Amazon S3 or DynamoDB, and the hosted services you depend on, such as Salesforce or Workday. If they fail, you're at their mercy. Chaos engineering isn’t just essential for your applications, it’s essential for the companies behind those applications, which is why Netflix, Uber, and Amazon all have teams dedicated to chaos and reliability: they know they cannot afford to let their customers down.

Here’s where chaos engineering comes in: you know you have these potential points of failure and vulnerabilities. So why wait until there’s a problem?

Imagine attempting to break your systems. On purpose. Before they fail on their own. Because that is what chaos engineering does. By triggering failures intentionally in a controlled way, you gain confidence that your systems can deal with those failures before they occur in production.

The goal of chaos engineering is to teach you something new about your systems’ vulnerabilities by performing experiments on them. You seek to identify hidden problems that could arise in production before they cause an outage. Only then will you be able to address systemic weaknesses and make your systems fault tolerant.

Critical to chaos engineering is that it is treated as a scientific discipline. It relies on precise engineering processes. Six steps in particular are followed.

1. Form a hypothesis: Ask yourself, “What could go wrong?”
2. Plan your experiment: Determine how you can recreate that problem in a safe way that won’t impact users (internal or external).
3. Minimize the blast radius: Start with the smallest experiment that will teach you something.
4. Run the experiment: Make sure to carefully observe the results.
5. Celebrate the outcome: If things didn't work as they should, you found a bug! Success! If everything went as planned, increase the blast radius and start over at #1.
6. Complete the mission: You’re done once you have run the experiment at full scale in production, and everything works as expected.
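The loop described in these steps can be sketched in code. The following Python is a minimal illustration under stated assumptions, not a reference implementation: `run_escalating_experiment`, `ExperimentResult`, and the three callbacks are hypothetical names introduced here, and doubling the blast radius after each success is an assumed escalation policy that the text does not prescribe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    blast_radius: int       # number of targets the failure was injected into
    hypothesis_held: bool   # True if the system behaved as predicted


def run_escalating_experiment(
    inject_failure: Callable[[int], None],   # hypothetical hook: break `n` targets
    steady_state_ok: Callable[[], bool],     # hypothetical hook: is the service still healthy?
    rollback: Callable[[], None],            # hypothetical hook: restore the normal state
    full_scale: int,
    start_radius: int = 1,
) -> List[ExperimentResult]:
    """Repeat one experiment at growing blast radii until it fails or reaches full scale."""
    results: List[ExperimentResult] = []
    radius = start_radius
    while True:
        radius = min(radius, full_scale)
        try:
            inject_failure(radius)       # run the experiment
            held = steady_state_ok()     # observe the results
        finally:
            rollback()                   # always restore the normal state
        results.append(ExperimentResult(radius, held))
        if not held or radius == full_scale:
            return results               # a bug was found, or the mission is complete
        radius *= 2                      # assumed policy: double the blast radius


# Toy usage: 8 hosts, and the service stays healthy while at least 5 are up.
down = {"count": 0}
results = run_escalating_experiment(
    inject_failure=lambda n: down.update(count=n),
    steady_state_ok=lambda: 8 - down["count"] >= 5,
    rollback=lambda: down.update(count=0),
    full_scale=8,
)
for result in results:
    print(result)
```

In the toy run the experiment passes with one and two hosts down and fails with four, which is the point at which you would stop, fix the weakness, and start again from the smallest radius.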

Some examples of what you might do to a hypothetical system when performing a chaos engineering experiment:

- Reboot or halt the host operating system. This allows you to test things like how your system reacts when losing one or more cluster machines.
- Change the host’s system time. This can be used to test your system’s capability to adjust to daylight saving time and other time-related events.
- Simulate an attack that kills a process. This can be used to simulate application or dependency crashes.

Naturally, you immediately address any potential problems that you uncover with chaos engineering. Indeed, the point of simulating potentially catastrophic events is to make them non-events that are irrelevant to our infrastructure’s ability to perform as required.

Chaos engineering differs in several ways from the regular testing that everyone does as a matter of course. Normal testing is done during build and compile activities, and doesn't test for different configurations, behaviors, or factors beyond your control. Additionally, routine testing doesn’t account for people: it does not train and prepare them for the failures they will be responsible for fixing live, in the middle of the night.
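The process-kill example can be tried on a very small scale. The sketch below is an assumption-laden illustration rather than a real chaos tool: `Supervisor`, `WORKER_CMD`, and `kill_worker_experiment` are hypothetical names, the "service" is just a sleeping child process, and the hypothesis being tested is that the supervisor replaces a killed worker within a timeout.

```python
import subprocess
import sys
import time

# Stand-in for a real service: a child Python process that sleeps for a minute.
WORKER_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


class Supervisor:
    """Hypothetical minimal supervisor that restarts the worker once it has exited."""

    def __init__(self) -> None:
        self.worker = subprocess.Popen(WORKER_CMD)
        self.restarts = 0

    def check(self) -> None:
        if self.worker.poll() is not None:   # the worker has exited
            self.worker = subprocess.Popen(WORKER_CMD)
            self.restarts += 1

    def stop(self) -> None:
        self.worker.kill()
        self.worker.wait()


def kill_worker_experiment(supervisor: Supervisor, timeout_s: float = 5.0) -> bool:
    """Kill the worker, then report whether a replacement came up before the timeout."""
    victim = supervisor.worker
    victim.kill()    # inject the failure
    victim.wait()    # make sure the process is really gone
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        supervisor.check()
        replacement = supervisor.worker
        if replacement is not victim and replacement.poll() is None:
            return True      # hypothesis held: the service recovered
        time.sleep(0.1)
    return False             # hypothesis failed: you found a weakness


if __name__ == "__main__":
    supervisor = Supervisor()
    try:
        print("recovered:", kill_worker_experiment(supervisor))
    finally:
        supervisor.stop()    # clean up the stand-in worker
```

The blast radius here is a single throwaway process on one machine, which is the kind of starting point the text recommends before moving to shared environments.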

Benefits of chaos engineering

Companies like Amazon, Netflix, Salesforce, and Uber have been using chaos engineering for years to make their systems more reliable. For internet companies whose very existence depends on their ability to be “up” at all times, chaos engineering was a necessity. Now businesses in other industries, financial services in particular, are starting to follow suit and implement chaos engineering programs of their own.

The benefits of chaos engineering include the following:

- Help technology professionals see how systems behave in the face of failure, as their assumptions are often incomplete or inaccurate
- Validate that hypothetical defenses against failure will work when needed by exercising them at scale in production environments
- Provide the ability to revert systems to their original states without impacting customers, employees, or consumers
- Save time and money spent responding to system outages

How do you know if a chaos engineering program is working? The top-level measure is overall system availability. For example, companies like Amazon or Netflix measure how available they are by whether their customers can use their product. They define availability in terms of “9s.” Four nines availability means that a system is available 99.99% of the time. Five nines availability is better, meaning the system is available 99.999% of the time. Six nines is better still. Specific applications and sub-services are often measured using this metric as well.

Translate these numbers into actual outage time per year and you see why it matters, and why six nines is today considered the gold standard of reliability (see Figure 2).

Another metric is the frequency and duration of outages. Yet another is the operational burden that system outages place on staff. How often did you have to page an IT support professional? How frequently did they have to answer a call at 2am to firefight an issue?

Chaos engineering is also good for disaster recovery (DR) efforts. If you regularly break your systems using tight experimental controls, then when your systems go down unexpectedly, you’re in a much better position to recover quickly. You have your people trained, and you can respond more promptly. You can even put self-healing properties in place so you can continue to maintain service with minimal disruption.

If we break a system in a controlled and careful manner and we make sure we can recover from it, then, when outages happen unexpectedly, we're in a better position because our human workers have been trained to respond systematically to failure, as opposed to being called at 2am.

Systems can also self-heal and auto-recover, so that they can operate in a degraded state and still maintain their service levels. The goal is resilience, rather than stability. Resilience means that systems can gracefully handle inevitable failure without impacting users.

Using chaos engineering for disaster recovery is also important for compliance reasons. Sarbanes-Oxley II as well as industry- or geography-specific regulatory mandates require that you can recover quickly from a disaster. But efforts to comply are often done at the theoretical level, as so-called “table-top” exercises, and are therefore incomplete.

No. of 9s | Amount of downtime per year
Four nines (99.99% availability) | 52 minutes 36 seconds
Five nines (99.999% availability) | 5 minutes 15 seconds
Six nines (99.9999% availability) | 32 seconds
Figure 2: Translating 9s of availability into minutes of downtime
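The downtime figures follow directly from the availability percentage: allowed downtime is the length of the period multiplied by the unavailable fraction. The short Python sketch below works this out per year. The function names are introduced here for illustration, and the 365.25-day year is an assumption (it is the year length that the four-nines figure of 52 minutes 36 seconds implies); results can differ by a second from a published table depending on whether values are rounded or truncated.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year, leap days included


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of `nines` nines (4 means 99.99%)."""
    unavailability = 10.0 ** -nines       # 4 nines leaves 0.0001 of the year unavailable
    return SECONDS_PER_YEAR * unavailability


def format_duration(seconds: float) -> str:
    """Render a duration rounded to the nearest whole second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs} s" if minutes else f"{secs} s"


for n in (4, 5, 6):
    print(f"{n} nines: {format_duration(downtime_seconds_per_year(n))}")
```

The same function makes it easy to check a service-level target against a different window, for example by swapping the constant for the number of seconds in a 30-day month.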

Best practices in chaos engineering

Many businesses are skeptical that deliberately trying to crash systems will make them stronger. And they are correct that there are risks. However, there are also best practices that mitigate those risks.

Minimize the “blast radius.” Start with the smallest chaos experiment you can perform that will teach you something about your system. See what happens. Then increase the scope as you learn and as your confidence grows.

Don’t be a chaos monkey. Chaos Monkey was Netflix's famous (or infamous) tool that randomly terminated servers in production. Unfortunately, today many people believe that chaos engineering means randomly breaking things. This is not an optimal approach because the effect of “random” failures is difficult to measure: you are not approaching the problem using experimental methods. The idea behind chaos engineering is to perform thoughtful, planned, and scientific experiments instead of simply random failure testing.

Build (using open source) | Buy
Limited set of tools available | Growing availability of solutions
Everyone who does it is reinventing the wheel | Piggyback on earlier successes
Costly and time-consuming to train internal team | Avoid engineer and administrator burnout
Unsupported open source releases open up security vulnerabilities | Security embedded in solution
No “kill switch” or safety valve to stop out-of-control experiments from taking down production systems | Kill switch to avoid impacting users
Figure 3: The debate between building and buying chaos engineering tools

Start in a staging environment. Yes, you must eventually test in your production system, but it makes sense to start in a staging or development environment and work your way up. Start with a single host, container, or microservice in your test environment. Then try to crash several of them. Once you've covered 100% of your test environment, reset to the smallest possible scope in production and take it from there.

Avoid the “drift into failure.” This concept, invented by flight accident investigation expert Sidney Dekker, refers to the fact that tension always exists in systems between efficiency and safety. Since businesses need to be mindful of costs, they tend to operate on the edge of safety. So once you understand a particular kind of failure and you've tested it, you want to automate the testing of it in a continuous deployment pipeline so you maintain that competence.

Always have a kill switch. This is akin to an “undo” button or safety valve. Make sure you have a way to stop all chaos engineering experiments immediately, on the spot, and return all systems to their normal state. If your chaos engineering causes a high-severity incident (SEV), then track it carefully and do a full post-mortem analysis of what went wrong.

Fix known problems first. Never conduct a chaos experiment in production if you already know that it will cause severe damage, possibly affecting customers and your reputation. Always try to fix known problems first.
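One way to picture the kill switch is as a flag that every experiment checks before each injection step, combined with a rollback that always runs. The Python below is a minimal sketch of that idea, not the design of any particular chaos engineering product: `KillSwitch` and `run_with_kill_switch` are hypothetical names, and a real system would also need a way to interrupt a step that is already in progress.

```python
import threading
from typing import Callable, Iterable


class KillSwitch:
    """Thread-safe flag that an operator or a monitor can trip to abort experiments."""

    def __init__(self) -> None:
        self._tripped = threading.Event()

    def trip(self) -> None:
        self._tripped.set()

    @property
    def tripped(self) -> bool:
        return self._tripped.is_set()


def run_with_kill_switch(
    steps: Iterable[Callable[[], None]],   # hypothetical failure-injection steps
    rollback: Callable[[], None],          # hypothetical hook: restore the normal state
    switch: KillSwitch,
) -> bool:
    """Run injection steps in order and stop early if the switch has been tripped.

    Returns True if every step ran and False if the run was aborted.
    The rollback runs in both cases, and also if a step raises an exception.
    """
    try:
        for step in steps:
            if switch.tripped:
                return False
            step()
        return True
    finally:
        rollback()


if __name__ == "__main__":
    switch = KillSwitch()
    steps = [
        lambda: print("inject: add latency"),
        lambda: switch.trip(),              # simulate an operator hitting the switch
        lambda: print("inject: drop traffic (should never run)"),
    ]
    completed = run_with_kill_switch(steps, lambda: print("rollback"), switch)
    print("completed:", completed)
```

Because the flag is a `threading.Event`, a monitoring thread that watches an error-rate threshold could trip the same switch without any extra locking.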

Conclusion: eschew complacency

Systems will break. And as systems continue to grow in complexity, they will break more often. If you're not prepared for that, they'll break in unexpected ways at unexpected times and bring your software or service down. You’ll have unhappy customers and unhappy employees, and the costs are probably higher than you think.

Too many businesses are complacent. They think that because they haven’t had a major system outage before, or at least not one that directly impacted customers, they’re safe. Or they think that the cost of deploying chaos engineering is more than the cost of simply fixing any problems that arise. They’re wrong. Companies that don’t address resiliency issues with chaos engineering may end up hiring tens or even hundreds of systems administrators just to maintain system uptime. That adds up.

In today’s interconnected, internet-based world, no one is safe from system failure. The only way to keep it from impacting your customers, employees, partners, reputation, and bottom line is to address it proactively. Chaos engineering is the optimal way to do this.

