Building Fault-Tolerant Applications On AWS - University Of North Florida


BUILDING FAULT-TOLERANT APPLICATIONS ON AWS
Group 2: David King, Robert Lowstetter, William Hackney

WHAT IS FAULT-TOLERANCE?
“The ability for a system to remain in operation even if some of the components used to build the system fail”
Netflix describes Fault Tolerance as a “Requirement, Not a Feature”
It does not matter what time of day or night it is. The web site or application is expected to be available 24/7!
Unavailability will frustrate customers
Lost customer satisfaction, loyalty.
Loss of

AWS IS IDEAL!
AWS is not the only platform suited to building fault-tolerant applications.
You could do this on almost any platform, but it would consume a lot of time and resources.
THIS is what makes AWS unique in building Fault-Tolerant applications!
You can build these Fault-Tolerant applications with little interaction and minimal investment.
Remember our discussion of cloud elasticity?
Elasticity is the ability of the system to adapt to change.
You want your system's resources to match its current use as closely as possible.

A SUPER-QUICK RECAP OF AMIS
AMI: Amazon Machine Image
A template that holds a software configuration (operating system, applications, etc.)
Gets applied to an instance type
Instance type: what hardware is in the VM?
You choose how much RAM and CPU meets the needs of your application.
Remember elasticity! This is what made AWS unique in building Fault-Tolerant applications!

HOW DO AMIS HELP US?
Say we have an application that we have built on AWS.
Customers are live with the application!
If it were to fail, our users would become quite frustrated and likely begin to dislike our application.
We mitigate this by using AMIs!
Say our application has failed and we do not know why.
We need to get it back up and running before we lose consumer loyalty!
We can easily replace the application! All we need to do is launch another instance with the SAME AMI!

HOW DOES THIS WORK?
There are many tools at your disposal for replacing the failing application:
Command line
AWS Management Console
Auto Scaling service (discussed later)
Etc.
We could have an instance already running as a backup in case our server were to fail (Amazon Elastic Block Store; more details later)!
All we would do is take the Elastic IP address (described later) that points to the failing instance and redirect it to the backup instance!
ALWAYS HAVE A BACKUP!

UNDERSTANDING THE FIRST STEP
This was just the first step of Fault-Tolerance!
As you can see, this is very useful for keeping our application constantly running and keeping our consumers loyal!
The ability to easily bring up a new instance when one fails is critical for recovering from a failure!
Let’s get a firm understanding of the importance of Fault-Tolerance through an example I believe everyone will understand.

HOW IMPORTANT IS FAULT-TOLERANCE?
Look to the diagram to the right.
This is how Netflix is typically used.
We would have a plethora of users that would request access to multiple services at a time.
What could cause a fault? There are so many possible ways that a dependency could go down.

INTERRUPTED STREAMING
New movie or popular show?
Higher volume and traffic
Fault Tolerance is necessary
Netflix mitigates faults via Elastic Load Balancing
This is just one example of how they implemented fault tolerance in AWS!

ELASTIC BLOCK STORE (AMAZON EBS)
Stepping away from Netflix
EBS: Offline storage; backups!
Perfect for recovering from a fault
Because the new instance is a copy of the original, you would lose essentially no data and no functionality!
The annual failure rate for EBS volumes is between 0.1% and 0.5%, rather than the 4% of a standard data scenario!
This is due to the data being stored redundantly (which permits recovery from errors through reconstruction, e.g. RAID)!

SNAPSHOTS AND ELASTIC IPS
Taking a snapshot (backup)
Back up EBS data via snapshots, point-in-time copies of the volumes attached to your EC2 instance.
Re-associating IP
Re-associate IPs from the faulted instance to a new working instance.
This can be done via the API or through the Management Console.
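The re-association step can be pictured with a toy model. This is only a sketch in plain Python, not the AWS API: the `ElasticIpTable` class, the `fail_over` helper, the instance IDs, and the address 203.0.113.10 are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    instance_id: str       # hypothetical ID, for illustration only
    healthy: bool = True


class ElasticIpTable:
    """Toy model: each public IP points at exactly one instance."""

    def __init__(self):
        self._assoc = {}

    def associate(self, public_ip, instance):
        # Associating an IP that is already in use simply re-points it.
        self._assoc[public_ip] = instance

    def target(self, public_ip):
        return self._assoc[public_ip]


def fail_over(table, public_ip, standby):
    """Re-point public_ip at the standby if its current target is unhealthy."""
    current = table.target(public_ip)
    if current.healthy:
        return current
    table.associate(public_ip, standby)
    return standby


table = ElasticIpTable()
primary = Instance("i-primary")
standby = Instance("i-standby")
table.associate("203.0.113.10", primary)

primary.healthy = False                      # the primary instance faults
active = fail_over(table, "203.0.113.10", standby)
print(active.instance_id)                    # prints "i-standby"
```

Clients keep using the same public address throughout; only the instance behind it changes.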

HOUSE OF CARDS
2% of the entire Netflix population, 670,000 users, BINGE watched all 13 episodes of House of Cards on its first weekend.
This is a 400% increase in traffic at the time over its first season.
Paving the way for Internet TV
Utilizing Elastic Load Balancing, Netflix mitigates this extreme traffic.

FAILURES CAN BE USEFUL

YOUR APPLICATION AVAILABILITY INTERRUPTED
Sophisticated software systems are dependent on a number of components that are out of their control,
e.g. operating system, firmware, and hardware

MOST SOFTWARE DEGRADES
1) Leak memory and/or resources, e.g. application frameworks, operating systems & device drivers
2) File systems fragment over time
3) Hardware physically degrades over time, particularly storage

REGULARLY MAINTAINED AND SERVICED
In a traditional IT environment, hardware maintenance and servicing have practical and financial limits.
These limits can constrain how efficient and effective servicing is in traditional IT environments.

FAILURE FORCING RESOURCE REFRESH
The AWS platform can be refreshed periodically with new server instances, which reduces potential system degradation.
AWS server instances themselves become immaterial and even disposable.
Set expiry dates to refresh instances regularly to ensure that any leaks or degradation will not impact the application.
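One way to picture an expiry policy is a small helper that picks out instances older than a chosen age. This is a minimal sketch in plain Python; the function name, the instance IDs, and the seven-day limit are assumptions for illustration, not something the slides or AWS prescribe.

```python
from datetime import datetime, timedelta


def instances_to_refresh(launch_times, max_age, now):
    """Return the IDs of instances at least max_age old, oldest first.

    launch_times maps a (hypothetical) instance ID to its launch time.
    """
    expired = [
        (launched, instance_id)
        for instance_id, launched in launch_times.items()
        if now - launched >= max_age
    ]
    return [instance_id for _, instance_id in sorted(expired)]


now = datetime(2024, 1, 10)
fleet = {
    "web-a": datetime(2024, 1, 1),    # 9 days old
    "web-b": datetime(2024, 1, 9),    # 1 day old
    "web-c": datetime(2023, 12, 25),  # 16 days old
}
print(instances_to_refresh(fleet, timedelta(days=7), now))
# prints ['web-c', 'web-a']
```

Each ID returned would be replaced by a fresh instance launched from the same AMI, so leaks and fragmentation never accumulate for longer than the chosen age.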

AWS AUTO SCALING
Rules can be defined for scaling EC2 capacity by launching or terminating server instances:
1) When the number of functioning server instances is above or below a certain number
2) Using Amazon CloudWatch to monitor a certain threshold of resource utilization across the server instance fleet
Add more instances in response to an increasing load; automatically terminate extra instances when no longer needed.
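The two kinds of rule above can be sketched as a single decision function. This is an illustration of the idea only, not the Auto Scaling service's own logic; the function name and every threshold (minimum of 2, maximum of 10, 90% and 30% CPU) are assumed values.

```python
def scaling_decision(healthy, avg_cpu, minimum=2, maximum=10,
                     high_cpu=90.0, low_cpu=30.0):
    """Return how many instances to add (positive) or remove (negative).

    healthy  -- number of functioning instances right now
    avg_cpu  -- average CPU utilization of the fleet, in percent
    All thresholds are illustrative assumptions.
    """
    # Rule 1: keep the number of functioning instances at or above the minimum.
    if healthy < minimum:
        return minimum - healthy
    # Rule 2: react to a monitored utilization threshold.
    if avg_cpu > high_cpu and healthy < maximum:
        return 1
    if avg_cpu < low_cpu and healthy > minimum:
        return -1
    return 0


print(scaling_decision(healthy=1, avg_cpu=50.0))   # prints 1  (replace a failed instance)
print(scaling_decision(healthy=3, avg_cpu=95.0))   # prints 1  (load is increasing)
print(scaling_decision(healthy=3, avg_cpu=10.0))   # prints -1 (extra capacity is idle)
```

Checking the instance count first means a failed instance is replaced even when utilization looks normal.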

AMAZON CLOUDWATCH
For example, create a CloudWatch alarm that fires whenever the CPUUtilization metric for the EC2 instances exceeds 90%.
When the alarm fires, Auto Scaling launches and configures another instance to join the application tier.
The instance takes a couple of minutes to launch. During that time, the CloudWatch alarm could continue to fire, resulting in Auto Scaling launching another instance each time the alarm goes off.
DeveloperGuide/Cooldown.html

AUTO SCALING COOLDOWN
With a cooldown period in place, Auto Scaling launches an instance and then suspends any scaling activities until a specific amount of time elapses.
The newly launched instance has time to start handling application traffic.
After the cooldown period expires, scaling actions resume for the Auto Scaling group.
t/DeveloperGuide/Cooldown.html
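The effect of a cooldown can be seen in a short simulation. This is a sketch of the idea rather than the service's implementation: the `simulate_launches` function, the alarm firing once a minute, and the 300-second cooldown are all assumptions made for the example.

```python
def simulate_launches(alarm_times, cooldown):
    """Return the times (in seconds) at which an instance is actually launched.

    An alarm triggers a launch only if at least `cooldown` seconds have
    passed since the previous launch; otherwise it is ignored.
    """
    launches = []
    for t in sorted(alarm_times):
        if not launches or t - launches[-1] >= cooldown:
            launches.append(t)
    return launches


# The alarm keeps firing once a minute while the first new instance boots.
alarms = [0, 60, 120, 180, 240, 300, 360]

print(simulate_launches(alarms, cooldown=0))    # every alarm launches an instance
print(simulate_launches(alarms, cooldown=300))  # prints [0, 300]
```

Without a cooldown the seven alarms produce seven launches; with a 300-second cooldown the same alarms produce two, which gives the first new instance time to start taking traffic before any further scaling happens.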

ELASTIC LOAD BALANCING
Accepts requests at a single DNS host name, then distributes incoming traffic for the application across a pool of several EC2 instances
Detects unhealthy instances within the pool of EC2 instances and automatically reroutes traffic to healthy instances
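The two behaviours above, spreading traffic and skipping unhealthy instances, can be sketched with a toy balancer. Round-robin is used here only because it is easy to follow; it is an assumption, not a statement about how Elastic Load Balancing chooses targets. The class and instance names are hypothetical.

```python
class ToyLoadBalancer:
    """Round-robin over a pool, skipping instances marked unhealthy."""

    def __init__(self, instances):
        self._pool = list(instances)
        self._healthy = {name: True for name in self._pool}
        self._next = 0

    def mark(self, instance, healthy):
        # In a real system this would be driven by periodic health checks.
        self._healthy[instance] = healthy

    def route(self):
        """Return the instance that should receive the next request."""
        for _ in range(len(self._pool)):
            candidate = self._pool[self._next]
            self._next = (self._next + 1) % len(self._pool)
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy instances in the pool")


lb = ToyLoadBalancer(["web-a", "web-b", "web-c"])
lb.mark("web-b", healthy=False)          # a health check fails for web-b
print([lb.route() for _ in range(4)])
# prints ['web-a', 'web-c', 'web-a', 'web-c']
```

Marking `web-b` healthy again puts it back into the rotation on the next pass.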


COMBINING AUTO SCALING & ELASTIC LOAD BALANCING
Elastic Load Balancing uses a single DNS name for addressing.
Auto Scaling ensures there is always the right number of healthy EC2 instances to accept requests. In conjunction with Elastic Load Balancing, each instance will handle a fraction of the incoming traffic.
Redundancy pattern N+1, where N resources are sufficient to handle the anticipated load.
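The N+1 pattern comes down to simple arithmetic: size the fleet for the anticipated load, then add one spare. The sketch below assumes load is measured in requests per second and that every instance has the same capacity; the function name and the numbers are made up for the example.

```python
import math


def n_plus_one(peak_load, capacity_per_instance):
    """Return the fleet size under the N+1 pattern.

    N is the smallest number of instances able to carry peak_load;
    one extra instance is added so a single failure leaves N running.
    """
    n = math.ceil(peak_load / capacity_per_instance)
    return n + 1


fleet = n_plus_one(peak_load=900, capacity_per_instance=250)
print(fleet)                 # prints 5  (N = 4, plus one spare)

# With all 5 healthy each instance sees 180 req/s; after one failure
# the remaining 4 see 225 req/s each, still under the 250 req/s limit.
print(900 / fleet, 900 / (fleet - 1))
```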

REGIONS AND AVAILABILITY ZONES
Simultaneously running an application distributed across geographically distant Amazon Web Services datacenters achieves greater fault tolerance.
If a single datacenter fails, the application is protected by geographically distant datacenters.
[Figure: AWS global infrastructure]

AWS GEOGRAPHIC REGIONS
5 regions:
1) US East (Northern Virginia)
2) US West (Northern California)
3) EU (Ireland)
4) Asia Pacific (Singapore)
5) Asia Pacific (Japan)
Amazon S3 (Simple Storage Service) US Standard encompasses datacenters throughout the United States.

AVAILABILITY ZONES (AZS)
Within each Region are Availability Zones that are engineered to be insulated from failures in other AZs and provide inexpensive, low-latency network connectivity to other AZs in the same region.
The Amazon EC2 service level agreement commitment is 99.95% availability for each Amazon EC2 region.
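To make the 99.95% figure concrete, it can be converted into the downtime it still allows. The sketch below assumes a 365-day year; the function name is made up for the example.

```python
def max_downtime_hours(availability_pct, period_hours=365 * 24):
    """Hours of downtime still permitted at the given availability level."""
    return period_hours * (1 - availability_pct / 100)


hours = max_downtime_hours(99.95)
print(round(hours, 2))        # prints 4.38 (hours per year)
print(round(hours * 60))      # prints 263  (minutes per year)
```

So even at 99.95%, a region may be unavailable for a little over four hours a year, which is why the application itself still needs to tolerate faults.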

BUILDING MULTI-AZ ARCHITECTURES
Achieve High Availability by deploying an independent copy of each application stack across multiple Availability Zones, creating a multi-site solution.
Elastic Load Balancing detects healthy and unhealthy EC2 instances in the same AZ or in multiple AZs, stops routing traffic to the unhealthy instances, and routes it to the remaining healthy ones.

FAULT TOLERANT BUILDING BLOCKS
Amazon Simple Queue Service (SQS)
Amazon Simple Storage Service (S3)
Amazon SimpleDB
Amazon Relational Database Service (RDS)

AMAZON SIMPLE QUEUE SERVICE (SQS)
Distributed message queuing system
EC2 Instance Based
URL-based message queues
Any server process that understands HTTP can access the queue
ACL-based security system
Four-day message retention
Auto Scaling of EC2/SQS
Visibility Timeout after message pull
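The visibility timeout is what makes a queue tolerate worker failures: a pulled message is hidden from other consumers for a while, and reappears if it is not deleted in time. The toy queue below is an in-memory sketch of that behaviour, not the SQS API; the class, its method names, and the 30-second timeout are assumptions for illustration, and time is passed in explicitly to keep the example deterministic.

```python
class ToyQueue:
    """In-memory queue with a visibility timeout (times are in seconds)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._bodies = {}            # message id -> body
        self._invisible_until = {}   # message id -> time it becomes visible
        self._next_id = 0

    def send(self, body):
        message_id = self._next_id
        self._next_id += 1
        self._bodies[message_id] = body
        self._invisible_until[message_id] = 0.0
        return message_id

    def receive(self, now):
        """Return (id, body) of the first visible message, or None."""
        for message_id, body in self._bodies.items():
            if self._invisible_until[message_id] <= now:
                # Hide the message from other consumers for a while.
                self._invisible_until[message_id] = now + self.visibility_timeout
                return message_id, body
        return None

    def delete(self, message_id):
        """Called by a worker once it has finished processing."""
        self._bodies.pop(message_id, None)
        self._invisible_until.pop(message_id, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("resize image 42")

print(queue.receive(now=0))    # worker 1 pulls the message
print(queue.receive(now=10))   # prints None: hidden while worker 1 holds it
print(queue.receive(now=30))   # worker 1 crashed; the message is visible again
```

A worker that finishes successfully calls `delete`, so the message is never handed out again; a worker that crashes simply lets the timeout expire.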

AMAZON SIMPLE STORAGE SERVICE (S3)
Permanent storage service
Highly durable and fault tolerant
Stores objects redundantly on multiple devices across multiple facilities in a region
URL-based access to storage for web services (similar to SQS)
Provides versioning

AMAZON SIMPLEDB
Attribute-decorated data storage
Objects retrieved via attributes (metadata) assigned when the object was created
Can be used with or in place of MySQL or MSSQL
Redundant storage
Accessible via URLs with various APIs (JavaScript, etc.)
Download Amazon’s ScratchPad to begin learning how to use it

AMAZON RELATIONAL DATABASE SERVICE
Provides relational database services, such as MySQL, PostgreSQL, MS-SQL, Oracle, etc.
Snapshots similar to EBS
Multi-AZ ready: a synchronous replica of the DB instance is maintained in a different Availability Zone

CONCLUSION
Redundancy with load balancing and routing is key to fault tolerance
Automatic and manual scaling/management via a myriad of methods is available to developers
Amazon provides all levels of cloud computing (IaaS, PaaS, SaaS)
Geography matters!

ANY QUESTIONS?
SlideShow References
Barr, J., Narin, A., & Varia, J. (2011). Building Fault-Tolerant Applications on AWS. Amazon Web Services.
Christensen, B. (2012). Fault Tolerance in a High Volume, Distributed System. The Netflix Tech Blog.
Mitovich, M. W. (2014, February 21). Report: 2% of Netflix Users Binged all of House of Cards Season 2 ASAP -- Are you Done Yet? Retrieved from of-cards-season-2-binge-watching/
GitHub Repo
