Survive In Cloud

Transcription

Survive in Cloud: The Zen of High Availability at Massive Scale in Cloud. chao.cai@mobvista.com

Mobvista: No.1 in China; 200 countries/regions; top 10 worldwide; 950M DMP DAU; 320M Mintegral SDK DAU; 60B daily ad requests

All in Cloud (architecture diagram): Publisher, Advertiser, SDK, API, RTB, Tracking Service, Volume Processing Service, Offer management, Online DMP, Big Data & ML, Metrics & Alarm, Manual; built on EMR, S3, Kinesis, Spot Fleet, Auto Scaling, ElastiCache, CloudWatch, Lambda function, RDS instances, SQS, ES, DynamoDB, Redshift*

Cloud Computing. Cloud characteristics: on-demand, quick scaling, rapid elasticity, pay per use, uncertain downtime. Service goals: low cost, high reliability, high availability.

Fault Oriented

Once you accept that failures will happen, you have the ability to design your system’s reaction to specific failures.

Isolated Design: Micro Kernel with extension points and plug-ins

Isolated Deployment: Ordering Service, Cart Service

Reused vs. Isolated: Critical Data Collector, Log Data Collector, Data Transform Service. Reused logic structure vs. isolated physical structure.

Redundancy

Redundancy: Load Balancer, Online Service; Load Balancer, Standby Service. Online Redundancy.

Common Failure Modes

Propagated Failure: QPS 1500, Load Balancer, Max QPS 1000

Rate Limit
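The "QPS 1500 against Max QPS 1000" situation can be contained by rejecting the excess at the edge. The following is a minimal token-bucket sketch, not the speaker's implementation; the class name `TokenBucket`, its parameters, and the injected clock are assumptions made for illustration.

```java
import java.util.function.LongSupplier;

/** Token-bucket rate limiter: admits about ratePerSecond requests per second on average. */
final class TokenBucket {
    private final double ratePerSecond;   // refill rate, e.g. 1000 for "Max QPS 1000"
    private final double capacity;        // largest burst admitted at once
    private final LongSupplier nanoClock; // injected (hypothetical) clock so tests need no sleeping
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double ratePerSecond, double capacity, LongSupplier nanoClock) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.nanoClock = nanoClock;
        this.tokens = capacity;
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected right away. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        double elapsedSeconds = (now - lastRefillNanos) / 1e9;
        // Refill in proportion to elapsed time, never beyond the burst capacity.
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production the clock would typically be `System::nanoTime`, and a rejected request would be answered immediately rather than queued.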

Cascading Failure: Service B, Service E

Circuit Breaker

Circuit Breaker: Service D, Service B, Service E
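A circuit breaker stops a failing dependency from dragging its callers down with it. The sketch below is one simple way to write it, assuming a consecutive-failure threshold and a fixed open period; `CircuitBreaker`, `failureThreshold`, `openMillis`, and the injected clock are illustrative names, not taken from the talk.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal circuit breaker: opens after N consecutive failures, retries after a cool-down. */
final class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;    // consecutive failures that trip the breaker
    private final long openMillis;         // how long to stay open before allowing a trial call
    private final LongSupplier clockMillis; // injected (hypothetical) clock
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clockMillis = clockMillis;
    }

    /** Current state; an OPEN breaker becomes HALF_OPEN once the cool-down has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clockMillis.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs remoteCall unless the breaker is open; falls back on rejection or failure. */
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // fail fast without touching the dependency
        }
        try {
            T result = remoteCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed trial call in HALF_OPEN re-opens the breaker immediately.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clockMillis.getAsLong();
        }
    }
}
```

This simplification lets several concurrent trial calls through while half-open; a stricter version would admit only one.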

Slow Response: A quick rejection is better than a slow response. Pooled resources are exhausted!

No Unlimited Waiting: Any blocking operation needs a time limit!
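One way to put a time limit on a blocking call in Java is `Future.get(timeout, unit)`. The helper below is a sketch under that assumption; `BoundedCall` and `callWithTimeout` are hypothetical names, and returning a fallback value is only one of several reasonable reactions to a timeout.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedCall {
    /** Runs task on pool and waits at most timeoutMillis; returns fallback on timeout or failure. */
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> task,
                                 long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker so the pooled thread is released instead of leaking.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        }
    }
}
```

Cancelling with `true` only helps if the task responds to interruption; socket reads, for example, usually need their own read timeout as well.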

Recovery Oriented

“A priori prediction of all failure modes is not possible.”

Health Check: zombie process, pooled resources exhausted, deadlock
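A process can be alive yet unable to do any work, for example when its worker pool is exhausted or deadlocked. A health check catches this only if it exercises the same resources real requests use. The sketch below assumes the service does its work on an `ExecutorService`; `DeepHealthCheck` and `isHealthy` are illustrative names.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class DeepHealthCheck {
    /**
     * Submits a no-op probe to the worker pool that serves real requests.
     * If the probe does not finish within the deadline, the pool is treated as
     * exhausted or stuck, even though the process itself is still running.
     */
    static boolean isHealthy(ExecutorService workerPool, long deadlineMillis) {
        Future<?> probe;
        try {
            probe = workerPool.submit(() -> { });
        } catch (RejectedExecutionException e) {
            return false; // pool is shut down or its queue is full
        }
        try {
            probe.get(deadlineMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            probe.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            probe.cancel(true);
            return false;
        }
    }
}
```

An orchestrator or load balancer could call such a check periodically and restart or remove the instance when it returns false.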

Recoverable: say “NO” to monolithic systems; stateless; survive when dependent services crash; quick restart

Let it Crash! try { } catch (Throwable t) { }

Negotiate With Client. Server: “I am busy, please, slow down.” Client: “Get back to me, after one minute.”
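The exchange can be sketched as a server that answers "busy" with a retry hint and a client that honors it. This is an in-memory illustration only; `Reply`, `BusyAwareServer`, `PoliteClient`, and the in-flight limit are assumptions. Over HTTP, the same idea is commonly expressed with a 429 or 503 status and a `Retry-After` header.

```java
import java.util.function.LongSupplier;

/** Server's answer: either accepted, or "busy, come back after retryAfterMillis". */
final class Reply {
    final boolean accepted;
    final long retryAfterMillis; // 0 when accepted

    Reply(boolean accepted, long retryAfterMillis) {
        this.accepted = accepted;
        this.retryAfterMillis = retryAfterMillis;
    }
}

/** Hypothetical server that admits at most maxInFlight concurrent requests. */
final class BusyAwareServer {
    private final int maxInFlight;
    private final long retryAfterMillis;
    private int inFlight = 0;
    private int attempts = 0; // how many times clients contacted the server

    BusyAwareServer(int maxInFlight, long retryAfterMillis) {
        this.maxInFlight = maxInFlight;
        this.retryAfterMillis = retryAfterMillis;
    }

    synchronized Reply tryBegin() {
        attempts++;
        if (inFlight >= maxInFlight) {
            return new Reply(false, retryAfterMillis); // "I am busy, please slow down"
        }
        inFlight++;
        return new Reply(true, 0);
    }

    synchronized void finish() { inFlight--; }

    synchronized int attempts() { return attempts; }
}

/** Client that stays away from the server until the suggested time has passed. */
final class PoliteClient {
    private final BusyAwareServer server;
    private final LongSupplier clockMillis; // injected (hypothetical) clock
    private long notBeforeMillis = 0;

    PoliteClient(BusyAwareServer server, LongSupplier clockMillis) {
        this.server = server;
        this.clockMillis = clockMillis;
    }

    /** Returns true if the request was accepted; false if busy or still backing off. */
    boolean send() {
        long now = clockMillis.getAsLong();
        if (now < notBeforeMillis) {
            return false; // still backing off: the server is not contacted at all
        }
        Reply reply = server.tryBegin();
        if (!reply.accepted) {
            notBeforeMillis = now + reply.retryAfterMillis;
            return false;
        }
        return true;
    }
}
```

The point of the negotiation is that a busy server sheds load cheaply, and cooperative clients stop adding to the overload while it recovers.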

Chaos Engineering

“If something hurts, do it more often!”

Chaos Engineering: Chaos under control. You learn how to fix the things that often break. You don’t learn how to fix the things that rarely break. Terminate host, inject latency, inject failure.

Chaos Engineering (flowchart): Start, set expected SLA, inject failures, measure services, meet SLA? Improve system. End.
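The loop of injecting failures, measuring, and improving can be tried on a small scale in code. The sketch below injects only failures (not latency or host termination), and uses a retry wrapper as one example of an "improve system" step; `ChaosExperiment` and all method names are hypothetical.

```java
import java.util.Random;
import java.util.function.Supplier;

final class ChaosExperiment {
    /** Wraps a call so that roughly failureRate of the invocations throw an injected error. */
    static <T> Supplier<T> injectFailures(Supplier<T> call, double failureRate, Random random) {
        return () -> {
            if (random.nextDouble() < failureRate) {
                throw new IllegalStateException("injected failure");
            }
            return call.get();
        };
    }

    /** Invokes the call `requests` times and returns the observed success ratio. */
    static <T> double measureSuccessRatio(Supplier<T> call, int requests) {
        int ok = 0;
        for (int i = 0; i < requests; i++) {
            try {
                call.get();
                ok++;
            } catch (RuntimeException e) {
                // counted as a failed request
            }
        }
        return (double) ok / requests;
    }

    /** One possible improvement: retry up to maxAttempts before giving up. */
    static <T> Supplier<T> withRetry(Supplier<T> call, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        return () -> {
            RuntimeException last = null;
            for (int i = 0; i < maxAttempts; i++) {
                try {
                    return call.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last; // every attempt failed
        };
    }
}
```

A run would compare `measureSuccessRatio` against the expected SLA, first for the bare call with failures injected and then for the improved call, and repeat until the target is met.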

Chaos Engineering Principles: Build a Hypothesis around Steady State Behavior; Vary Real-world Events; Run Experiments in Production; Automate Experiments to Run Continuously; Minimize Blast Radius. http://principlesofchaos.org

Higher Resilience, Lower Cost

Cost vs. Scale

Spot Instance

Spot Instance. Fault and Recovery Oriented Architecture: microservice, stateless, quick restart, fault tolerance, chaos engineering. Auto Scaling, Spot Fleet, Reserved Instance.

Multi-Clouds Ecosystem

Multi-Clouds Foundation. Mobvista AI Platform: Big Data Platform, Machine Learning Platform. Mobvista Cloud Solution: High Reliability, DevOps, Cost Optimization. Mobvista Cloud Platform: CI/CD Pipeline, Spot Instance Mgr, Auto Scaling, Smart Load Balance, Logging, Monitoring, Alarm. Cloud Connection: AWS API, AWS CLI, Ali API, Ali CLI. Public Cloud Platform: AWS Cloud, Ali Cloud.

Service Decorator: https://github.com/easierway/service_decorators/blob/master/README.md
