Transcription
Survive in Cloud: The Zen of High Availability at Massive Scale in Cloud. chao.cai@mobvista.com
Mobvista: No. 1 China, 200 countries/regions, Top 10 worldwide, 950M DMP's DAU, 320M Mintegral SDK DAU, 60B daily ads requests
All in Cloud (architecture diagram): Publisher, Advertiser, Big Data & ML, Tracking Service, Volume Processing Service, SDK, EMR, S3, Kinesis, Spot Fleet, ElastiCache, Auto Scaling, Metrics & Alarm, CloudWatch, Lambda function, Offer management, RDS instances, SQS instances, Online DMP, Manual, ES, DynamoDB, API, Redshift*, RTB
Cloud Computing. Cloud characteristics: on-demand, quick scaling, rapid elasticity, pay per use, uncertain downtime. Service goals: low cost, highly reliable, highly available.
Fault Oriented
Once you accept that failures will happen, you have the ability to design your system’s reaction to specific failures.
Isolated Design (diagram): a Micro Kernel surrounded by Extension Points, each hosting plug-ins
Isolated Deployment: Ordering Service, Cart Service
Reused vs. Isolated: Critical Data Collector, Log Data Collector, Data Transform Service. Reused logic structure vs. isolated physical structure.
Redundancy
Redundancy: Load Balancer, Online Service; Load Balancer, Standby Service. Online redundancy.
Common Failure Modes
Propagated Failure: QPS 1500, Load Balancer, Max QPS 1000
Rate Limit
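The slide names rate limiting as the answer to the overload case above (1500 QPS arriving at a service that can handle 1000). A token bucket is one common way to do this; the sketch below is an illustration only, not the speaker's implementation. The class name `TokenBucket` and the injectable clock are assumptions made so the example can be tested without real time passing.

```java
import java.util.function.LongSupplier;

/** Token bucket sketch: refills at ratePerSecond and holds at most capacity tokens. */
class TokenBucket {
    private final double capacity;
    private final double ratePerSecond;
    private final LongSupplier nanoClock; // hypothetical injectable clock (e.g. System::nanoTime)
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double ratePerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.ratePerSecond = ratePerSecond;
        this.nanoClock = nanoClock;
        this.tokens = capacity; // start full
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected right away. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        // Add the tokens earned since the last call, never exceeding capacity.
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With `new TokenBucket(1000, 1000, System::nanoTime)` in front of the service, a burst of 1500 requests in the same instant would see 1000 accepted and the rest rejected immediately, so the excess load is dropped at the edge instead of propagating.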
Cascading Failure (diagram): Service B, Service E
Circuit Breaker
Circuit Breaker (diagram): Service D, Service B, Service E
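A circuit breaker stops calling a failing dependency for a while, so that one broken service does not drag its callers down with it. The following is a minimal sketch of the usual closed / open / half-open state machine. The class name, the threshold and cool-down parameters, and the injectable clock are all assumptions for illustration; the deck itself does not show an implementation.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Circuit breaker sketch: opens after N consecutive failures, retries after a cool-down. */
class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clockMillis; // hypothetical injectable clock
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    CircuitBreaker(int failureThreshold, long openMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clockMillis = clockMillis;
    }

    synchronized State state() { return state; }

    /** Runs action unless the circuit is open; on rejection or failure returns the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // fail fast, the dependency is not called at all
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        if (state != State.OPEN) return true;
        if (clockMillis.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // cool-down elapsed: let trial traffic through
            return true;
        }
        return false;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clockMillis.getAsLong();
        }
    }
}
```

This simple version may let several concurrent requests through while half-open; a production breaker would usually limit that to a single probe.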
Slow Response: A quick rejection is better than a slow response. Pooled resources are exhausted!
No Unlimited Waiting: Any blocking operation needs a time limit!
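To make "a quick rejection is better than a slow response" concrete, here is a small sketch that caps concurrent work with a semaphore and rejects immediately when no permit is free, rather than queueing callers until pooled resources run out. `FastRejectPool` and its parameters are hypothetical names chosen for this example.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Sketch: bound concurrency and reject at once instead of letting callers pile up. */
class FastRejectPool {
    private final Semaphore permits;

    FastRejectPool(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs task if a permit is free right now; otherwise returns rejectedValue without waiting. */
    <T> T execute(Supplier<T> task, T rejectedValue) {
        if (!permits.tryAcquire()) { // non-blocking: never waits for a permit
            return rejectedValue;
        }
        try {
            return task.get();
        } finally {
            permits.release();
        }
    }
}
```

The caller gets an answer in bounded time either way, and can turn the rejection into an error response, a cached value, or a retry hint.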
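One way to put a time limit on a blocking operation in Java is `Future.get` with a timeout. The helper below is a sketch under stated assumptions: `TimeLimited.callWithTimeout` is a made-up name, and creating an executor per call is done only to keep the example self-contained (a real service would share one).

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch: run a blocking task with an upper bound on how long the caller waits. */
class TimeLimited {
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the task that is still running
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself failed
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The same rule applies to lock acquisition, connection pools and network reads: prefer the variant of the call that takes a timeout.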
Recovery Oriented
“A priori prediction of all failure modes is not possible.”
Health Check: zombie process, pooled resources exhausted, deadlock
Recoverable: Say “NO” to monolithic systems; stateless; survive when dependent services crash; quick restart
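Two of the conditions on this slide can be probed from inside a JVM process. The sketch below uses the standard `ThreadMXBean.findDeadlockedThreads()` call for deadlocks and a simple saturation test for a thread pool. The class name `HealthCheck` and the definition of "exhausted" (all threads busy and the queue full) are assumptions for illustration; zombie processes need an external check and are not covered here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.ThreadPoolExecutor;

/** Sketch of an in-process health probe for deadlocks and pool exhaustion. */
class HealthCheck {
    /** True if the JVM reports threads deadlocked on monitors or ownable synchronizers. */
    static boolean hasDeadlock() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads(); // null when there is no deadlock
        return deadlocked != null && deadlocked.length > 0;
    }

    /** Assumed definition: every worker is busy and the work queue has no room left. */
    static boolean poolExhausted(ThreadPoolExecutor pool) {
        return pool.getActiveCount() >= pool.getMaximumPoolSize()
                && pool.getQueue().remainingCapacity() == 0;
    }

    /** What a health endpoint could report to the load balancer or orchestrator. */
    static boolean isHealthy(ThreadPoolExecutor pool) {
        return !hasDeadlock() && !poolExhausted(pool);
    }
}
```

A process that fails this check can be taken out of rotation and restarted, which is why the next slide asks for stateless services with quick restart.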
Let it Crash! try { } catch (Throwable t) { }
Negotiate With Client: Server: “I am busy, please slow down.” Client: “Get back to me after one minute.”
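The empty `catch (Throwable t)` on this slide swallows every failure, including ones that leave the process in a corrupt state. A "let it crash" alternative is to let the worker die and have something outside it restart it with fresh state. The sketch below shows that idea in-process; `Supervisor.runWithRestarts` is a hypothetical helper, and in the cloud setting of this talk the restarting role would more likely be played by the process manager or Auto Scaling.

```java
import java.util.function.Supplier;

/** Sketch: do not swallow failures inside the worker; restart it from a clean state instead. */
class Supervisor {
    /**
     * Runs workers created by workerFactory until one finishes cleanly.
     * Returns how many restarts were needed; rethrows once maxRestarts is used up.
     */
    static int runWithRestarts(Supplier<Runnable> workerFactory, int maxRestarts) {
        int restarts = 0;
        while (true) {
            Runnable worker = workerFactory.get(); // fresh state after every crash
            try {
                worker.run();
                return restarts;
            } catch (RuntimeException crash) {
                // The crash is visible here (log it, count it), not hidden in an empty catch.
                // Errors such as OutOfMemoryError are deliberately not caught: the process dies.
                if (restarts == maxRestarts) {
                    throw crash;
                }
                restarts++;
            }
        }
    }
}
```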
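One reading of this exchange is that a busy server replies with a suggested wait and the client honors it instead of retrying at once. In HTTP this is commonly expressed with a 429 or 503 status and a `Retry-After` header. The sketch below models the protocol without any network code; `Reply`, `PoliteClient` and the injectable `sleeper` are assumed names for illustration.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** A server reply: either a result, or "busy" with a suggested wait before retrying. */
final class Reply {
    final boolean busy;
    final long retryAfterMillis;
    final String body;

    private Reply(boolean busy, long retryAfterMillis, String body) {
        this.busy = busy;
        this.retryAfterMillis = retryAfterMillis;
        this.body = body;
    }

    static Reply ok(String body) { return new Reply(false, 0, body); }

    static Reply busy(long retryAfterMillis) { return new Reply(true, retryAfterMillis, null); }
}

/** Sketch of a client that backs off for as long as the server asked. */
class PoliteClient {
    /** Returns the response body, or null if the server was still busy after maxAttempts. */
    static String call(Supplier<Reply> server, int maxAttempts, LongConsumer sleeper) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Reply reply = server.get();
            if (!reply.busy) {
                return reply.body;
            }
            if (attempt < maxAttempts) {
                sleeper.accept(reply.retryAfterMillis); // wait as instructed by the server
            }
        }
        return null;
    }
}
```

In real use the `sleeper` would block the calling thread or schedule a delayed retry; it is injected here only so the behavior can be checked without waiting.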
Chaos Engineering
“If something hurts, do it more often!”
Chaos Engineering: Chaos under control. You learn how to fix the things that often break. You don’t learn how to fix the things that rarely break. Terminate host, inject latency, inject failure.
Chaos Engineering (flowchart): Start, set expected SLA, inject failures, measure services, meet SLA? If not, improve system and repeat; otherwise End.
Chaos Engineering Principles: Build a hypothesis around steady state behavior; vary real-world events; run experiments in production; automate experiments to run continuously; minimize blast radius. http://principlesofchaos.org
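The loop on this slide (set an SLA, inject failures, measure, improve) can be tried at small scale in code. The sketch below wraps a call with randomly injected failures, measures the success rate, and compares a bare call with one hardened by retries. All names and the numbers used in the usage note (30% failure rate, 99% SLA, 5 attempts) are assumptions for illustration, not figures from the talk.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Sketch of a tiny chaos experiment: inject failures, measure, compare against an SLA. */
class ChaosExperiment {
    /** Wraps call so that it throws with probability failureRate. */
    static <T> Supplier<T> withInjectedFailures(Supplier<T> call, double failureRate, Random random) {
        return () -> {
            if (random.nextDouble() < failureRate) {
                throw new IllegalStateException("injected failure");
            }
            return call.get();
        };
    }

    /** One possible "improve system" step: retry up to maxAttempts times (maxAttempts >= 1). */
    static <T> Supplier<T> withRetry(Supplier<T> call, int maxAttempts) {
        return () -> {
            RuntimeException last = null;
            for (int i = 0; i < maxAttempts; i++) {
                try {
                    return call.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    /** Fraction of requests that completed without an exception. */
    static double measureSuccessRate(Supplier<?> service, int requests) {
        int succeeded = 0;
        for (int i = 0; i < requests; i++) {
            try {
                service.get();
                succeeded++;
            } catch (RuntimeException e) {
                // counted as a failed request
            }
        }
        return (double) succeeded / requests;
    }
}
```

With a dependency that fails about 30% of the time, the bare call would land near a 70% success rate and miss a 99% SLA, while the retried version should clear it, since five independent attempts all fail with probability of roughly 0.3^5, about 0.24%.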
Higher Resilience, Lower Cost
Cost vs. Scale (chart)
Spot Instance
Spot Instance: Fault and Recovery Oriented Architecture: microservice, Auto Scaling, Spot Fleet, stateless, quick restart, fault tolerance, Reserved Instance, chaos engineering
Multi-Clouds Ecosystem
Multi-Clouds Foundation: Mobvista AI Platform (Big Data Platform, Machine Learning Platform); Mobvista Cloud Solution (High Reliability, DevOps, Cost Optimization); Mobvista Cloud Platform (CI/CD Pipeline, Spot Instance Mgr, Auto Scaling, Smart Load Balance, Logging, Monitoring, Alarm); Cloud Connection (AWS API, AWS CLI, Ali API, Ali CLI); Public Cloud Platform (AWS Cloud, Ali Cloud)
Service Decorator: https://github.com/easierway/service_decorators/blob/master/README.md