Next Generation Of DevOps AIOps In Practice @Baidu - USENIX

Transcription

Next Generation of DevOps
AIOps in Practice @Baidu
Xianping Qu & Jingjing Ha

About Baidu

Agenda
History of Baidu SRE team
Next generation of DevOps – AIOps
Best practice based on AIOps
Future

Tools (2007-2009)
Consulting
Problems
  – Human labor
[Diagram: Dev, QA, OP]

Systems (2009-2012)
Building operation systems
  – Service management system
  – Monitoring system
  – Traffic scheduling system
  – Naming service
  – Deployment system
Problems
  – Human labor (GUI, configure)
[Diagram: Dev, QA, OP, Operation System]

Platforms (2012-2014)
Building operation platforms
  – API
  – Configurable
  – Executable
  – Operation Service
Problems
  – Reusability
  – Scalability
[Diagram: Dev, QA, OP, Operation Platform]

Standardization (2014-)
Building operation standards
  – Unified language
  – Unified method
  – Unified solution
Problems
  – Need a brain
[Diagram: Services, IDCs, Clusters, Servers]

AIOps (2014-)
Intelligent Operation Platforms
  – Development framework
  – Big data
  – Algorithm
Data mining, machine learning

Development framework
[Diagram: User Code 1, User Code 2, User Code; User Code Management; Operation Abstraction Layer; Develop Kit; Interface; Develop, ld, Test, Debug, Prof; Cloud/PaaS; Scheduler]

Operation Knowledge Database
[Diagram: Metric data, Meta, Event data; service, IDC, instance; cpu, mem, io, disk, network, bandwidth, latency, rtt, error, anomaly, change, root cause, remediation; API & View; mapping; Data production process: raw data, cleaning, calculating, controlling, mining; service management model; Data source: structural data; metaDB, TSDB, eventDB; Management platforms, Monitoring platforms, Operation platforms; resulting data; feedback; authority, quota]

Solution
[Diagram, top to bottom]
Solution Development
Ops Algorithm Development: Anomaly detection, Traffic scheduling, Root cause analysis, Trend forecasting, Other data mining & machine learning algorithms
Ops Platform Development: Deployment Management, Incident Management
Operation Development Framework (SDK, RE)
Operation Knowledge Database
General Components and Tools (transmission, storage, scheduling)

Best practice based on AIOps
Incident Management
  – Single cluster stop-loss by traffic shifting
Deploy Management
  – Unattended deployment with automated checker
Consulting
  – ChatBots do consultation

When will a failure occur?
Infrastructure issue
Program defects
Change exception
Dependent service unavailable

How to stop loss?
Limit failure to one cluster
  – Deployment isolation
  – Dependency decoupling
  – Reduce global risk
Capacity redundancy
  – Availability and cost trade-off
  – N+M redundancy
  – Service degradation
When a single cluster fails, perform traffic shifting

Two layer traffic shifting
  – Shift traffic between user set and the edge node
  – Shift traffic between front end and back end
BGW: Baidu Gate Way, layer-4 load balancer
BFE: Baidu Front End, layer-7 load balancer
[Diagram: Data Center 1, Data Center 2 and Data Center 3, each with BGW (L4 LB), BFE cluster (L7 LB), Web service cluster and Dependent service cluster]

Two layer traffic shifting @Baidu
Shift traffic between user set and edge node
  – 10 minutes to shift 80% of traffic to the healthy edge node, because of DNS caching on the client side and the ISP side
Shift traffic between front end and back end
  – 10 seconds to shift 100% of traffic to the healthy back end by changing BFE's routing configuration

Shift traffic between front end and back end
Concerns:
  – Service capacity
  – Intranet bandwidth
Scenarios:
  – Web service cluster
  – Dependent service cluster
  – Internal network switch
[Diagram: user sets reach Data Centers 1-3 over the Internet; each data center has BGW (L4 LB), BFE cluster (L7 LB), Web service cluster and Dependent service cluster]
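The capacity concern on this slide can be illustrated with a small sketch. It is not Baidu's traffic shifting algorithm; the function name `shift_traffic`, the cluster names and the proportional-to-spare-capacity rule are all assumptions made for illustration. The idea is only that load from a failed back-end cluster should be moved to healthy clusters without pushing any of them past its service capacity.

```python
from typing import Dict


def shift_traffic(load: Dict[str, float],
                  capacity: Dict[str, float],
                  failed: str) -> Dict[str, float]:
    """Move the failed cluster's load onto healthy clusters.

    Hypothetical rule: each healthy cluster receives a share of the
    moved load proportional to its spare capacity, so no cluster is
    pushed past its capacity.
    """
    healthy = [c for c in load if c != failed]
    # Spare capacity per healthy cluster, never negative.
    spare = {c: max(capacity[c] - load[c], 0.0) for c in healthy}
    total_spare = sum(spare.values())
    to_move = load[failed]

    if to_move > total_spare:
        # Shifting would overload the healthy clusters; a real system
        # might degrade the service instead of shifting everything.
        raise ValueError("not enough spare capacity to absorb the failed cluster")

    new_load = {failed: 0.0}
    for c in healthy:
        share = spare[c] / total_spare if total_spare else 0.0
        new_load[c] = load[c] + to_move * share
    return new_load


if __name__ == "__main__":
    # Illustrative numbers only (requests per second).
    load = {"dc1": 300.0, "dc2": 400.0, "dc3": 200.0}
    capacity = {"dc1": 500.0, "dc2": 500.0, "dc3": 500.0}
    print(shift_traffic(load, capacity, failed="dc1"))
```

Raising an error when spare capacity is insufficient is one possible design choice; it makes the capacity check explicit instead of silently overloading a healthy cluster.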

Shift traffic between user set and the edge node
Concerns:
  – Bandwidth
  – BGW/BFE capacity
  – Delay
Scenarios:
  – BGW & BFE
  – External network switch
[Diagram: user sets reach Data Centers 1-3 over the Internet; each data center has BGW (L4 LB), BFE cluster (L7 LB), Web service cluster and Dependent service cluster]

Shift traffic between user set and the back end
Concerns:
  – Bandwidth
  – BGW/BFE capacity
  – Delay
  – Service capacity
  – Intranet bandwidth
Scenarios:
  – Entire data center failure
[Diagram: user sets reach Data Centers 1-3 over the Internet; each data center has BGW (L4 LB), BFE cluster (L7 LB), Web service cluster and Dependent service cluster]

Single cluster stop-loss before AIOps
Perception: Traditional monitoring
  – Handcrafted anomaly detection
  – Lots of false negatives and false positives
Judgment: Manual decision making
  – Depends on personal experience
  – Partial information
  – Wrong or slow decision making
Execution: Scattered scripts
  – No real-time feedback, which leads to more serious failures
  – Poor code quality
  – Low availability
  – Unreliable at critical moments

Single cluster stop-loss after AIOps
Perception: Intelligent monitoring
  – Intelligent anomaly detection
  – Abnormal events stored in the Operation Knowledge Database
  – High precision and recall
Judgment: Automated decision making
  – Relies on the Algorithm Platform
  – Global information
  – Accurate and quick decision making
Execution: Standard framework
  – Development framework
  – Deployment framework
  – Feedback control
  – High development efficiency
  – High availability
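The perception, judgment and execution stages can be sketched as a small loop. The slides do not say which anomaly detection algorithm is used, so this sketch swaps in a plain z-score test against recent history as a stand-in. The function names, the threshold of 3.0 and the rule "shift only when exactly one cluster is abnormal" are assumptions for illustration, not the speakers' design.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List, Optional


def detect_anomaly(history: List[float], current: float,
                   threshold: float = 3.0) -> bool:
    """Perception: flag a value far from its recent history (z-score stand-in)."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


def judge(history: Dict[str, List[float]],
          current: Dict[str, float]) -> Optional[str]:
    """Judgment: use all clusters' data and pick the single abnormal one.

    Assumed rule: if zero or several clusters look abnormal, this is not
    a single-cluster failure, so no cluster is selected.
    """
    abnormal = [c for c in current if detect_anomaly(history[c], current[c])]
    return abnormal[0] if len(abnormal) == 1 else None


def stop_loss(history: Dict[str, List[float]],
              current: Dict[str, float],
              execute: Callable[[str], None]) -> Optional[str]:
    """Execution: call the traffic shifting action for the failed cluster."""
    failed = judge(history, current)
    if failed is not None:
        execute(failed)
    return failed


if __name__ == "__main__":
    # Illustrative error rates per cluster.
    history = {"dc1": [0.010, 0.012, 0.011, 0.009],
               "dc2": [0.010, 0.011, 0.009, 0.010]}
    current = {"dc1": 0.200, "dc2": 0.010}
    stop_loss(history, current, lambda c: print(f"shift traffic away from {c}"))
```

Passing the action in as `execute` keeps the judgment logic separate from the mechanism that changes routing, which mirrors the three-stage split on the slide.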

The architecture of single cluster
[Diagram: Detection Algorithm; VIP Health Control, Service Health Control; Operation Knowledge Database: Abnormal event DB, Metric DB, Capacity DB; Judgment: Traffic shifting Algorithm; Execution: Internal Traffic Load balancer, DNS Configuration Management, BFE B]

Unattended deployment with automated checker
Pipeline: begin, pre-check, confirm, deploy, check, confirm
Check process optimization, from person to robot:
  – Manual check of multiple metric dashboards
  – Manual check of a single anomaly event dashboard (intelligent anomaly detection on the Knowledge Database)
  – Automated API check and notify
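The last step of that optimization, an automated API check followed by a notification, might look roughly like the sketch below. The stage names and the `deploy`, `anomaly_events` and `notify` callables are hypothetical placeholders for whatever deployment system, anomaly event API and messaging channel are in use; none of them come from the slides.

```python
from typing import Callable, List, Sequence


def unattended_deploy(stages: Sequence[str],
                      deploy: Callable[[str], None],
                      anomaly_events: Callable[[str], List[str]],
                      notify: Callable[[str], None]) -> bool:
    """Run pre-check, then deploy stage by stage with an automated check.

    `anomaly_events(step)` stands in for an API that returns the anomaly
    events currently recorded for the service (empty list means healthy).
    Returns True if every stage was deployed without anomaly events.
    """
    # Pre-check: do not start a deployment on an already unhealthy service.
    events = anomaly_events("pre-check")
    if events:
        notify(f"pre-check failed: {', '.join(events)}")
        return False

    for stage in stages:
        deploy(stage)
        # Automated check replaces a person watching dashboards.
        events = anomaly_events(stage)
        if events:
            notify(f"deployment paused at {stage}: {', '.join(events)}")
            return False

    notify("deployment finished with no anomaly events")
    return True


if __name__ == "__main__":
    fake_events = {"pre-check": [], "canary": [], "full": []}
    ok = unattended_deploy(["canary", "full"],
                           deploy=lambda s: print(f"deploying {s}"),
                           anomaly_events=lambda s: fake_events[s],
                           notify=print)
    print(ok)
```

Stopping at the first stage with anomaly events and notifying a person is one conservative choice; an automatic rollback would be another.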

ChatBots do consultation
Change consultation scenario
Key points of building ChatBots
1. Accumulate manually labeled queries
2. Train an NLP model offline to understand the questions
3. Translate natural language questions into structured questions
4. Query the operation knowledge database
5. Display results on the SRE Service desk
Example (queries translated from Chinese):
  – query: "Going live?" → Change query; Module: (empty)
  – query: "Traffic going live?" → Change query; Module: xx; Time: today; Stage: full traffic
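Steps 3 and 4 can be sketched as follows. The slides call for a trained NLP model; this sketch replaces it with simple keyword matching purely to show the shape of the structured question, so it should not be read as the speakers' method. The `ChangeQuery` fields follow the Module, Time and Stage slots in the example, while the module names, the record layout and the function names are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ChangeQuery:
    """Structured form of a change consultation question."""
    module: Optional[str] = None
    time: Optional[str] = None
    stage: Optional[str] = None


# Hypothetical module names; a real system would load them from metadata.
KNOWN_MODULES = ("search", "ads")


def parse(question: str) -> ChangeQuery:
    """Keyword-based stand-in for the NLP model (step 3)."""
    text = question.lower()
    module = next((m for m in KNOWN_MODULES
                   if re.search(rf"\b{re.escape(m)}\b", text)), None)
    time = "today" if "today" in text else None
    stage = "full traffic" if "full traffic" in text else None
    return ChangeQuery(module=module, time=time, stage=stage)


def answer(question: str,
           change_events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Filter change events by the slots that were filled (step 4)."""
    query = parse(question)
    wanted = {k: v for k, v in vars(query).items() if v is not None}
    return [e for e in change_events
            if all(e.get(k) == v for k, v in wanted.items())]


if __name__ == "__main__":
    # Illustrative records standing in for the operation knowledge database.
    events = [
        {"module": "search", "time": "today", "stage": "full traffic"},
        {"module": "ads", "time": "today", "stage": "canary"},
    ]
    print(parse("Is search going live to full traffic today?"))
    print(answer("Is search going live to full traffic today?", events))
```

Slots left unfilled are simply not used as filters, so a vaguer question returns more change events, which matches the first example query on the slide where Module is empty.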

Future
Dynamic resource allocation
Capacity management
Identification of performance problems

quxianping@baidu.com
hajingjing@baidu.com
