A Proactive Cloud Management Architecture For Private Clouds

2013 IEEE Sixth International Conference on Cloud Computing
978-0-7695-5028-2/13 $26.00 © 2013 IEEE. DOI 10.1109/CLOUD.2013.19

Dapeng Dong
Department of Computer Science
University College Cork
Cork, Ireland
d.dong@cs.ucc.ie

John Herbert
Department of Computer Science
University College Cork
Cork, Ireland
j.herbert@cs.ucc.ie

Abstract: Operation management for a private cloud infrastructure faces many challenges including efficient resource allocation, load-balancing, and quick response to real-time workload changes. Traditional manual IT operation management is inadequate for this highly dynamic and complex environment. This work presents a distributed service architecture which is designed to provide an automated, shared, off-site operation management service for private clouds. The service architecture incorporates important concepts such as: Metric Templates for minimising the network overhead for transmission of cloud metrics; a Cloud Snapshot that provides a global view of the current status of the cloud, supporting optimal decision making; and a Calendar-based Data Storage Model to reduce the storage required for cloud metric data and increase analysis performance. A proactive response to cloud events is generated based on statistical analysis of historical metrics and predicted usage. The architecture, functional components and operation management strategies are described. A prototype implementation of the proposed architecture was deployed as a service on the IBM SmartCloud. The effectiveness and usability of the proposed proactive operation management solution have been comprehensively evaluated using a simulated private cloud with dynamic workloads.

Keywords: Architecture, Cloud, Operation Management

I. INTRODUCTION

Cloud computing introduces a new computing paradigm to IT organizations. The cloud deployment of services is maturing at pace. It seems that market momentum makes the widespread adoption of cloud computing inevitable. At the same time, the use of a public cloud poses concerns, such as security, privacy, data confidentiality, infrastructure control, and vendor lock-in (as discussed, for example, in [1] [2] [3]). In this context, private and hybrid clouds become important alternatives for many organisations.

Acquiring a private/hybrid cloud brings IT management responsibilities back to the IT organizations. In particular, cloud operation management is different from traditional IT operation management. The new cloud concepts, such as asynchronous architecture, virtualization, and resource fabric, require IT personnel to gain new knowledge and skills in order to efficiently manage the cloud infrastructure. Most cloud vendors provide private cloud operation management suites [2] [4]; these are essentially a set of tools given to IT personnel to ease operation management processes. Faced with the problem of optimal placement of several hundred Virtual Machines (VMs) and the need to respond to thousands of randomly occurring system events, it is easy to conclude that the reactive management approach is no longer suitable for cloud management. In addition to management complexity, tools for managing cloud infrastructure are often available only to those with large IT budgets. A distributed, multi-tenant operation management service can lower such costs, as well as the costs of operations and facilities.

To better respond to business demands on IT resources, the term Proactive Management has been stressed by many industrial cloud management solution pioneers [5] [6] [7]. Proactive management, in essence, deals with the management life-cycle of information collection, event detection/analysis, and response. Consideration must also be given to aspects such as transmission of metric data to the management service components, metric data storage, anomaly detection and resource management, and appropriate timely event response. These challenges and their solutions characterize the proposed architecture and differentiate this work from others.

A prototype implementation of the proactive operation management service was deployed on the IBM SmartCloud, and a simulated private cloud client was connected to this service. A set of real-world workloads was given to each simulated entity of the simulated private cloud. The important aspects of the architecture were evaluated in terms of communication cost, Cloud Snapshot transmission cost, and the effectiveness of the Calendar-based Data Storage Model. The evaluation demonstrated the effectiveness and usability of the proposed architecture.

The remainder of the paper is organized as follows. Section II presents and discusses the proposed architecture. Section III evaluates the prototype implementation of the architecture. A discussion of related work follows in Section IV, and the final section presents conclusions and directions for further research.

II. ARCHITECTURE OVERVIEW

The proposed architecture (Figure 1) has two high-level components, a Service Delegator and an Operation Management Service.

Figure 1. The Proactive Operation Management Architecture component diagram

A. The Service Delegator

The Service Delegator acts as middleware between a private cloud and the cloud Operation Management Service (OMS). A key design consideration is that the Service Delegator must not provide any publicly accessible point. Network traffic between the Service Delegator and the OMS can be bidirectional, but, to ensure security, a communication session can only be initiated from the Service Delegator. To satisfy this design goal, the components of the Service Delegator need to actively and periodically check with the management services. Each component of the Service Delegator is a self-contained program; they can also be gathered together and provided as a VM image.

Essential for the operation of this architecture are the pre-deployed metric monitors on each VM and hypervisor. Each monitor periodically emits pre-defined metrics to a central point, the Metric Publisher. The metrics sent from metric monitors are often raw data, and usually contain large amounts of redundant and useless information. In order to minimize the impact on the local (private cloud) network of sending metrics to the management service, the Metric Publisher uses the collected metrics to fill in Metric Templates. A Metric Template is essentially a compact data structure which contains a set of ID tags of cloud entities (Servers and VMs); each ID tag is associated with a series of floating point numbers (metrics), and the order of the metrics is known to both the Service Delegator and the OMS. The number of metrics of interest and the order of the metrics are defined by Metric Template meta-data. The Metric Template meta-data also contains other auxiliary information, including the compression scheme and the Metric Template Publishing Interval (MTPI), that keeps the Service Delegator and OMS synchronized. It is the responsibility of the OMS to generate Metric Template meta-data, and implement it through interaction with the Service Delegator. The Metric Template meta-data is also used to control the subscription service level (such as bronze, silver, and gold) by manipulating the number of metrics of interest, the MTPI, etc.

There are four types of Metric Template defined in the prototype implementation:
1) Metric Template for Server Configuration (MTSC), which contains a list of physical servers with configuration information, current status and server ID;
2) Metric Template for VM Configuration (MTVC), which contains a list of VMs with configuration information, current status, VM/server ID, and service ID;
3) Metric Template for VM Utilization (MTVU), which contains a list of VMs with the current utilization status of each VM component, current status, and VM/server ID;
4) Metric Template for Server Utilization (MTSU), which contains a list of servers with I/O related information, such as memory read/write throughput, storage read/write throughput, and server ID.

Within each Metric Template, entities and metrics are separated by selected delimiters. After filling in a Metric Template, the Metric Publisher compresses it; prefixes a message-type tag, a time stamp, and a subscriber ID to the compressed Metric Template; then encapsulates everything into a message using Base64 encoding and sends it to the management service Exchange. The Metric Publisher publishes Metric Templates at a regular time interval, the Metric Template Publishing Interval (MTPI).
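The packaging steps described above (fill, compress, prefix, encode) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the delimiters, the missing-value marker, the "|"-separated header layout and the function names are hypothetical, and zlib's DEFLATE stands in for the ZIP stream compression used by the Java prototype. It does not reproduce the prototype's actual wire format.

```python
import base64
import time
import zlib

# All three markers below are hypothetical choices for this sketch.
MISSING = "?"      # stands for a metric that has not been received yet
ENTITY_SEP = ";"   # separates cloud entities (servers or VMs)
METRIC_SEP = ","   # separates the ID tag and the metrics of one entity


def fill_template(readings, metric_order):
    """Serialise {entity_id: {metric: value}} using the agreed metric order."""
    rows = []
    for entity_id in sorted(readings):
        values = readings[entity_id]
        fields = [entity_id]
        for name in metric_order:
            value = values.get(name)
            fields.append(MISSING if value is None else repr(float(value)))
        rows.append(METRIC_SEP.join(fields))
    return ENTITY_SEP.join(rows)


def pack_message(msg_type, subscriber_id, template, timestamp=None):
    """Compress the template, prefix the header fields, then Base64-encode."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    header = f"{msg_type}|{ts}|{subscriber_id}|".encode("ascii")
    return base64.b64encode(header + zlib.compress(template.encode("utf-8")))


def unpack_message(message):
    """Inverse of pack_message, as a receiving side might apply it."""
    raw = base64.b64decode(message)
    # maxsplit=3 keeps any "|" bytes inside the compressed body intact.
    msg_type, ts, subscriber_id, body = raw.split(b"|", 3)
    template = zlib.decompress(body).decode("utf-8")
    return msg_type.decode(), int(ts), subscriber_id.decode(), template
```

Because the metric order is fixed by shared meta-data, only the ID tag and bare numbers travel on the wire; metric names never need to be sent. In this sketch a VM with no memory reading yet simply carries the marker in that position.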
For the purpose of bandwidth conservation, and because configuration information rarely changes, the MTPI for MTSC and MTVC templates is set to be longer than that for MTVU and MTSU templates. Notice that metric monitors may emit their measurements at different points in time. Therefore, within an MTPI, a Metric Template can be incomplete. For instance, an MTVU may not contain all active VMs in the cloud, and any missing data (e.g., storage utilization data) of a VM listed in the Metric Template is indicated by a special character in the Metric Template. The Request Publisher performs a similar function but deals with customized requests, such as requests for suggestions for a new VM placement, and these customized requests are sent immediately. In this work, the MTPI is a fixed time interval. Ideally it would be dynamically adjusted according to the activity level of the private cloud, but this is primarily limited by the statistical analysis based optimisation engine, and it will be investigated further in future work.

Figure 2. The Proactive Operation Management Architecture communication diagram

The Suggestion Subscriber component actively and periodically checks with the management service provider whether there is any information available. Suggestions should be checked for much more frequently than Metric Templates are published (that is, at an interval much shorter than the MTPI) to avoid missing and/or disordered Suggestions. The component only receives Suggestions. Suggestions are encapsulated in the payload of the subscribed messages in XML (eXtensible Markup Language) format. Listing 1 shows a fragment of a Suggestion for migrating a VM from host A to host B. Different actions are associated with different sets of pre-defined attributes. Furthermore, each action is also associated with a list of reasons which identify the causes of such an action.

In order to achieve automation in the operation management life-cycle, an Action Manager component is provided. It contains a set of Action Templates which are written against RESTful (Representational State Transfer) APIs. Upon receiving a Suggestion, the Action Manager first checks the validity of the Suggestion (using the "reason" field).
If this Suggestion is still valid, the Action Manager will use the information from the Suggestion to fill in a corresponding Action Template and carry out the action in the private cloud. Otherwise, the Suggestion will be ignored.

    <Suggestion>
      <action>migrate</action>
      <entity-id>vm_id</entity-id>
      <source>host_A</source>
      <destination>host_B</destination>
      <reason>src_over_util</reason>
    </Suggestion>

Listing 1. VM migration Suggestion

B. The Operation Management Service

The Operation Management Service (OMS) is provided as a multi-tenant service. The Service Engine is the core of the OMS, and it is supported by a sophisticated Mathematical Analysis Engine.
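Returning briefly to the Service Delegator side, the handling of a Suggestion such as the one in Listing 1 (validity check, then Action Template filling) could look roughly like the following Python sketch. The element names follow Listing 1; the REST path in the Action Template, the shape of `current_state`, and the rule used to validate the `src_over_util` reason are hypothetical and only serve to make the example concrete.

```python
import xml.etree.ElementTree as ET

# Hypothetical Action Templates: one REST call pattern per action name.
ACTION_TEMPLATES = {
    "migrate": "POST /vms/{entity_id}/migrate?from={source}&to={destination}",
}


def parse_suggestion(xml_text):
    """Turn a Suggestion document into a flat dict of its child elements."""
    root = ET.fromstring(xml_text)
    if root.tag != "Suggestion":
        raise ValueError("not a Suggestion document")
    # "entity-id" becomes "entity_id" so it can be used as a format field.
    return {child.tag.replace("-", "_"): (child.text or "").strip()
            for child in root}


def still_valid(suggestion, current_state):
    """Check the stated reason against the present state of the private cloud."""
    if suggestion.get("reason") == "src_over_util":
        utilisation = current_state.get(suggestion["source"], 0.0)
        return utilisation >= current_state["threshold"]
    return False  # unknown reasons are treated as stale in this sketch


def build_action(suggestion, current_state):
    """Fill the matching Action Template, or return None to ignore the Suggestion."""
    template = ACTION_TEMPLATES.get(suggestion.get("action"))
    if template is None or not still_valid(suggestion, current_state):
        return None
    return template.format(**suggestion)
```

Re-checking the reason before acting matters because a Suggestion is computed off-site from slightly older metrics; by the time it is polled, the source host may no longer be over-utilised, in which case the sketch returns None and the Suggestion is dropped.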

1) The Service Engine: The Service Engine receives requests and metrics from subscribers through Exchange (Input). The Exchange (Input) module acts as a common communication interface among subscribers. It is essentially a queuing system which buffers incoming messages. Messages are directly consumed by the Event Monitor. The Event Monitor decodes messages, checks for expired and misordered messages based on the time stamp and subscriber ID, then dispatches decoded messages to the designated event-group queue according to the message type. By default, the Event Monitor defines three groups of events: CM (Cloud Modelling), DM (Data Modelling), and Req (Requests) (Figure 2). Each group is called a Topic, and Topics are sent to topic-exclusive queues accordingly. Behind the topic queues, there are three compulsory modules (Cloud Modelling, Optimiser, and Data Modelling) built into the architecture. They are functionally independent.

Cloud Modeller. The Cloud Modeller builds a cloud model for each subscribed private cloud. In order to make correct decisions on cloud operations, such as consolidation of VMs and resource provisioning, a global view of a subscriber (private cloud) is absolutely necessary. The Cloud Modeller organises cloud objects in a hierarchy. There are four levels (Cloud, Server, VM, and Component) in the hierarchy illustrated in Figure 3. Ideally, a full cloud model is built at the beginning of a service subscription. In a real industrial deployment, private clouds may already be up and running, and it can be hard to get all information about a private cloud at once. For these reasons, a cloud model can be built gradually. In other words, a private cloud needs to be connected to the OMS for a certain period of time to ensure the cloud model is relatively consistent with the actual private cloud. The consistency level is measured by the number of occurrences of server creation processes in the cloud model.

Figure 3. Cloud model hierarchy

The Cloud object is created at the service registration phase. The Server, VM, and Component objects are created upon receiving MTSC and MTVC respectively. If a Server/VM has already been created in the cloud model, the received data is used to update it. Upon receiving MTSU/MTVU, utilization data of servers/VMs is logged into the Utilization Cache (Figure 3). The Utilization Cache is a FIFO (First In First Out) queue. It is used to cache a certain length (a day) of utilization history, which is used by the Optimiser. If a server/VM listed in the MTSU/MTVU does not exist in the current cloud model, it is ignored: because MTSU/MTVU does not contain server/VM configuration information, creating server/VM objects without configuration information is meaningless in the cloud model. This can be remedied by receiving subsequent MTSC/MTVC. If neither MTVC nor MTVU has been received for a VM for a certain length of time, the VM is considered to be in sleep mode, and is eventually removed. The cloud model is used directly by the Optimiser.

Optimiser. The Optimiser is event driven. It is triggered upon receiving requests, MTSUs, or MTVUs. The Optimiser is tightly coupled with the Cloud Modeller. At the beginning of the service subscription, a dedicated Optimiser is assigned to a subscriber (in fact, it is assigned to a cloud model which is specially built for the subscriber). The Optimiser and the Cloud Modeller run in the same program process but in separate threads, each listening on its own topic-exclusive queue. A proactive response to cloud events is generated based on statistical analysis of historical metrics and guided by policies. The historical metrics are the data cached in the Utilization Cache (Figure 3). Various restrictions are defined in the Policy, including VM affinity and thresholds for triggering load balancing events. The generated responses are called Suggestions. Suggestions are formatted as an XML file, Base64 encoded, and sent to the Exchange (Output). The Exchange (Output) routes the Suggestions based on the Subscriber ID (each subscriber has dedicated Suggestion queues).

Data Modeller. The Data Modeller builds resource usage models for services. Data models are stored and organised in a Calendar-based Storage Model (CBSM). In simple terms, the CBSM just provides object storage. Objects (data models) stored in the CBSM are indexed by calendar date so that data models can be associated with calendar events (such as weekends and public holidays). There are two main reasons for storing resource usage data models rather than the original data. The first reason is to reduce the storage required for cloud metric data. The OMS continuously receives cloud metrics from subscribers, and storing this accumulated data has serious cost implications. The Data Modeller builds resource usage models for services on a daily, weekly, monthly, and yearly basis. Data models are in fact program objects (generic Java objects; because there are many choices for modelling data, data models are cast to generic objects and tagged, then stored in the CBSM). The compressed data model objects are much smaller than the compressed original data (discussed in Section III-C). The second reason is to improve the performance of analysis through model reuse. Modelling data is often a CPU-intensive and time-consuming process. Using pre-built data models can significantly improve the performance of the Analyser.

Two points should be noted. 1) A service is identified by the service ID (Figure 3). The service ID only exists in the cloud model. It is assigned to be the same as the VM ID. If a VM is load balanced, the same service ID is shared among the VMs involved. Conversely, the service ID is used to determine whether a VM is load balanced. If a service is load balanced, the resource usage for the service is the sum of the resource usages of the same kind. 2) The source of the original data is the cloud model. The cloud model caches resource utilization data for a day in the Utilization Cache, and when the Utilization Cache is full, it is sent to the Data Modeller to build a daily data model. Rather than sending thousands of Utilization Cache entries individually, the OMS sends the most recent Cloud Snapshot to the Data Modeller. A Cloud Snapshot is simply a serialized cloud model object which contains a snapshot of the current cloud including any cached data. The Utilization Cache data is also temporarily stored for a longer period (a month). After a monthly data model has been built, the raw data is removed permanently (a yearly model can also be built based on the daily model).

There are no resource usage models built for physical servers. The cloud environment is highly dynamic.
Events of VM creation, deletion, migration, load-balancing and resizing occur frequently and randomly on physical servers. In such a dynamic environment, long term utilization patterns and trends of physical servers contribute no explicit insight for improvement of QoS (Quality of Service).

The data modelling process is triggered by the Scheduler as well as the Analysing process.

Analyser. The Analyser has two built-in functions: consolidation of VMs and resource provisioning. It is a consumer for both the Cloud Modeller and Data Modeller. In general, the Analyser analyses the global status of the cloud using the most recent Cloud Snapshot to determine whether VMs are distributed sparsely in the cloud, and calculates optimal solutions for consolidation of VMs using data models which have been built by the Data Modeller. It also uses data models to do resource provisioning.

Figure 4. OMS experiment deployment

2) The Mathematical Analysis Engine: The Mathematical Analysis Engine supplies a set of sophisticated statistical and mathematical functions to the Analyser, Optimiser, and Data Modelling components. Because the consumer components require a wide range of functions across branches of mathematics (such as the Structured Time Series forecasting technique used by the Analyser; sorting algorithms used by the Optimiser; and the Auto Regressive Integrated Moving Average data modelling technique used by the Data Modeller), an extensible and comprehensive mathematical analysis system is needed. The R framework [8] was employed at the heart of the Mathematical Analysis Engine. R is an open source statistical framework intensively used in the field of data analytics.
Its flexible and extensible architecture allows packages (various types of functions) to be installed in a plug-and-play style, which best meets our design requirements.

If the OMS is deployed on a private cloud, the system itself is also a subscriber of its own services. Figure 2 illustrates the proposed architecture, which is: 1) scalable: each topic subscriber (a functional module) can have multiple instances listening to the same topic queue, and tasks can then be distributed over multiple topic subscriber instances which perform the same functions; 2) extensible: as long as new topic definitions are configured at the Event Monitor and topic-exclusive queues are in place, new functional modules can be added at any time, without interfering with other modules; and 3) flexible: introducing and removing any functional module has no effect on the operation of other modules.

III. EVALUATION

A prototype implementation has been deployed on the IBM SmartCloud platform (Infrastructure as a Service, IaaS) (Figure 4). It is a full implementation of the architecture with the essential core functionalities for it to work. Five VM instances were employed for the OMS deployment. All VM instances were configured with two virtual CPUs (2.4GHz), 4GB memory, and 60GB local storage. The Red Hat Enterprise Linux 6.3 (64-bit) operating system and JRE (Java Runtime Environment) version 1.6.0_39 were installed on all VM instances. They were located in the IBM Data Centre, Ehningen, Germany. VMware RabbitMQ 3.0.2 queuing

system was deployed on instance-1 acting as the Exchange server. Both the Analyser and the Data Modeller were deployed on instance-4, but they ran as separate processes. The R framework 2.15.2 was deployed on instance-5 acting as the Mathematical Engine. The Service Delegator components ran on a VM. The VM was configured with a single virtual CPU (2.2GHz), 512MB memory, 10GB local storage, and Ubuntu server 12.10 (64-bit). The private cloud simulator ran on a Windows 7 system with a quad-core CPU (2.2GHz), 8GB memory and 500GB local storage. It simulated 260 servers and 50 to 500 VMs depending on the purpose of the simulation. The MTPI for MTSC/MTVC/MTVU was set to one minute across all experiments. A collection of real-world server workloads was given to VMs during simulation. The complexity of the architecture was fully exercised, and important aspects were evaluated.

A. Service Delegator and OMS communication cost

Figure 5 shows the network bandwidth consumption for Service Delegator and OMS communication (indicated by the dashed line ellipse A in Figure 4) over 30 minutes. In this experiment, only MTSC, MTVC, and MTVU were used. The simulator simulated 260 servers and 300 VMs. Each Metric Template was compressed using the ZIP stream algorithm provided by the standard Java package before being sent to the OMS. The dashed yellow line indicates the bandwidth consumed by sending Metric Templates to the OMS; the cost of transmitting Metric Templates is found to be relatively small. It increases linearly with the number of VMs (Figure 6). The solid red line indicates the bandwidth consumed by receiving Suggestions. The received data is much larger than the sent data. This is mainly driven by the number of Suggestions received, and Suggestions are not compressed in the current implementation. Suggestion compression and encryption will be implemented in future work. Notice that the received data size varies over time. This is because the number of Suggestions received is influenced by the number of abnormal events detected. For instance, if the CPU utilization of host A reaches 90% of its capacity, a VM migration Suggestion (Listing 1) will be sent to the Service Delegator. The dotted blue line indicates the total bandwidth consumption. At the scale of 260 servers and 300 VMs, the average bandwidth consumption is approximately 30KB/sec. If the MTPI is set to be longer (for instance, five minutes), the required network bandwidth will be lowered significantly (to approximately 6KB/sec).

Figure 5. Network bandwidth consumption for Service Delegator and OMS communication over 30 minutes

Figure 6. Comparison of compressed Metric Template size (x-axis: Number of VMs; y-axis: Metric Template Size (KB); series by MT Type: MTSC, MTVC, MTVU)

B. Cloud Snapshot transmission cost

The distributed deployment of the Analyser and Data Modeller modules requires a snapshot of the current status of the cloud, supporting optimal decision making and data modelling. One of the main design concerns was the Cloud Snapshot transmission overhead between the Cloud Modeller module and the Analyser/Data Modeller modules (indicated by the dashed line ellipse B in Figure 4). Figure 7 shows that the serialized, compressed, and encoded Cloud Snapshot size increases linearly and slowly with the number of VMs. The Cloud Snapshot transmission time counter starts at the beginning of the cloud model object serialization process at the Cloud Modeller, and stops at the end of the Cloud Snapshot de-serialization process at the Analyser/Data Modeller. Figure 8 shows the Cloud Snapshot transmission time for 50 to 500 VMs; the cost in time increases rapidly with the number of VMs. Because both the Analyser and the Data Modeller are scheduled processes, primarily used for consolidation of VMs, resource provisioning, long term decision support, and storage conservation, such a scale of time delay is tolerable.

Figure 7. Comparison of compressed Cloud Snapshot size (x-axis: Number of VMs; y-axis: CS Size (KB))

Figure 8. Cloud Snapshot transmission cost (x-axis: Number of VMs; y-axis: CS Transmission Cost (sec))

C. Calendar-based Storage Model

Figure 9 shows the comparison of the original data size and its data model object size. With larger data sets the CBSM can save more storage space. The original data was one week of CPU utilization for 300 VMs. Sampling intervals were set to 1 to 5 minutes (corresponding to the MTPI). A bigger MTPI means fewer metric readings. Both the original data and the data model objects were compressed using the ZIP stream algorithm provided by the standard Java package. Data models were built using a Local Polynomial Fitting algorithm provided by the R framework. Because all data models were built with a fixed sampling interval (1 hour), the size of the data models does not change with the MTPI.

Figure 9. Comparison of data size and data model object size (x-axis: MTPI (minutes); y-axis: Data Size (MB); series by Type: Data, Model)

IV. RELATED WORK

Clouds and their services operate in a virtualised environment. The adoption of virtualization technology decouples the traditional relationship between operating systems and physical machines. It offers opportunities for inserting layers of infrastructure management and operation automation.

Cloud technology vendors, such as Cisco, Microsoft and VMware, provide their own on-site, proprietary cloud system management suites. Cisco Systems has outlined a notable cloud capacity management strategy based on the ITIL v3 (Information Technology Infrastructure Library Version 3) reference architecture. Its key concept is to build a Cloud Capacity Model [1]. The capacity model consists of three planes: Component, Service/Domain, and Business. The Component plane contains all available resources, and they are building blocks for the Service/Domain plane. These resource building blocks are divided into different component catalogues. Example component catalogues are network, storage, and compute. In the Service/Domain plane, each component catalogue is associated with a Service Model, Demand Model, and Service Forecast. The Business plane consists of a Service Catalogue and a Business Forecast. Capacity plans are produced with the Business Forecast and Service Forecast as the two primary inputs. Microsoft, as a major cloud player, also provides a private cloud management solution, VMM (Virtual Machine Manager) [2]. A noteworthy component of VMM is the Library. A Library acts as a resource repository. It contains various resources including VM images, scripts, and best practice templates. Leveraging the Library maximizes resource reusability and avoids error-prone tasks. VMware CapacityIQ [4] is another cloud infrastructure management solution, offered by VMware Technologies. Its basic function is to collect statistical/historical information about cloud objects for management personnel. Its unique capability is modelling potential changes to the virtualized environment of clouds. These solutions are categorised as passive management. They require IT personnel to operate and

of automation. In contrast, this work aims to provide an automated operation management solution.

There are also third parties providing cloud operation management solutions. BMC Software [5] provides comprehensive solutions for managing cloud services and infrastructures. Service performance is proactively analysed by an Application Behaviour Learning Engine, which is based on statistical analytic techniques, and cloud resources are continuously optimised [5] [9]. Netuitive [6] is a similar commercially available solution. Architecturally, it consists of three tiers: Aggregation, Correlation, and Presentation. The Aggregation tier transmits cloud information to the service (Correlation and Presentation) tier. The Correlation tier provides self-learning mechanisms that learn cloud

VI. ACKNOWLEDGEMENT
