Re-architecting for Nonstop Innovation


Unlocking productivity, scalability, and lower costs for cloud natives

Table of Contents

Chapter 01: Cloud-native architecture
- Why move to a cloud-native architecture?
- What is a cloud-native architecture?
- Migrating to cloud native

Chapter 02: Principles & practices
- Microservices architecture principles and practices
- Reference architecture
- Containerize production services
- Create effective CI/CD pipelines
- Focus on security

Chapter 03: Getting started
- Getting started
- Further reading

Chapter 01: Cloud-native architecture

Why move to a cloud-native architecture?

Many companies build their custom software services using monolithic architectures. This approach has advantages: monolithic systems are relatively simple to design and deploy, at least at first. However, it can become hard to maintain developer productivity and deployment velocity as applications grow more complex, leading to systems that are expensive and time-consuming to change and risky to deploy. This paper shows how to re-architect your applications to a cloud-native paradigm that allows you to accelerate delivery of new features even as you grow your teams, while also improving software quality and achieving higher levels of stability and availability.

As services, and the teams responsible for them, grow, they tend to become more complex and harder to evolve and operate. Testing and deployment become more painful, adding new features becomes harder, and maintaining reliability and availability can be a struggle.

Research by Google's DORA team shows that it is possible to achieve high levels of software delivery throughput and service stability and availability across organizations of all sizes and domains. High-performing teams are able to deploy multiple times per day, get changes out to production in less than a day, restore service in less than an hour, and achieve change fail rates of 0-15%.1

Furthermore, high performers are able to achieve higher levels of developer productivity, measured in terms of deployments per developer per day, even as they increase the size of their teams. This is shown in Figure 1.

1 Find out how your team is performing based on these four key metrics at https://cloud.google.com/devops/

Figure 1: The impact of a team's software delivery performance on its ability to scale developer productivity, from the 2015 State of DevOps Report.2 (The figure plots deployment frequency, log10 of deploys per day per developer, against the number of developers, for low, medium, and high performers.)

The rest of this paper shows how to migrate your applications to a modern cloud-native paradigm to help achieve these outcomes. By implementing the technical practices described in this paper, you can reach the following goals:

- Increased developer productivity, even as you increase your team sizes.
- Faster time-to-market: add new features and fix defects more quickly.
- Higher availability: increase the uptime of your software, reduce the rate of deployment failures, and reduce time-to-restore in the event of incidents.
- Improved security: reduce the attack surface area of your applications, and make it easier to detect and respond rapidly to attacks and newly discovered vulnerabilities.
- Better scalability: cloud-native platforms and applications make it easy to scale horizontally where necessary, and to scale down too.
- Reduced costs: a streamlined software delivery process reduces the costs of delivering new features, and effective use of cloud platforms substantially reduces the operating costs of your services.

2 -devops 2015.pdf

What is a cloud-native architecture?

Monolithic applications must be built, tested, and deployed as a single unit. Often, the operating system, middleware, and language stack for the application are customized or custom-configured for each application. The build, test, and deployment scripts and processes are typically also unique to each application. This is simple and effective for greenfield applications, but as they grow, it becomes harder to change, test, deploy, and operate such systems.

Furthermore, as systems grow, so does the size and complexity of the teams that build, test, deploy, and operate the service. A common but flawed approach is to split teams out by function, leading to hand-offs between teams that tend to drive up lead times and batch sizes and lead to significant amounts of rework. DORA's research shows high-performing teams are twice as likely to be developing and delivering software in a single, cross-functional team.

Symptoms of this problem include:

- Long, frequently broken build processes
- Infrequent integration and test cycles
- Increased effort required to support the build and test processes
- Loss of developer productivity
- Painful deployment processes that must be performed out of hours, requiring scheduled downtime
- Significant effort in managing the configuration of test and production environments

In the cloud-native paradigm, in contrast:3

- Complex systems are decomposed into services that can be independently tested and deployed on a containerized runtime (a microservices or service-oriented architecture);
- Applications use standard platform-provided services, such as database management systems (DBMS), blob storage, messaging, CDN, and SSL termination;
- A standardized cloud platform takes care of many operational concerns, such as deployment, autoscaling, configuration, secrets management, monitoring, and alerting. These services can be accessed on-demand by application development teams;
- Standardized operating system, middleware, and language-specific stacks are provided to application developers, and the maintenance and patching of these stacks are done out-of-band by either the platform provider or a separate team;
- A single cross-functional team can be responsible for the entire software delivery lifecycle of each service.

3 This is not intended to be a complete description of what "cloud native" means: for a discussion of some of the principles of cloud-native architecture, visit cture-what-it-is-and-how-to-master-it.

This paradigm provides many benefits:

- Faster delivery: Since services are now small and loosely coupled, the teams associated with those services can work autonomously. This increases developer productivity and development velocity.
- Reliable releases: Developers can rapidly build, test, and deploy new and existing services on production-like test environments. Deployment to production is also a simple, atomic activity. This substantially speeds up the software delivery process and reduces the risk of deployments.
- Lower costs: The cost and complexity of test and production environments is substantially reduced because shared, standardized services are provided by the platform, and because applications run on shared physical infrastructure.
- Better security: Vendors are now responsible for keeping shared services, such as DBMS and messaging infrastructure, up-to-date, patched, and compliant. It's also much easier to keep applications patched and up-to-date because there is a standard way to deploy and manage applications.
- Higher availability: Availability and reliability of applications is increased because of the reduced complexity of the operational environment, the ease of making configuration changes, and the ability to handle autoscaling and autohealing at the platform level.
- Simpler, cheaper compliance: Most information security controls can be implemented at the platform layer, making it significantly cheaper and easier to implement and demonstrate compliance. Many cloud providers maintain compliance with risk management frameworks such as SOC2 and FedRAMP, meaning applications deployed on top of them only have to demonstrate compliance with residual controls not implemented at the platform layer.

However, there are some trade-offs associated with the cloud-native model:

- All applications are now distributed systems, which means they make significant numbers of remote calls as part of their operation. This means thinking carefully about how to handle network failures and performance issues, and how to debug problems in production.
- Developers must use the standardized operating system, middleware, and application stacks provided by the platform. This makes local development harder.
- Architects need to adopt an event-driven approach to systems design, including embracing eventual consistency.

Migrating to cloud native

Many organizations have adopted a "lift-and-shift" approach to moving services to the cloud. In this approach, only minor changes are required to systems, and the cloud is basically treated as a traditional datacenter, albeit one that provides substantially better APIs, services, and management tooling compared with traditional datacenters. However, lift-and-shift by itself provides none of the benefits of the cloud-native paradigm described above.

Many organizations stall at lift-and-shift because of the expense and complexity of moving their applications to a cloud-native architecture, which requires rethinking everything from application architecture to production operations and indeed the entire software delivery lifecycle. This fear is rational: many large organizations have been burned by failed multi-year, "big bang" replatforming efforts.

The solution is to take an incremental, iterative, and evolutionary approach to re-architecting your systems to cloud native, enabling teams to learn how to work effectively in this new paradigm while continuing to deliver new functionality: an approach that we call "move-and-improve".

A key pattern in evolutionary architecture is known as the strangler fig application.4 Rather than completely rewriting systems from scratch, write new features in a modern, cloud-native style, but have them talk to the original monolithic application for existing functionality. Gradually shift existing functionality over time as necessary for the conceptual integrity of the new services, as shown in Figure 2.

Figure 2: Using the strangler fig pattern to incrementally re-architect your monolithic application

4 Described in tion.html
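On Kubernetes, one way to realize the strangler fig pattern is path-based routing at the edge: traffic for a newly extracted feature goes to its new service, while everything else still reaches the monolith. The following is a minimal sketch; the path and service names are hypothetical.

```yaml
# Sketch: route one extracted feature to a new cloud-native service,
# while all remaining traffic still reaches the legacy monolith.
# Service names (recommendations-svc, monolith) are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: strangler-routing
spec:
  rules:
  - http:
      paths:
      - path: /recommendations        # the newly extracted feature
        pathType: Prefix
        backend:
          service:
            name: recommendations-svc
            port:
              number: 80
      - path: /                       # everything else: the monolith
        pathType: Prefix
        backend:
          service:
            name: monolith
            port:
              number: 80
```

With this approach, cutting a feature over is a routing change rather than a code change, and rolling back means removing a rule.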

Here are three important guidelines for successfully re-architecting:

First, start by delivering new functionality fast rather than reproducing existing functionality. The key metric is how quickly you can start delivering new functionality using new services, so that you can quickly learn and communicate good practice gained from actually working within this paradigm. Cut scope aggressively with the goal of delivering something to real users in weeks, not months.

Second, design for cloud native. This means using the cloud platform's native services for DBMS, messaging, CDN, networking, blob storage, and so forth, and using standardized platform-provided application stacks wherever possible. Services should be containerized, making use of the serverless paradigm wherever possible, and the build, test, and deploy process should be fully automated. Have all applications use platform-provided shared services for logging, monitoring, and alerting. (It is worth noting that this kind of platform architecture can be usefully deployed for any multi-tenant application platform, including a bare-metal on-premises environment.) A high-level picture of a cloud-native platform is shown in Figure 3, below.

Figure 3: High-level anatomy of a cloud platform. Applications are the customer's responsibility, and application stacks come from either the service provider or the customer. The service provider is responsible for the rest: the container scheduler (deployments, app and service management, routing, auth), dynamic routing, a managed service broker exposing managed services (DB, storage, identity, HTTPS termination, and so on, from the service provider or third parties), a hardened base image layer that includes intrusion, threat, and vulnerability detection, and an infrastructure layer (IaaS managed using the infrastructure-as-code paradigm).
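As a sketch of the "design for cloud native" guidance above, a containerized service on a serverless container platform (Cloud Run implements the Knative serving API) can be described with a manifest this small, because routing, autoscaling, and HTTPS termination are platform concerns rather than application code. The service and image names below are placeholders.

```yaml
# Sketch: a serverless containerized service. The platform supplies
# routing, autoscaling (including to zero), and HTTPS termination.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: orders                          # placeholder service name
spec:
  template:
    spec:
      containers:
      - image: gcr.io/my-project/orders:1.0   # placeholder image
        ports:
        - containerPort: 8080           # the port the app listens on
```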

Finally, design for autonomous, loosely-coupled teams who can test and deploy their own services. Our research shows that the most important architectural outcomes are whether software delivery teams can answer "yes" to these six questions:

- We can make large-scale changes to the design of our system without the permission of somebody outside the team;
- We can make large-scale changes to the design of our system without depending on other teams to make changes in their systems or creating significant work for other teams;
- We can complete our work without communicating and coordinating with people outside the team;
- We can deploy and release our product or service on demand, regardless of other services it depends upon;
- We can do most of our testing on demand, without requiring an integrated test environment;
- We can perform deployments during normal business hours with negligible downtime.

Our research shows that the extent to which teams agreed with these statements strongly predicted high software performance: the ability to deliver reliable, highly available services multiple times per day. This, in turn, is what enables high-performing teams to increase developer productivity (measured in terms of number of deployments per developer per day) even as the number of teams increases.

Check on a regular basis whether teams are working towards these goals, and prioritize achieving them. This usually involves rethinking both organizational and enterprise architecture. In particular, it is crucial to organize teams so that all the various roles needed to build, test, and deploy software, including product managers, work together and use modern product management practices to build and evolve the services they are responsible for. This need not involve changes to organizational structure. Simply having those people work together as a team on a day-to-day basis (sharing a physical space where possible), rather than having developers, testers, and release teams operating independently, can make a big difference to productivity.

Chapter 02: Principles & practices

Microservices architecture principles and practices

When adopting a microservices or service-oriented architecture, there are some important principles and practices you must ensure you follow. It's best to be very strict about following these from the beginning, as it is more expensive to retrofit them later on.

- Every service should have its own database schema. Whether you're using a relational database or a NoSQL solution, each service should have its own schema that no other service accesses. When multiple services talk to the same schema, over time the services become tightly coupled together at the database layer. These dependencies prevent services from being independently tested and deployed, making them harder to change and more risky to deploy.

- Services should only communicate through their public APIs over the network. All services should expose their behavior through public APIs, and services should only talk to each other through these APIs. There should be no "back door" access, or services talking directly to other services' databases. This avoids services becoming tightly coupled, and ensures inter-service communication uses well-documented and supported APIs.

- Services are responsible for backwards compatibility for their clients. The team building and operating a service is responsible for making sure that updates to the service don't break its consumers. This means planning for API versioning and testing for backwards compatibility, so that when you release new versions, you don't break existing customers. Teams can validate this using canary releasing. It also means making sure deployments do not introduce downtime, using techniques such as blue/green deployments or staged roll-outs.

- Create a standard way to run services on development workstations. Developers need to be able to stand up any subset of production services on development workstations on demand using a single command. It should also be possible to run stub versions of services on demand; make sure you use the emulated versions of cloud services that many cloud providers supply to help you. The goal is to make it easy for developers to test and debug services locally.

- Invest in production monitoring and observability. Many problems in production, including performance problems, are emergent and caused by interactions between multiple services. Our research shows it's important to have a solution in place that reports on the overall health of systems (for example, are my systems functioning? do my systems have sufficient resources available?) and that teams have access to tools and data that help them trace, understand, and diagnose infrastructure problems in production environments, including interactions between services.

- Set service-level objectives (SLOs) for your services and perform disaster recovery tests regularly. Setting SLOs for your services sets expectations on how they will perform and helps you plan how your system should behave if a service goes down (a key consideration when building resilient distributed systems). Test how your production system behaves in real life using techniques such as controlled failure injection as part of your disaster recovery testing plan; DORA's research shows that organizations that conduct disaster recovery tests using methods like this are more likely to have higher levels of service availability. The earlier you get started with this, the better, so you can normalize this kind of vital activity.

This is a lot to think about, which is why it's important to pilot this kind of work with a team that has the capacity and capability to experiment with implementing these ideas. There will be successes and failures; it's important to take lessons from these teams and leverage them as you spread the new architectural paradigm out through your organization.

Our research shows that companies that succeed use proofs-of-concept and provide opportunities for teams to share learnings, for example by creating communities of practice. Provide time, space, and resources for people from multiple teams to meet regularly and exchange ideas. Everybody will also need to learn new skills and technologies. Invest in the growth of your people by giving them budget for buying books, taking training courses, and attending conferences. Provide infrastructure and time for people to spread institutional knowledge and good practice through company mailing lists, knowledge bases, and in-person meetups.
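The "single command" practice described above can be sketched with a Compose file that stands up the service under development alongside a stubbed dependency and a local emulator of a managed cloud service. The image names and the stub service are hypothetical; the Datastore emulator and its DATASTORE_EMULATOR_HOST variable are one example of the provider-supplied emulators mentioned above.

```yaml
# docker-compose.yml sketch: `docker compose up` starts everything.
services:
  orders:                        # the service being developed locally
    build: ./orders
    ports:
      - "8080:8080"
    environment:
      # Point the service at the local emulator, not the managed service.
      DATASTORE_EMULATOR_HOST: datastore-emulator:8081
  inventory-stub:                # hypothetical stub of a dependency
    image: example/inventory-stub:latest
  datastore-emulator:            # provider-supplied local emulator
    image: gcr.io/google.com/cloudsdktool/cloud-sdk:emulators
    command: >
      gcloud beta emulators datastore start --host-port=0.0.0.0:8081
```

The same file can also run only a subset of services (`docker compose up orders datastore-emulator`), matching the "any subset on demand" goal.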

Reference architecture

In this section, we'll describe a reference architecture based on the following guidelines:

- Use containers for production services and a container scheduler such as Cloud Run or Kubernetes for orchestration
- Create effective CI/CD pipelines
- Focus on security

Containerize production services

The foundation of a containerized cloud application is a container management and orchestration service. While many different services have been created, one is clearly dominant today: Kubernetes. According to Gartner, "Kubernetes has emerged as the de facto standard for container orchestration, with a vibrant community and support from most of the leading commercial vendors." Figure 4 summarizes the logical structure of a Kubernetes cluster.

Kubernetes defines an abstraction called a pod. Each pod often includes just one container, like pods A and B in Figure 4, although a pod can contain more than one, as in pod C. Each Kubernetes service runs a cluster containing some number of nodes, each of which is typically a virtual machine (VM). Figure 4 shows only four VMs, but a real cluster might easily contain a hundred or more. When a pod is deployed on a Kubernetes cluster, the service determines which VMs that pod's containers should run in. Because containers specify the resources they need, Kubernetes can make intelligent choices about which pods are assigned to each VM.

Figure 4: A Kubernetes cluster runs workloads, with each pod comprising one or more containers.

Part of a pod's deployment information is an indication of how many instances (replicas) of the pod should run. The Kubernetes service then creates that many instances of the pod's containers and assigns them to VMs. In Figure 4, for example, pod A's deployment requested three replicas, as did pod C's deployment. Pod B's deployment, however, requested four replicas, and so this example cluster contains four running instances of container 2. And as the figure suggests, a pod with more than one container, such as pod C, will always have its containers assigned to the same node.

Kubernetes also provides other services, including:

- Monitoring running pods, so if a container fails, the service will start a new instance. This makes sure that all the replicas requested in a pod's deployment remain available.
- Load balancing traffic, spreading requests made to each pod intelligently across a container's replicas.
- Automated zero-downtime rollout of new containers, with new instances gradually replacing existing instances until a new version is fully deployed.
- Automated scaling, with a cluster autonomously adding or deleting VMs based on demand.

Figure 5: A CI/CD pipeline has several steps, from writing code to deploying a new container: changes are committed from Cloud Code and pushed to GitHub; Cloud Build detects the new commit, runs tests, and pushes the image to Container Registry; once the image has been scanned, Cloud Build deploys it to Google Kubernetes Engine.
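The replica behavior described above is requested declaratively. A minimal Deployment sketch asking for three replicas of a single-container pod (the names and resource figures are illustrative, not from the paper):

```yaml
# Sketch: Kubernetes keeps three instances of this pod running,
# restarting containers as needed to maintain the requested count.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-a
spec:
  replicas: 3                   # the requested number of instances
  selector:
    matchLabels:
      app: pod-a
  template:
    metadata:
      labels:
        app: pod-a
    spec:
      containers:
      - name: container-1
        image: example/container-1:1.0   # illustrative image
        resources:
          requests:
            cpu: "250m"       # stated resource needs let the scheduler
            memory: "128Mi"   # place the pod on a suitable VM
```

If a container fails, the scheduler starts a replacement so the observed replica count matches the requested one.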

Create effective CI/CD pipelines

Some of the benefits of refactoring a monolithic application, such as lower costs, flow directly from running on Kubernetes. But one of the most important benefits, the ability to update your application more frequently, is only possible if you change how you build and release software. Getting this benefit requires you to create effective CI/CD pipelines in your organization.

Continuous integration relies on automated build and test workflows that provide fast feedback to developers. It requires every member of a team working on the same code (for example, the code for a single service) to regularly integrate their work into a shared mainline or trunk. This integration should happen at least daily per developer, with each integration verified by a build process that includes automated tests. Continuous delivery aims at making deployment of this integrated code quick and low risk, largely by automating the build, test, and deployment process so that activities such as performance, security, and exploratory testing can be performed continuously. Put simply, CI helps developers detect integration issues quickly, while CD makes deployment reliable and routine.

To make this clearer, it's useful to look at a concrete example. Figure 5 shows how a CI/CD pipeline might look using Google tools for containers running on Google Kubernetes Engine.

It's useful to think of the process in two chunks, as shown in Figure 6:

Local development: The goal here is to speed up an inner development loop and provide developers with tooling to get fast feedback on the impact of local code changes. This includes support for linting, auto-complete for YAML, and faster local builds.

Remote development: When a pull request (PR) is submitted, the remote development loop kicks off. The goal here is to drastically reduce the time it takes to validate and test the PR via CI and perform other activities like vulnerability scanning and binary signing, while driving release approvals in an automated fashion.

Figure 6: Local and remote development loops

Here is how Google Cloud tools can help during this process:

Local development: Making developers productive with local application development is essential. This entails building applications that can be deployed to local and remote clusters. Prior to committing changes to a source control management system like GitHub, a fast local development loop ensures developers can test and deploy their changes to a local cluster. To support this, Google Cloud provides Cloud Code. Cloud Code comes with extensions to IDEs, such as Visual Studio Code and IntelliJ, to let developers rapidly iterate, debug, and run code on Kubernetes. Under the covers, Cloud Code uses popular tools, such as Skaffold, Jib, and kubectl, to give developers continuous feedback on code in real time.

Continuous integration: With the Cloud Build GitHub App, teams can trigger builds on different repo events (pull request, branch, or tag events) right from within GitHub. Cloud Build is a fully serverless platform and scales up and down in response to load, with no requirement to pre-provision servers or pay in advance for extra capacity. Builds triggered via the GitHub App automatically post status back to GitHub. The feedback is integrated directly into the GitHub developer workflow, reducing context switching.

Artifact management: Container Registry is a single place for your team to manage Docker images, perform vulnerability scanning, and decide who can access what with fine-grained access control. The integration of vulnerability scanning with Cloud Build lets developers identify security threats as soon as Cloud Build creates an image and stores it in Container Registry.

Continuous delivery: Cloud Build uses build steps to let you define specific steps to be performed as part of build, test, and deploy. For example, once a new container has been created and pushed to Container Registry, a later build step can deploy that container to Google Kubernetes Engine (GKE) or Cloud Run, along with the related configuration and policy. You can also deploy to other cloud providers if you are pursuing a multi-cloud strategy. Finally, if you want to pursue GitOps-style continuous delivery, Cloud Build lets you describe your deployments declaratively using files (for example, Kubernetes manifests) stored in a Git repository.

Deploying code isn't the end of the story, however. Organizations also need to manage that code while it executes. To do this, GCP provides operations teams with tools such as Cloud Monitoring and Cloud Logging.

Using Google's CI/CD tools certainly isn't required with GKE; you're free to use alternative toolchains if you wish. Examples include leveraging Jenkins for CI/CD or Artifactory for artifact management.

If you're like most organizations with VM-based cloud applications, you probably don't have a well-oiled CI/CD system today. Putting one in place is an essential part of getting benefits from your re-architected application, but it takes work. The technologies needed to create your pipelines are available, due in part to the maturity of Kubernetes. But the human changes can be substantial. The people on your delivery teams must become cross-functional, including development, testing, and operations skills. Shifting culture takes time, so be prepared to devote effort to changing the knowledge and behaviors of your people as they move to a CI/CD world.
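The build-test-push-deploy sequence described above can be sketched as a cloudbuild.yaml with one step per stage. The image name, zone, and cluster below are placeholders; the builder images and substitution variables are standard Cloud Build conventions.

```yaml
steps:
# Build the container image.
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/my-service:$SHORT_SHA', '.']
# Push it to Container Registry, which triggers vulnerability scanning.
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'gcr.io/$PROJECT_ID/my-service:$SHORT_SHA']
# Roll the new image out to the GKE cluster.
- name: 'gcr.io/cloud-builders/kubectl'
  args: ['set', 'image', 'deployment/my-service',
         'my-service=gcr.io/$PROJECT_ID/my-service:$SHORT_SHA']
  env:
  - 'CLOUDSDK_COMPUTE_ZONE=us-central1-b'     # placeholder zone
  - 'CLOUDSDK_CONTAINER_CLUSTER=my-cluster'   # placeholder cluster
images:
- 'gcr.io/$PROJECT_ID/my-service:$SHORT_SHA'
```

Triggering this build from a GitHub pull request, branch, or tag event is configured in the Cloud Build GitHub App rather than in this file.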

Focus on security

Re-architecting monolithic applications to a cloud-native paradigm is a big change. Unsurprisingly, doing this introduces new security challenges that you'll need to address. Two of the most important are:

- Securing access between containers
- Ensuring a secure software supply chain

The first of these challenges stems from an obvious fact: breaking your application up into containerized services (and perhaps microservices) requires some way for those services to communicate. And even though they're all potentially running on the same Kubernetes cluster, you still need to worry about controlling access between them. After all, you might be sharing that Kubernetes cluster with other applications, and you can't leave your containers open to these other apps.

Controlling access to a container requires authenticating its callers, then determining what requests these other containers are authorized to make. It's typical today to solve this problem (and several others) by using a service mesh. A leading example of this is Istio, an open source project created by Google, IBM, and others. Figure 7 shows where Istio fits in a Kubernetes cluster.

As the figure shows, the Istio proxy intercepts all traffic between containers in your application. This lets the service mesh provide several useful services without any changes to your application code. These services include:

- Security, with both service-to-service authentication using TLS and end-user authentication
- Traffic management, letting you control how requests are routed among the containers in your application
- Observability, capturing logs and metrics of communication among your containers

Figure 7: In a Kubernetes cluster with Istio, all traffic between containers goes through this service mesh. Proxies alongside each service carry the service-to-service traffic and are configured and observed via the mesh control plane; mesh capabilities include security (mTLS, authentication), traffic management (traffic splits, canary rollouts, mirroring, secure ingress), observability (metrics, logs, traces), and resiliency (rate limiting, quotas, circuit breaking, fault injection).
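The service-to-service authentication and authorization described above can be sketched with two Istio resources: one requiring mutual TLS within a namespace, and one allowing only a specific caller to reach a service. The namespace, app, and service-account names are placeholders.

```yaml
# Require mTLS for all workloads in the "prod" namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: prod
spec:
  mtls:
    mode: STRICT
---
# Allow only the frontend's identity to call the orders service.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: orders-allow-frontend
  namespace: prod
spec:
  selector:
    matchLabels:
      app: orders
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/prod/sa/frontend"]
```

Because the sidecar proxies enforce these policies, the services themselves need no code changes.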

GCP lets you add Istio to a GKE cluster. And even though using a service mesh isn't required, don't be surpri
