A Service Mesh Buyer’s Guide


By Dave Tucker
A PACKET PUSHERS - IGNITION WHITEPAPER
March 2020

Introduction
A Kubernetes Networking Refresher
What Is A Service Mesh?
Mesh Architecture
Service Mesh Capabilities
Traffic Management
Access Control
Identity Management
Connection Management
Observability
Service Discovery
Picking The Right Service Mesh
Service Mesh Interface (SMI)
Ingress
Mesh Comparison
Istio
Linkerd
Consul
Kuma
Maesh
SuperGloo
Network Service Mesh
Conclusions And Recommendations
About The Author

A Packet Pushers - Ignition Whitepaper 2020

Introduction

Service Mesh is one of the latest Cloud Native buzzwords to add to your bingo card. Nevertheless, you might find yourself in the position where your CXO, Enterprise Architect, or Consultant insists that you need one.

Or, if you are building an application using a microservices architecture, you might actually need one. This paper gives you the information to make an informed choice.

A service mesh is a software package that proxies connections between microservices and allows the user to configure a variety of traffic management features. This paper looks at key features and capabilities of a service mesh, and provides tables to help you identify which features are available among the different service mesh options.

The goal of this paper is to help you identify a short list of service mesh options for your developers and operations teams to test. Open source and commercially supported service meshes are available.

We focus mainly on using a service mesh in the context of Kubernetes, but you can also run a mesh outside of Kubernetes if you wish to.

A Kubernetes Networking Refresher

Networking in Kubernetes is complex; describing it in its entirety would be a paper on its own. Suffice to say, outside of vanilla Kubernetes, every network service relies on Container Network Interface (CNI) plugins. Most Kubernetes offerings will ship with a plugin by default, but if you are brave and operating Kubernetes yourself, then you will need to pick a plugin.

Examples of CNI plugins:
- Cisco ACI
- VMware NSX-T
- Project Calico
- Cilium
- OVN

The CNI plugins are responsible for the network plumbing, and implementation varies from plugin to plugin. However, the Kubernetes network model does not vary. It consists of pods and services:

- Pods
  - A pod is a collection of containers that share the same network and inter-process communication (IPC) namespaces
  - Containers in a pod can communicate over the network via localhost, via shared volumes, or via IPC
  - Each pod has a unique IP address
  - Pods can communicate with each other without NAT
- Services
  - A service provides a stable IP address and DNS name for one or more pods. This allows the pods backing a service to be replaced.

Kubernetes provides numerous ways to get traffic into your services, but the three suited to production use are:

| Ingress Option | Description |
|----------------|-------------|
| NodePort | Maps a port on every host in the cluster. An external load balancer can direct traffic to this service. |
| Load Balancer | Uses your cloud provider [1] (if you have one configured) to create and configure a load balancer to deliver traffic to a NodePort on each host. |
| Ingress | You must first install an Ingress Controller that will expose a NodePort on each host - you may need to configure an external load balancer to point to this port. Using an Ingress resource allows each service to provide its own routing information. Some implementations also support other protocols including TCP/UDP and gRPC. The controller may be shared among multiple services. |

Your CNI plugin decision may also influence your choice of Ingress Controller and possibly your service mesh.

The main advantage of using an Ingress is that it has a smaller footprint because it shares a single load balancer between multiple resources.

What Is A Service Mesh?

A service mesh is a software package that proxies connections among microservices and allows the user to configure a variety of traffic management features.

In a microservices architecture, your application is decomposed into a set of small, self-contained services. This allows you to scale, update, and even replace a service independently of the rest of the services that comprise an application.

Microservices architectures have become confused with running everything in containers - you can do the latter without embracing the former.

Microservices create a number of infrastructure challenges:
- Services will most likely have their own load balancer
- Services need to discover each other via some means
- Service-to-service traffic requires access control
- Services should have a means to ensure that other services they talk to are genuine to avoid spoofing or man-in-the-middle attacks
- Services need to be tolerant to planned and unplanned communication interruptions
- There must be a way to send logging, monitoring, and trace data to a centralized service
- The application may require an API Gateway to unify the service APIs so it can be consumed by a UI or mobile app

Mesh Architecture

A service mesh is designed to address many of the challenges listed above. Most mesh implementations resemble the following diagram:

[Diagram: blue service boxes, each touched by a purple sidecar proxy box]

In a service mesh, a sidecar proxy is added to each service to broker communications with other services. In the illustration above, the sidecar proxies are represented by the purple boxes that touch each of the blue services.

Service Mesh Capabilities

Traffic Management

One benefit of a service mesh is that you get more traffic management options than you would with a traditional load balancer because the mesh control plane is context-aware.

From a traffic routing standpoint, you can use HTTP headers or Path Prefixes to direct traffic to a given service. For example:

| Matcher | Rule |
|---------|------|
| Header X-UserType is “Internal” | Send to the instances of Service X that have the internal label |
| Path Contains /billing | Send to the Billing Service |

Traffic management rules let you split traffic among services, enabling a number of different deployment and testing strategies:

- Canary Rollouts
  - Increase % of traffic to a new or updated service gradually while the service is healthy
- A/B Testing
  - Split a percentage of users among two different service versions
- Blue/Green Deployment
  - Create a temporary route to Green Deployment and ensure it's healthy
  - Once you’re sure Green is healthy, route traffic away from Blue
  - Remove old route, and keep Blue around for a limited time in case of rollback.

Finally, you can configure what should happen when a destination is unavailable. This allows you to set up anything from cross-data-center failover to something more mundane like rewriting unspecified API versions to v1.
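To make the matcher rules and traffic splitting above more concrete, here is a minimal Python sketch. It is not the configuration syntax of any particular mesh: the destination names (service-x-internal, billing, service-x) and the functions route and pick_version are hypothetical, chosen only for illustration. Meshes typically express the same ideas declaratively rather than in application code.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Request:
    path: str
    headers: Dict[str, str]


# Each rule pairs a matcher with a destination. All names are hypothetical.
Rule = Tuple[Callable[[Request], bool], str]

RULES: List[Rule] = [
    # Header X-UserType is "Internal" -> internal instances of Service X
    (lambda r: r.headers.get("X-UserType") == "Internal", "service-x-internal"),
    # Path contains /billing -> Billing Service
    (lambda r: "/billing" in r.path, "billing"),
]


def route(request: Request, default: str = "service-x") -> str:
    """Return the destination of the first matching rule, else the default."""
    for matches, destination in RULES:
        if matches(request):
            return destination
    return default


def pick_version(weights: Dict[str, int], rng: random.Random) -> str:
    """Pick a service version in proportion to its weight (canary or A/B split)."""
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]
```

A canary rollout then amounts to changing the weights over time, for example from {"v1": 90, "v2": 10} towards {"v1": 0, "v2": 100} while the new version stays healthy.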

Access Control

Restricting service-to-service communication in a microservices architecture is hard because services and their constituent parts are fluid, and orchestration can move them to a number of different hosts.

Thankfully, a service mesh gives you some control because it understands the broader context. For example, a service mesh can speak to the Kubernetes API to understand what services you have, and where the pods are scheduled, to ensure that your policy is applied in the right places. Most service mesh implementations let you write a detailed policy to restrict service-to-service communications; some even employ a Drop-All rule by default.

Aside from being able to adapt to dynamic changes with ease, it's also possible to be more expressive in a rule set by allowing matches on connection metadata. For example, you could write a rule that says ‘Deny all traffic to the "payment" service when the source is the "cart" service and the source version is "v2".’

Identity Management

If you wanted to use TLS to secure connections between services without a service mesh, you would have to set up a PKI yourself and issue client/server certificates to all of your services. PKI is complex and learning to set it up and administer it correctly is hard.

A number of service mesh implementations recognize this and can be configured to enable automatic mutual TLS authentication. A component in the mesh control plane issues client and server certificates to the sidecar proxies. Proxy-to-proxy communication is automatically encrypted using mutual TLS; the client authenticates the server and vice versa. Pod-to-proxy communication is still in the clear, but this is less risky as the pod and sidecar proxy are co-located.

Connection Management

One of the many joys of running a distributed application is that, on occasion, there can be network issues that cause problems between services. Sometimes these are planned, such as when you upgrade software, but other times they can be the result of a real network problem.
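As a rough illustration of what tolerating such interruptions involves when it is done in application code, the Python sketch below retries a failing call and stops calling a destination after repeated failures (a simple circuit breaker). All names here (CircuitBreaker, CircuitOpenError, call_with_retries, and their parameters) are hypothetical and not taken from any mesh. Timeouts are assumed to be enforced by the function being called.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then permits a
    trial call once `reset_after` seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call rejected")
            # Cool-down elapsed: let one trial call through.
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result


def call_with_retries(func: Callable[[], T], attempts: int = 3,
                      delay: float = 0.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `func` up to `attempts` times, pausing `delay` seconds between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying is pointless while the breaker is open
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay)
    assert last_error is not None
    raise last_error
```

Every service would need logic along these lines, tuned per destination, which is why moving it out of application code is attractive.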

The burden is on service developers to make sure that their code can tolerate network glitches. Many service mesh implementations include connection management features that let you delegate that responsibility to the mesh.

You can configure connection timeouts based on your requirements: either you fail fast to increase performance; or you can wait for a transient network issue to pass. The appropriate timeout configuration will vary based on application use cases, but once a timeout is exceeded, the call fails. You can then optionally configure connection retries that will, as the name suggests, retry the call. Like timeouts, the right setting is application-dependent.

Finally, there’s the concept of a circuit breaker. Circuit breakers let you set the conditions under which a circuit breaker will "flip" and prevent further connections to a given host within a service. This could be based on the number of concurrent connections, number of failed connections, or perhaps latency spikes.

Observability

Logging, monitoring, and tracing are important in any application, but perhaps more so in a distributed application. Using a service mesh creates an additional layer of indirection between operators and the network, so it's important to ensure that this doesn't become a blind spot. The services and the mesh itself must be able to send logs, metrics, and trace data as well as interoperate with application-level tracing.

Service Discovery

Service discovery is not a core function of a service mesh, but it could factor into your choice. Kubernetes offers the use of either DNS or Environment Variables to discover services running in the same cluster. If you need to make services available across multiple clusters, then you have to wait for Kubernetes Federation [2], which is not yet generally available.

However, if you need a service discovery solution now and you're willing to use something that sits outside of Kubernetes, then Consul offers a rich feature set, including:
- Sync Kubernetes Services to Consul and vice-versa
- Multi-data-center support
- Provides a DNS-based query API

The advantage of Consul in this context is that it can also be used as a service mesh. More on that later.

Picking The Right Service Mesh

If you've made it this far, you probably still need a service mesh. It is absolutely possible to make your own - each problem in the "What is a Service Mesh" section could be solved by writing your own application or infrastructure code.

However, you may prefer the convenience of having these problems solved in a single product with a vibrant open source community, or a vendor, to support it.

Service Mesh Interface (SMI)

As you examine how each mesh behaves, you’ll see that every single one requires a slightly different mental model. Even though the functionality is similar, the model in which you configure the features varies.

To address this issue, the Service Mesh Interface (SMI) project is being driven by Microsoft and its partners to help standardize service mesh constructs within Kubernetes.

This is great for users, as it will allow you to move between mesh implementations without having to change your service configurations. Unfortunately, the specification is still under development and very few implementations have first-party support for it.

Ingress

Keen readers will have noted that a lot of the functionality provided by a service mesh is also relevant for Ingress Controllers / API gateways.

It is true that some service meshes could also be used to provide Ingress functionality, but I would not recommend this approach. A dedicated Ingress Controller can provide you with easier access to features like API rate limiting and authentication.

Therefore, I would advocate for integrating your mesh with a specialized Ingress Controller, as this will provide you with the best of both worlds.

Mesh Comparison

This table compares the popular service mesh projects available at the time of writing (February 2020):

[Table: Istio, Linkerd, Consul, Kuma, and Maesh compared across Commercial Support, Deployment, Traffic (header partly lost), Failover, Connection Management (Timeouts, Retries, Circuit Breaking), Access Control, mTLS, Observability (Logs, Metrics, Tracing), and Cloud Support on AKS (Azure), EKS (Amazon), and GKE (Google). The per-cell marks are missing from this transcription; note markers appear against Istio (1, 4), Kuma (2, 3), and Maesh (2).]

1 Support available through many different vendors
2 Not shipping yet but beta available
3 Not GA
4 Gateways provide mesh - outside connectivity but not bidirectional sync

Istio

Istio is a popular service mesh that was created by Google, IBM, Lyft, and other contributors. It's a full-featured service mesh based on the Envoy proxy.

Istio's core building blocks are Virtual Services, which use routing rules to direct traffic to services. In this context, a service is a Fully Qualified Domain Name (FQDN) or a short name - i.e. service name - that can resolve (via Kube SD) to an FQDN.

Destination rules can also be created to split a service into subsets, for example, a subset per supported version of the API. These rules can then be used in Virtual Services.

One particularly interesting feature of Istio is Network Fault Injection. This lets you simulate delays and aborts via a Virtual Service and could be used to test that the application fails gracefully.

While support for Istio isn't available directly, Istio is the foundational component of a number of commercial offerings:
- Red Hat OpenShift Service Mesh
- VMware NSX Service Mesh
- Cisco Container Platform Service Mesh
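The idea behind fault injection, as described for Istio above, can be modelled in a few lines of Python. This is a conceptual sketch only and not Istio's API, which is configured declaratively; the names with_faults and InjectedAbort and all parameters are hypothetical. It wraps a request handler so that a chosen percentage of requests are delayed or aborted before reaching the real service.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedAbort(Exception):
    """Stands in for an HTTP error returned in place of the real response."""

    def __init__(self, status: int) -> None:
        super().__init__(f"injected abort with HTTP status {status}")
        self.status = status


def with_faults(handler: Callable[[Any], Any], *,
                delay_seconds: float = 0.0, delay_percent: float = 0.0,
                abort_status: int = 503, abort_percent: float = 0.0,
                rng: Optional[random.Random] = None,
                sleep: Callable[[float], None] = time.sleep) -> Callable[[Any], Any]:
    """Wrap `handler` so a share of requests is delayed and/or aborted."""
    if rng is None:
        rng = random.Random()

    def wrapped(request: Any) -> Any:
        # Delay roughly `delay_percent` percent of requests.
        if rng.random() * 100 < delay_percent:
            sleep(delay_seconds)
        # Abort roughly `abort_percent` percent of requests.
        if rng.random() * 100 < abort_percent:
            raise InjectedAbort(abort_status)
        return handler(request)

    return wrapped
```

Pointing a client at a handler wrapped this way, with a small abort percentage, is one way to check that callers degrade gracefully rather than crash.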

Linkerd

The Linkerd service mesh has been around since 2016. It was created, and continues to be developed, by the founders of Buoyant. It is not based on the Envoy proxy and its key selling point is that it's fast and lightweight.

Like Istio, it also supports Fault Injection, but I'm very impressed with Linkerd's Debug Sidecar. When attached, it streams live tshark output so you can see packets flowing from the service. It also allows you to run your own commands in the context of the network.

Consul

Consul by HashiCorp is well known in the service discovery space. HashiCorp recently launched Consul Connect, which adds service mesh functionality.

As with most HashiCorp tools, the configuration is sensible and easy to use. Note that Consul is not just confined to Kubernetes - it can be used with a variety of other platforms.

It also claims support for multiple data centers; combined with its excellent traffic management features, this allows for supporting high availability among data centers.

Kuma

Kuma, by Kong, builds on top of the Envoy proxy.

It claims to be platform agnostic and will run across data centers whether or not you use Kubernetes.

Kuma's main selling points are its simple installation and its API. While I can't comment on the installation, the API does look relatively simple. Some of the terms may take getting used to; for instance, Kuma uses "dataplane" to mean "sidecar proxy" in that context.

Maesh

Maesh, from Containous, is a new service mesh offering from the creators of Traefik - the foundation of this mesh implementation. Maesh touts ease of configuration, which is in part thanks to its adoption of Service Mesh Interface (SMI). Another key selling point is that it does not use sidecar proxies.

The only drawback of Maesh is that because it operates without sidecars, it has to hijack DNS resolution, so your services have to be resolved using the suffix ".maesh".

SuperGloo

SuperGloo, by Solo, is a service mesh orchestration platform and not a mesh itself.

What's interesting about this approach is that it offers you the ability to change between meshes without changing your configuration.

Alternatively, you may have situations that require two meshes, and SuperGloo will be able to handle this too.

This may sound too good to be true; the bad news is that it just might be. At the time of this writing, SuperGloo is under active development and may be missing some essential features.

Network Service Mesh

Network Service Mesh (NSM) is not a service mesh as defined in this paper. NSM would argue that the other meshes in this list are Application Service Meshes and that it is fundamentally different as it operates at L2/L3 instead of L7.

NSM solves a different use-case: it wants to provide the underlying L2/L3 connections that Application Service Meshes assume already exist, with a particular focus on cluster-to-cluster communications.

NSM seeks to decouple runtime domains - where your containers/VMs/bare-metal machines run workloads - and network domains such that workloads can be provided with the appropriate type of network based on requirements rather than all workloads being treated the same.

It does this by asking you to provide context on how you’d like your workloads to communicate and then implements this by creating “virtual wires” between services. The actual protocol and connection type used for the wire will depend on the requirements and the capability at each end of the tunnel.

Conclusions And Recommendations

There are a variety of service meshes available at the moment, both as open source projects and with commercial backing. The art of picking the right mesh boils down to having a solid understanding of your requirements, understanding where the trade-offs lie in attempting to implement a mesh yourself, and then looking at those currently available to see what is a good fit.

As the mesh will impact both your development team and your operations team, it’s recommended they undertake the selection process together.

Perhaps the biggest trade-off of any service mesh is that you are taking the complexity of writing distributed applications away from developers and relying on a product to provide you with a “happy path”. This can be dangerous as issues become more difficult to debug when relying on the magic underneath. As such, debugging tools should be tested thoroughly before committing to a decision.

If you are in full control of your own Kubernetes installation then you have a large choice; however if you are consuming a Kubernetes Service from one of the major cloud providers then you may experience issues during installation and configuration.

Items with a question mark in the comparison table do not have documented support and getting this working correctly is an exercise for the reader.

Finally, tooling to create Kubernetes clusters for test and development is readily available - MicroK8s for example - and most meshes can easily be set up within an hour. Hands-on experience is going to be the most valuable indicator of which mesh is best for you, so go ahead and get your hands dirty!

[1] Cloud Provider here is meant in the context of a Kubernetes Cloud Provider which may not actually be in the Cloud: [...]
[2] [...]netes-sigs/kubefed

About The Author

Dave Tucker is a co-founder of SocketPlane, a container networking startup acquired by Docker. Dave has worn many hats in his technical career, including engineer, project manager, founder, and consultant.

You can find him online at dtucker.co.uk and on Twitter @dave tucker.
