AWS Reference Network Architecture


CDP Public Cloud
AWS Reference Network Architecture
Date published: 2019-08-22
Date modified: 2022-06-08
https://docs.cloudera.com/

Legal Notice

Cloudera Inc. 2022. All rights reserved.

The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property rights. No license under copyright or any other intellectual property right is granted herein.

Unless otherwise noted, scripts and sample code are licensed under the Apache License, Version 2.0.

Copyright information for Cloudera software may be found within the documentation accompanying each component in a particular release.

Cloudera software includes software from various open source or other third party projects, and may be released under the Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms. Other software included may be released under the terms of alternative open source licenses. Please review the license and notice files accompanying the software for additional licensing information.

Please visit the Cloudera software product page for more information on Cloudera software. For more information on Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your specific needs.

Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor liability arising from the use of products, except as expressly agreed to in writing by Cloudera.

Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered trademarks in the United States and other countries. All other trademarks are the property of their respective owners.

Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA, CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER’S BUSINESS REQUIREMENTS. WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED ON COURSE OF DEALING OR USAGE IN TRADE.

Contents

AWS reference network architecture
Use cases
Taxonomy of network architectures
Management Console to customer cloud network
Customer on-prem network to cloud network
Network architecture
Architecture diagrams
Component description
VPC
Subnets
Gateways and route tables
Security groups
DNS
DHCP option set
Determining the CIDR range
Option 1: CDP creates the VPCs and subnets
Option 2: Existing VPC and subnets
DNS
Associating additional CIDRs to a VPC

CDP Public Cloud reference network architecture for AWS

This topic includes a conceptual overview of the CDP Public Cloud architecture for AWS.

Overview

CDP Public Cloud allows customers to set up cloud Data Lakes and compute workloads in their cloud accounts on AWS, Azure, and Google Cloud. It maps a cloud account to a concept called the Environment into which all workloads are launched. For these Data Lakes and workloads to function correctly, several elements of the cloud architecture need to be configured appropriately. These include access permissions, networking setup, cloud storage, and so on. Broadly, these elements can be configured in one of two ways:

- CDP can set up these elements for the customer. In this model, the customer has to provide cloud account access to CDP via a cross-account role to create and manage these various elements. Usually, this model helps to set up a working environment quickly and try out CDP. However, many enterprise customers prefer or even mandate specific configurations of a cloud environment for Infosec or compliance reasons. Setting up elements such as networking and cloud storage requires prior approvals, and these customers generally do not prefer, or even actively prevent, a third-party vendor like Cloudera setting up these elements automatically.
- CDP can work with pre-created elements provided by the customer. In this model, the flow for creating the cloud Data Lakes accepts pre-created configurations of the cloud environment and launches workloads within those boundaries. This model is clearly more aligned with enterprise requirements. However, it brings with it the risk that the configuration might not meet CDP requirements. As a result, customers might face issues launching CDP workloads, and the turnaround time to get to a working environment might be much longer and involve many tedious interactions between Cloudera and the customer cloud teams.

The most complicated of these elements of the cloud environment, from our experience in working with several enterprise customers, is the cloud network configuration. The purpose of this document is to clearly articulate the networking requirements needed for setting up a functional CDP Public Cloud environment into which the Data Lakes and compute workloads of different types can be launched and used. It identifies the different points of access to these workloads and explains how the given architecture helps to accomplish this access.

Along with this document, you can use the “cloudera-deploy tool” to automatically set up a model of this reference architecture, which can then be reviewed for security and compliance purposes.

Related Information
cloudera-deploy tool

Use cases

This topic covers use cases for CDP Public Cloud for AWS.

CDP Public Cloud allows customers to process data in cloud storage under a secure and governed Data Lake using different types of compute workloads, which are called CDP data services. Typically the lifecycle of these workloads is as follows:

- A CDP environment is set up by a CDP administrator using their cloud account. This sets up a cloud Data Lake cluster with security and governance services and an identity provider for this environment. The CDP administrator may need to work with a cloud administrator to create all the cloud provider resources (including networking resources) that are required by CDP.
- Then one or more compute CDP data services can be launched, linked to this Data Lake. Each of these CDP data services typically serves a specific purpose such as data ingestion, analytics, machine learning, and so on.
- These compute CDP data services are accessed by data consumers like data engineers, analysts, or scientists. This is the core purpose of using CDP on the public cloud.
- These compute CDP data services can be long-running or ephemeral, depending on the customer needs.

As can be seen above, there may be two types of users for CDP who interact with it for different purposes:

- CDP admins - These persons are usually concerned with the launch and maintenance of the cloud environment, and the Data Lake, Data Hubs, and CDP data services running inside the environment. They use a Management Console running in the Cloudera AWS account to perform these operations of managing the environment.
- Data consumers - These are the data scientists, analysts, and engineers who use the CDP data services to process data. They mostly interact directly with the CDP data services running in their cloud account. They could access these either from their corporate networks (typically through a VPN) or from other cloud networks their corporation owns.

The above is represented in the following diagram:

Taxonomy of network architectures

This topic provides a high-level overview of each type of network architecture that CDP supports.

At a high level, there are several types of network architectures CDP supports. As can be expected, each type brings a unique trade-off among various aspects, such as ease of setup, security provided, workloads supported, and so on. This section only provides a high-level overview of each type. The characteristics of each type are explained under appropriate sections in the rest of the document. Users must review the advantages and disadvantages of each of these types in detail before making a choice suitable to their needs.

Name: Publicly Accessible Networks
Description: Deploys customer workloads to hosts with public IP addresses. Security groups must be used to restrict access only to corporate networks as needed.
Trade-offs: Easy to set up for POCs. Low security levels.

Name: Semi-Private Networks
Description: Deploys customer workloads to private subnets, but exposes services which data consumers need access to over a load balancer with a public IP address. Security groups or allow-lists (of IP addresses or ranges) on load balancers must be used to restrict access to these public services only to corporate networks as needed.
Trade-offs: This option is fairly easy to set up too, but it may not solve all the use cases of access (see “Semi-private networks”). The surface of exposure is reduced, and it is reasonably secure.

Name: Fully Private Networks
Description: Deploys customer workloads to private subnets, and even services which data consumers need access to are only on private IPs. Requires connectivity to corporate networks to be provided using solutions like VPN Gateways, and so on.
Trade-offs: Complex to set up depending on prior experience of establishing such connectivity, primarily due to the way the customer has to solve the corporate network peering problem. But it is very secure.

Name: Fully Private Outbound Restricted Networks
Description: This is the same as Fully Private Networks, except that, in addition, Cloudera also provides a mechanism for users to configure an outbound proxy or firewall to monitor or restrict the communication outside their networks.
Trade-offs: Most complex to set up, mainly considering the varied needs that data consumers would have to connect outside the VPC on an evolving basis. It is also the most secure for an enterprise.

Management Console to customer cloud network

This topic explains the possible ways in which CDP Control Plane can communicate with the compute infrastructure in the customer network, in the context of the Management Console.

As described previously, the CDP admin would typically use the CDP Management Console that runs in the ‘CDP Control Plane’ to launch Data Lakes and CDP data services into their cloud accounts. In order to accomplish this, the CDP Control Plane and the compute infrastructure in the customer network (EC2 instances, or EKS clusters) should be able to communicate with each other. This communication can occur in the following ways:

Publicly accessible networks

In this model, the compute infrastructure must be reachable over the public internet from the Management Console. While this is fairly easy to set up, it is usually not preferred by enterprise customers, as it implies that the EC2 nodes or EKS nodes are assigned public IP addresses.
While the access control rules for these nodes can still be restricted to the IP addresses of the Management Console components, it is still considered insecure for each of the network architectures described earlier.

Semi-private networks

Publicly accessible networks are easy to set up for connectivity, both from the CDP Control Plane and the customer on-prem network, but have a large surface area of exposure as all compute infrastructure has public IP addresses. In contrast, fully private networks need special configuration to enable connectivity from the customer on-prem network, due to having no surface area of exposure to any of the compute infrastructure. While very secure, it is more complex to establish.

There is a third configuration supported by CDP that provides some trade-offs between these two options. In this configuration, the user deploys the worker nodes of the compute infrastructure on fully private networks as described above. However, the user chooses to expose UIs or APIs of the services fronting these worker nodes over a public network load balancer. By using this capability, the data consumers can access the UIs or APIs of the compute infrastructure through these load balancers. It is also possible to restrict the IP ranges from which such access is allowed using security groups.

While this option provides a trade-off between ease of setup and exposure levels, it may not satisfy all use cases related to communication between various endpoints. For example, some compute workloads involving Kafka or NiFi would not benefit from having a simple publicly exposed NLB. It is recommended that customers evaluate their use cases against the trade-off and choose an appropriately convenient and secure model of setup.

Fully private networks

In this model, the compute infrastructure is not assigned any public IP addresses.
In this case, communication between the Control Plane and compute infrastructure is established using a 'tunnel' that originates from the customer network to the CDP Control Plane. All communication from the Control Plane to the compute nodes is then passed through this tunnel. From experience, Cloudera has determined that this is the preferred model of communication for customers.

To elaborate on the tunneling approach, Cloudera uses a solution called Cluster Connectivity Manager (CCM). At a high level, the solution uses two components, an agent (CCM Agent) that runs on a VM provisioned in the customer network and a service (CCM Service) that runs on the CDP Control Plane. The CCM Agent, at start-up time, establishes a connection with the CCM Service. This connection forms the tunnel. This tunnel is secured by asymmetric encryption. The private key is shared with the agent over cloud-specific initialization mechanisms, such as a user-data script in AWS.

When any service on the CDP Control Plane wants to send a request to a service deployed on the customer environment (depicted in this diagram as the “logical flow”), it physically sends a request to the CCM Service running in the Control Plane. The CCM Agent and Service collaborate over the established tunnel to accept the request, forward it to the appropriate service, and send a response over the tunnel to be handed over to the calling service on the Control Plane.

Currently, all EKS clusters provisioned by various CDP data services are enabled with public and private cluster endpoints even under a Fully Private Network setup (see Amazon EKS cluster endpoint access control). The EKS public endpoint is needed to facilitate the interactions between the CDP Control Plane and the EKS cluster, while worker nodes and the Kubernetes Control Plane interact over private API endpoints. There are plans to support private EKS endpoints in the future. When this occurs, the documentation will be updated to reflect the same.

Fully private outbound restricted networks

A variant of the Fully Private Network is one where customers would like to pass outbound traffic originating from their cloud account through a proxy or firewall and explicitly allow-list URLs that are allowed to pass through. This is what Cloudera refers to as the ‘Outbound Restricted’ configuration. CDP Public Cloud supports this configuration too.
In such cases, the customer must ensure the following:

- Users configure a proxy for the environment via CDP, as documented in Use a non-transparent proxy with Cloudera Data Warehouse on AWS environments for Cloudera Data Warehouse and Using a non-transparent proxy for all other compute workloads and the Data Lake itself.
- Compute resources (VMs and CDP data services) can connect to the proxy or firewall via appropriate routing rules.
- The proxy or firewall is set up to allow connections to all hosts, IP ranges, ports, and protocol types that are documented in Outbound network access destinations for AWS.
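As a rough illustration of the allow-list idea behind the Outbound Restricted configuration, the following Python sketch checks whether an outbound destination would be permitted. The function name and the entries in ALLOWED_DESTINATIONS are hypothetical placeholders, not real CDP endpoints; the authoritative list is the one documented in Outbound network access destinations for AWS.

```python
from fnmatch import fnmatchcase

# Hypothetical allow-list of (host pattern, port) pairs.
# These are placeholders for illustration, NOT real CDP endpoints.
ALLOWED_DESTINATIONS = [
    ("*.example-control-plane.test", 443),
    ("archive.example-repo.test", 443),
]


def is_outbound_allowed(host, port, allowed=ALLOWED_DESTINATIONS):
    """Return True if (host, port) matches at least one allow-list entry."""
    # Normalize the host so matching is case-insensitive and ignores a trailing dot.
    host = host.lower().rstrip(".")
    return any(
        fnmatchcase(host, pattern) and port == allowed_port
        for pattern, allowed_port in allowed
    )
```

A real proxy or firewall would enforce this at the network level; the sketch only shows the matching logic that an allow-list implies, where anything not explicitly listed is denied.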

Note:
Given that Fully Private Networks is the recommended option of connectivity in most cases, this document will describe the architecture assuming a Fully Private Network setup.
We will cover other architectural configurations like Semi-Private networks and Fully Private Outbound Restricted networks in future versions of the document.

Customer on-prem network to cloud network

After compute CDP data services are launched in the customer’s cloud network, data consumers such as data engineers, data scientists, and data analysts access services running in these CDP data services. Sometimes, CDP administrators who set up and operate these clusters might need this access to diagnose any issues the clusters face.

Examples of these include:

- Web UIs such as:
  - Hue: For running SQL queries on Hive tables
  - CML Workspaces: For accessing Machine Learning projects, models, notebooks, and so on
  - Cloudera Manager: For Data Hubs and Data Lakes
  - Atlas and Ranger: For metadata, governance, and security in the Data Lake
- JDBC endpoints: Customers can connect tools such as Tableau using a JDBC URL pointing to the Hive server.
- SSH access: Data engineers might log in to nodes on the compute CDP data services to run data processing jobs using YARN, Spark, or other data pipeline tools.
- Kube API access: CDP data services that run on Amazon EKS (such as Cloudera Data Warehouse and Cloudera Machine Learning) also provide admin access to Kubernetes for purposes of diagnosing issues.
- API access: Customers can use APIs for accessing many of the services exposed via the web UIs for purposes of automation and integration with other tools, applications, or other workloads they have. For example, CML exposes the “CML API v2” to work with Machine Learning projects and other entities.

These services are accessed by these consumers from within a corporate network inside a VPN. These services typically have endpoints that have a DNS name, the format of which is described more completely in the DNS section of this chapter. These DNS names resolve to IP addresses assigned to the nodes, or to load balancers fronting the ingress controllers of Kubernetes clusters. Note that these IP addresses are usually private IPs. Therefore, in order to connect to these IPs from the on-prem network within a VPN, some special connectivity setup is needed, typically accomplished using technologies like VPN Peering, Direct Connect, Transit Gateways, and so on. While there are many options possible here, this document describes one concrete option for achieving this connectivity.

Related Information
CML API v2

Network architecture

Cloudera recommends that customers configure their cloud networks as described in this chapter. This helps in onboarding CDP Data Lakes, Data Hubs, and data services smoothly.

Note that this network architecture only covers the “Fully Private Networks” and assumes unrestricted outbound access.

The cdpctl tool, which is released along with this document, can be used to automatically set up a model of this reference architecture, which can then be reviewed for security and compliance purposes.
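To make the point about private endpoint addresses concrete, here is a minimal Python sketch that takes already-resolved endpoint addresses and reports which ones are private, and therefore reachable only through VPN-style connectivity. The host names and addresses are invented for illustration, and the helper name is an assumption rather than part of any CDP tool.

```python
import ipaddress


def endpoints_needing_private_connectivity(resolved):
    """Given {dns_name: ip_string}, return the sorted names whose IP is private.

    Private addresses are not routable from the public internet, so clients
    outside the VPC need a VPN or a similar connectivity solution to reach them.
    """
    return sorted(
        name
        for name, ip in resolved.items()
        if ipaddress.ip_address(ip).is_private
    )


# Hypothetical endpoint names and addresses, for illustration only.
example = {
    "hue.env1.example.test": "10.10.32.15",
    "cm.env1.example.test": "10.10.64.7",
    "public-lb.example.test": "52.1.2.3",
}
private_only = endpoints_needing_private_connectivity(example)
```

In a Fully Private setup one would expect every workload endpoint to appear in this list, which is why on-prem connectivity has to be arranged before data consumers can use the services.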

Architecture diagrams

This topic includes diagrams illustrating the various elements of the network architecture in the customer’s cloud account into which CDP data services will be launched.

The following diagram illustrates the configuration for a ‘Fully Private’ network that can be configured by the customer. This configuration can be provided by the CDP admins when they are setting up CDP environments or workloads which will get launched into this configuration.

Note the following points about the architecture:

- The configuration is a ‘Fully Private’ configuration - that is, the workloads are launched into a private subnet on nodes that do not have public IP addresses.
- They connect outbound to the CDP Control Plane over a fixed IP and port range.
- For users to be able to connect from the customer on-prem network to the CDP workloads in the private subnet, some network connectivity setup is required. In this case, a customer’s VPN server peered to an AWS virtual private gateway is shown.

Some of the CDP data services are based on AWS EKS clusters. Amazon EKS manages the Kubernetes Control Plane while the worker nodes that make up the cluster get provisioned in the customer’s VPC. The EKS Control Plane has an API endpoint for administrative purposes which is commonly referred to as the "cluster endpoint". The CDP data service itself is accessible through a service endpoint ELB.

This is illustrated in the following diagram:

Component description

This topic provides an overview of the VPC, subnets, gateways and route tables, and security groups required for CDP Public Cloud for AWS.

VPC

An Amazon Virtual Private Cloud (VPC) is needed for deploying workloads into the customer’s cloud account. Cloudera recommends that the VPC used for CDP is configured with properties as specified in this topic.

- The CIDR block for the VPC should be sufficiently large for supporting all the CDP data services you intend to run. Refer to Determining the CIDR range for understanding how to compute the CIDR block range.
- The VPC properties for DNS hostnames and DNS resolution must be ENABLED. DNS resolution is needed so that Kubernetes pods can resolve external host names and to support DNS hostnames. The DNS hostnames option needs to be enabled as several CDP data services rely on EFS (see Mounting on Amazon EC2 with a DNS name). Enabling these properties is also a requirement (see Amazon EKS cluster endpoint access control) to enable private access of EKS cluster endpoints.
- VPCs are associated with a DHCP Option Set. The DHCP option set for the VPCs must be set up as described in DHCP option set.

Subnets

This topic covers recommended subnet configurations for CDP Public Cloud for AWS.

- It’s recommended to have 3 private subnets and 3 public subnets, such that each private-public subnet pair is in a different availability zone (AZ). Even if a region has two AZs instead of three, it’s recommended that three private subnets are created, two in the same AZ. This is required to prevent cross-AZ routing of traffic and to maintain quorum-based consistency required by some services.
- Note that a subnet becomes ‘private’ or ‘public’ based on the routing devices it is associated with in the route tables. This is described in “Gateways and Route Tables”.
- The private subnets are where the compute workloads will be launched by CDP. This ensures that these nodes are working in an isolated and secure environment that has no direct internet connectivity.
- The public subnet is needed to host a NAT gateway, as this allows the compute nodes to reach out to the CDP Control Plane over the internet. More on this is described in “Gateways and Route Tables”.
- The CIDR block for the subnets should be sufficiently large for supporting all the CDP data services you intend to run. Refer to “Determining the CIDR Range” for understanding how to compute the CIDR block range.
- The CIDR block for the subnets should not overlap with known “AWS EKS ranges for pods/services”. Several EKS based CDP data services in Overlay networks
- In addition, you may want to ensure that the CIDR ranges assigned to the subnets do not overlap with any of your on-prem network CIDR ranges, as this may be a requirement for setting up connectivity from your on-prem network to the subnets.
- Since Cloudera recommends the ‘Fully Private’ configuration, the ‘Auto-assign public IPs’ option must be disabled for the private subnets.
- A subnet can be associated with a Network ACL (NACL). However, since Cloudera works with the Fully Private configuration, where communication is always initiated from EC2 nodes within the subnets, a NACL is generally not useful for this configuration.
- Tag private subnets with the tag ‘kubernetes.io/role/internal-elb:1’. The key is ‘kubernetes.io/role/internal-elb’ and the value is ‘1’. Cloud Controller Manager and AWS Load Balancer Controller both require private subnets to have this tag for automatic creation of private ELBs. Private ELBs are created in these subnets by EKS. This is applicable when CDP is supporting EKS versions 1.20 (which is currently the case).

Related Information
Gateways and route tables
AWS EKS ranges for pods/services
Determining the CIDR range
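The CIDR recommendations above (subnets inside the VPC range, no overlap between subnets, no overlap with on-prem ranges) can be checked before anything is created. The following Python sketch uses only the standard ipaddress module; the function name and every CIDR value shown are illustrative assumptions, not values prescribed by CDP.

```python
import ipaddress
from itertools import combinations


def check_subnet_plan(vpc_cidr, subnet_cidrs, onprem_cidrs=()):
    """Return a list of human-readable problems found in a subnet plan."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(c) for c in subnet_cidrs]
    onprem = [ipaddress.ip_network(c) for c in onprem_cidrs]
    problems = []

    for subnet in subnets:
        # Every subnet must be carved out of the VPC CIDR block.
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside VPC {vpc}")
        # Overlap with on-prem ranges can break VPN-style connectivity.
        for network in onprem:
            if subnet.overlaps(network):
                problems.append(f"{subnet} overlaps on-prem range {network}")

    # Subnets within one VPC must not overlap each other.
    for first, second in combinations(subnets, 2):
        if first.overlaps(second):
            problems.append(f"{first} overlaps {second}")

    return problems


# Hypothetical plan: one /16 VPC with three /19 private subnets.
plan_problems = check_subnet_plan(
    "10.10.0.0/16",
    ["10.10.0.0/19", "10.10.32.0/19", "10.10.64.0/19"],
    onprem_cidrs=["192.168.0.0/16"],
)
```

An empty result means the plan passes these three checks only. It does not verify sizing for the intended data services or the EKS pod and service ranges, which still need to be reviewed against the sections referenced above.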

Gateways and route tables

This topic covers recommended gateway and route table configurations for CDP Public Cloud for AWS.

Connectivity from Control Plane to CDP workloads

- As described in “Use cases”, nodes in the CDP workloads will need to connect to the CDP Control Plane over the internet to establish a ‘tunnel’ over which the CDP Control Plane can send instructions to the workloads.
- In order to accomplish this, there are two gateways that need to be configured - a NAT Gateway in each of the public subnets and an Internet Gateway at the VPC level.
- The private subnet hosting the CDP workloads should be configured with a route table where the default route (0.0.0.0/0) points to a NAT Gateway in the public subnet of its AZ.
- The public subnet hosting the NAT Gateway should be configured with a route table where the default route (0.0.0.0/0) points to the Internet Gateway the VPC is configured with.
- Each NAT gateway requires an elastic IP address. The VPC should contain as many elastic IP addresses as NAT gateways across the AZs in the VPC.

Connectivity from customer on-prem to CDP workloads

- As described in “Use cases”, data consumers will need to access data processing or consumption services in the CDP workloads. Given these are created with private IP addresses in private subnets, the customers will need to arrange for access to these addresses from their on-prem or corporate networks in specific ways.
- There are several possible solutions for achieving this, but the one that is depicted in the Architecture diagram uses an AWS VPN Gateway service.
- In this solution, the customer has to create a Virtual Private Gateway, and connect it to the VPN service on the on-prem network.

Security groups

During the specification of a VPC to CDP, the CDP admin can also specify the security groups. These are associated with all the workloads launched within that VPC.

Security groups for Data Lakes and Data Hubs

The security groups can be specified in two ways:

- The CDP admin can let CDP create security groups, taking a list of IP address CIDRs as input. These will be used in allowing the incoming traffic to the hosts. The list of CIDR ranges should correspond to the address ranges from which the CDP data service workloads will be accessed. In a VPN-peered VPC, this would also include address ranges from the customer’s on-prem network. This model is useful for initial testing given the ease of setup.
- Alternatively, the CDP admin can create security groups on their own and select them during the setup of the VPC and other network configuration. This model is better for production workloads, as it allows for greater control in the hands of the CDP admin. However, note that the CDP admin MUST ensure that the rules match this specification.

For a fully private network, security groups should be configured according to the types of access requirements needed by the different services in the workload
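As an illustration of the first model, where a list of CIDR ranges is turned into ingress rules, the following Python sketch builds plain rule dictionaries. The field names mirror the shape of the EC2 IpPermissions structure, but nothing is sent to AWS here. The function name and the default ports (443 and 22) are assumptions chosen for the example and do not represent the port list that CDP requires.

```python
import ipaddress


def build_ingress_rules(allowed_cidrs, ports=(443, 22)):
    """Build one TCP ingress rule per port, each allowing the given CIDR ranges.

    The ports are hypothetical defaults for illustration only.
    """
    ranges = []
    for cidr in allowed_cidrs:
        network = ipaddress.ip_network(cidr)  # raises ValueError on a malformed CIDR
        if network.prefixlen == 0:
            # 0.0.0.0/0 would defeat the purpose of a fully private setup.
            raise ValueError("refusing to allow access from the whole internet")
        ranges.append({"CidrIp": str(network)})

    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": list(ranges),
        }
        for port in ports
    ]


# Hypothetical ranges: the VPC itself plus an on-prem network reached over VPN.
rules = build_ingress_rules(["10.10.0.0/16", "192.168.0.0/16"])
```

Rejecting a zero-length prefix is a design choice of this sketch: in a VPN-peered VPC the allowed ranges should be the VPC and on-prem address ranges, so an open rule is more likely a mistake than an intent.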
