Talend Data Fabric


Talend Data Fabric Security Architecture Overview

Contents

Summary
Talend architecture
    Here is an overview of Talend’s functional architecture.
    Talend Management Console
    Talend Data Inventory
    Talend Data Preparation
    Talend Data Stewardship
    Talend API Designer
    Talend API Tester
    Talend Pipeline Designer
Hybrid infrastructure
Data Fabric infrastructure
    Computation resources
        Talend Management Console and Talend Pipeline Designer
        Talend Data Preparation
    Data storage
        Data that we collect
        Data that customers use with Talend Data Fabric
    Network
Data flows
    Data flows between Talend Studio and Talend Data Fabric
        Metadata is transferred to Talend Cloud via the following URLs
        API designs are retrieved using the following secured endpoints
        Talend Studio defaults to uploads of Talend Jobs using the following pre-signed URLs
    Data flows between Talend Studio jobs and Talend Data Fabric
    Data flows between Remote Engine and Talend Data Fabric
        MSG service URL
        Repository service URL
        Pair service URL
        DTS service URL
        Remote Engine service URL
        Vault gateway service URL
    Data flows in hybrid deployment between Talend Data Inventory, Talend Data Preparation, Talend Data Stewardship, Talend Data Quality, and Talend Data Fabric
    Public APIs
Security at Talend
    Physical security
    Security training
    Secure software development
    Cloud workload protection and monitoring
    Authentication, authorization, and access control
        Standard access
        Administrative access
        Password management
    Key management
        On AWS
        On Azure
    Vulnerability management
    Backups
    Disaster recovery and business continuity
    Security certifications

Summary

Talend Data Fabric is a managed cloud integration platform that makes it easy for developers and data constituents to collect, transform, and clean data. Talend leverages security and privacy best practices to protect both the Talend platform and Talend, the company. Talend implements a combination of policies, procedures, and technologies to ensure your data is protected and secured. Talend’s chief information security officer (CISO) defines the Talend security strategy, architecture, and program. This document provides an overview of the Talend internal architecture and our policies and procedures as they pertain to employee, physical, network, infrastructure, platform, architecture, and data security.

Talend is SOC 2 Type 2 and HIPAA (Health Insurance Portability and Accountability Act) certified.

Talend architecture

Talend Data Fabric is a multi-tenant platform. All managed components are hosted on either Amazon Web Services (AWS) or Microsoft Azure, according to customer preference.

Talend Data Fabric comprises seven applications:
- Talend Management Console
- Talend Data Inventory
- Talend Data Preparation
- Talend Data Stewardship
- Talend API Designer
- Talend API Tester
- Talend Pipeline Designer

Additionally, Talend Studio, which runs on a local workstation, allows users to design data integration flows (or Talend Jobs) and publish them to Talend Data Fabric.

Here is an overview of Talend’s functional architecture.

Figure 1: Talend functional architecture

The table below summarizes which applications are available or can be installed where. All Talend Data Fabric applications are available on AWS. Some have been released on Azure. Some components can optionally be installed in a hybrid configuration, residing on customer infrastructure. Please refer to the Hybrid Infrastructure section below for more details about hybrid configurations.

Component                    Amazon Web Services   Azure   Hybrid installation
Talend Management Console    Yes                   Yes     N/A
Talend Data Inventory        Yes                   Yes     N/A
Talend Data Preparation      Yes                   -       Yes
Talend Data Stewardship      Yes                   Yes     Yes
Talend API Designer          Yes                   -       N/A
Talend API Tester            -                     -       Yes
Talend Pipeline Designer     Yes                   Yes     N/A

Each of the following sections briefly describes a Talend Data Fabric application and gives an overview of its functional architecture. Please refer to our website at www.talend.com for more details about each application and terms used throughout the document.

Talend Management Console

Talend Management Console (TMC) is a browser-based application that provides access to all Talend Data Fabric applications and components as well as the administrative features and configurations that surround them.

TMC lets users schedule the execution of Talend Jobs via discrete components called execution engines. There are two types of engines:
- Cloud Engines are fully managed components that are provisioned, deployed, and controlled by Talend within our platform. Cloud Engines do not share jobs from multiple tenants; they are provisioned at execution time (per job schedules), per tenant.
- Remote Engines are execution agents deployed and managed by customers on their own systems, within their own physical or virtual (cloud) networks.

Talend Data Inventory

Talend Data Inventory (TDI) provides automated tools for dataset documentation, quality proofing, and promotion. It identifies data silos across data sources and targets to provide visualization of reusable and shareable data assets.

Figure 2: Talend Data Inventory functional architecture

Talend Data Preparation

Talend Data Preparation (TDP) allows customers to simplify and speed up the process of preparing data for analysis and other tasks. TDP allows customers to create, update, remove, and share datasets, then create preparations on top of the datasets that can be incorporated into Talend Jobs with Talend Studio.

Figure 3: Talend Data Preparation functional architecture

Figure 4: Talend Data Preparation functional architecture in hybrid deployment

Talend Data Stewardship

Talend Data Stewardship (TDS) allows customers to collaboratively curate, validate, and resolve conflicts in data, as well as address potential data integrity issues.

Figure 5: Talend Data Stewardship functional architecture

Figure 6: Talend Data Stewardship functional architecture in hybrid deployment

Talend API Designer

Talend API Designer lets users design APIs collaboratively and visually, then run simulations to test APIs and generate reference documentation.

Talend API Tester

Talend API Tester lets users automatically generate test cases from API contracts, then field test APIs by grouping tests together that simulate real-world examples. Users can integrate unit tests into a managed CI/CD process to ensure quality.

Figure 7: Talend API Services functional architecture

Talend Pipeline Designer

Talend Pipeline Designer (TPD) allows customers to design and run data pipelines in the cloud. A data pipeline is a data integration process: a series of transformation steps applied to data. It extracts data from customer-specified sources, transforms it step by step using prebuilt processors, and loads it into other datasets (destinations). Data pipelines can be started directly from TPD or scheduled in Talend Management Console. Data pipelines can be executed on Cloud Engines or Remote Engines.

Figure 8: Talend Pipeline Designer functional architecture

Hybrid infrastructure

Some organizations use Talend in a hybrid configuration, with some components running on local devices and others running on cloud platforms. The only required component for running Talend in a hybrid environment is the Talend Studio development environment, which is installed on local workstations and offers similar functionality to the cloud-native Talend Pipeline Designer. Users may install additional applications or components in a hybrid configuration:
- Talend Cloud API Tester: a web browser extension.
- Remote Engine: a Java-based runtime to execute Talend Jobs on-premises or on a cloud platform that you control. If you do not install Remote Engine, you will use Cloud Engine.
- Remote Engine Gen2: a Docker-based runtime to execute Talend Pipeline Designer data pipelines on-premises or on a cloud platform that you control. If you do not install Remote Engine Gen2, you will use Cloud Engine for Design.

Data Fabric infrastructure

Talend Data Fabric is a multitenant integration environment that allows you to design, manage, and check data pipelines. All managed components are hosted on either Amazon Web Services (AWS) or Microsoft Azure, according to customer preference.

Secrets such as passwords, keys, and certificates are managed via third-party technologies and products. We go into more detail about this in the Key Management section below.

Computation resources

Talend Management Console, Talend Data Preparation, and Talend Pipeline Designer are the only Data Fabric applications that give separate computation resources to each tenant. Each is a multitenant application that is hosted and runs on AWS or (except for Talend Data Preparation) Azure.

Talend Management Console and Talend Pipeline Designer

Remote Engines are deployed by customers on their own systems and therefore given computation resources that they manage and control.

Cloud Engines are deployed within Talend Data Fabric as separate tenant-specific AWS EC2 or Azure VM instances and never shared with other tenants. Each tenant gets its own Cloud Engine instance on AWS or Azure.

The live preview feature of Talend Pipeline Designer, which allows users to preview the output of processors while designing a pipeline, is executed in a dedicated Remote Engine or Cloud Engine.

Talend Data Preparation

Data Preparation process computations are isolated in separate threads for each tenant. Tenants can choose where the computation results are stored:
1. In an AWS S3 bucket that the tenant manages. AWS S3 credentials are not stored in Talend.
2. In Talend. In this case, results are stored in the bucket/folder specified by the Configuration service, an internal Talend service.

Data storage

Talend works with two general types of data: data that we collect and data that customers use with the software.

Data that we collect

Talend, across its cloud applications, collects only customer information that it needs to provide its services or to manage customer accounts.

All personally identifiable information that we collect (e.g. name, country, and email address) is protected with best security practices: It is encrypted at rest via AES-256 and in transit via TLS 1.3.

No payment information is stored in Talend Data Fabric. We rely on third-party vendors to collect and manage payment information.

Data that customers use with Talend Data Fabric

Whether customers use Remote Engines or Cloud Engines, their datasets remain on systems and data repositories that they manage. Metadata, Designs, Talend Jobs, Artifacts, and any other objects that Talend stores to provide services or for security reasons are isolated via tenant-specific schemas and tenant-specific data encryption keys.

Network

To function properly and deliver its services, Talend Data Fabric may need to communicate with external third-party solutions. All communications between Talend Data Fabric and such external solutions need to be authorized and initiated by Talend Data Fabric. No external solution can communicate with Talend Data Fabric unless the communication was initiated by Talend Data Fabric.

Talend networks and systems are protected via network and application firewalling, visibility mechanisms, and micro-segmentation strategies.
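The TLS 1.3 requirement stated above for personally identifiable information in transit can be illustrated from the client side. The following is a minimal sketch using Python's standard `ssl` module, not Talend's implementation; the function name `make_tls13_client_context` is hypothetical. It builds a client context that verifies certificates and refuses any protocol version older than TLS 1.3.

```python
import ssl


def make_tls13_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that refuses anything older than TLS 1.3."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Reject TLS 1.2 and earlier during the handshake.
    context.minimum_version = ssl.TLSVersion.TLSv1_3
    return context


context = make_tls13_client_context()
```

A context like this could be passed to an HTTPS client (for example `urllib.request.urlopen(url, context=context)`), so that a connection to a server offering only older protocol versions fails instead of silently downgrading.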

Data flows

This section gives an overview of the data flows between Talend Data Fabric applications and components.

Data flows between Talend Studio and Talend Data Fabric

Figure 9: Talend data flows when using Cloud Engines. Diagram elements: Talend Studio, Talend Data Fabric, Cloud Engine, firewalls, cloud files, data warehouse, SaaS apps, on-premises apps and databases. Legend: status and logs (HTTPS), metadata in transit (HTTPS), customer data in transit.

The types of data that can be exchanged between Talend Studio and Talend Data Fabric include:
a) Task artifact binaries
b) Task artifact metadata (e.g. context variables and parameters)
c) Talend API Designer definitions

Users’ credentials (e.g. login name and password or API token generated in TMC) are required to authorize the transfer.

Metadata is transferred to Talend Cloud via the following URLs:

Table (columns: Cloud, Region, Talend Inventory service; clouds listed: AWS, Azure; region listed: US; URL fragment: tps://tmc.us.cloud.talend.com/inventory)

API designs are retrieved using the following secured endpoints:

Table (columns: Cloud, Region, API Design service; clouds listed: AWS, Azure; URL fragment: rojects/{projectId})

Talend Studio defaults to uploads of Talend Jobs using the following pre-signed URLs:

Table (columns: Cloud, Region, S3 pre-signed; AWS regions listed: US, Europe, Asia-Pacific; Azure region listed: US; URL fragment: lend.com)

Data flows between Talend Studio jobs and Talend Data Fabric

Talend Studio has three components that can communicate with Talend Data Preparation on Talend Data Fabric.
1. tDatasetInput: Calls Talend to retrieve the content of a dataset.
2. tDatasetOutput: Calls Talend to post the content of a dataset.
3. tDataprepRun (DI and Spark):
   - DI component: Calls Talend to list preparations, and at runtime to retrieve each row that needs to be prepared.
   - Spark component: Calls Talend to list preparations, and at runtime to retrieve the chosen preparation steps, lookup datasets, and semantic types.

Data flows between Remote Engine and Talend Data Fabric

Figure 10: Talend data flows when using Remote Engines. Diagram elements: Talend Studio, Talend Data Fabric, Remote Engine, firewalls, cloud files, data warehouse, SaaS apps, on-premises apps and databases. Legend: status and logs (HTTPS), metadata in transit (HTTPS), customer data in transit.

As mentioned earlier, Talend never initiates connections with Remote Engines. Remote Engines must initiate all connections to Talend. Once a connection is established, all data is sent encrypted over HTTPS.

Here are the types of data that can be exchanged between Remote Engines and Talend:
a) Status information and metrics
b) Lifecycle commands
c) Task artifact metadata
d) Logs
e) Task artifact binaries

The next sections discuss each data type in terms of the REST service URLs being targeted and the corresponding systems behind them. There are URLs for the US, Europe, and Asia-Pacific regions.

MSG service URL

This is the service URL of the primary gateway to Talend’s ActiveMQ cluster. Data of types a) to d) in the list above are sent on this service channel. Remote Engine status information and lifecycle commands are the first data sent over the wire. This path is a control path to schedule flow deployments and capture execution status (success, fail). Other information transferred includes the number of rows successfully processed or rejected, as well as the final success message.

Table (columns: Cloud, Region, Msg service; clouds listed: AWS, Azure; URL fragment: -west.cloud.talend.com)
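The rule that Remote Engines initiate every connection can be sketched as a pull model. The example below is an in-process illustration under stated assumptions: `CloudGateway`, `RemoteEngine`, and the command strings are hypothetical names, and plain polling stands in for the ActiveMQ-based channel the text describes. The point it shows is only the direction of control: the cloud side parks lifecycle commands, and they are delivered solely as the response to an engine-initiated request that also carries the engine status.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class CloudGateway:
    """Hypothetical stand-in for the cloud side: it only queues and answers."""
    pending: dict = field(default_factory=lambda: defaultdict(deque))
    statuses: list = field(default_factory=list)

    def queue_command(self, engine_id: str, command: str) -> None:
        # The cloud never pushes; it parks the command until the engine asks.
        self.pending[engine_id].append(command)

    def handle_poll(self, engine_id: str, status: str) -> list:
        # Runs only as the response to an engine-initiated request.
        self.statuses.append((engine_id, status))
        commands = list(self.pending[engine_id])
        self.pending[engine_id].clear()
        return commands


@dataclass
class RemoteEngine:
    """Hypothetical engine that opens every exchange itself."""
    engine_id: str
    gateway: CloudGateway
    executed: list = field(default_factory=list)

    def poll_once(self) -> int:
        # Report status and pull any waiting lifecycle commands in one exchange.
        commands = self.gateway.handle_poll(self.engine_id, status="available")
        self.executed.extend(commands)
        return len(commands)


gateway = CloudGateway()
engine = RemoteEngine("engine-1", gateway)
gateway.queue_command("engine-1", "deploy task-42")
received = engine.poll_once()
```

Because the gateway has no method that reaches out to an engine, a firewall in front of the engine only needs to allow outbound HTTPS in a design of this shape.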

Repository service URL

This is the service URL of the primary access point for Talend Job and action binaries. This REST service provides access to Nexus repositories, which are only accessible via HTTPS and unique Nexus credentials, which are created during Remote Engine pairing.

Table (columns: Cloud, Region, Repo service; clouds listed: AWS, Azure; URL fragment: o.us-west.cloud.talend.com)

Pair service URL

This is the service URL used during initial pairing of the Remote Engine to its account. It is used to send the heartbeat, availability, and status of the engine itself. It is only accessible via HTTPS.

Table (columns: Cloud, Region, DTS service; clouds listed: AWS, Azure; URL fragment: -west.cloud.talend.com)

DTS service URL

This is the service URL of the Talend token generation service. It is used to create one-time, time-limited tokens to authorize file uploads from the Remote Engine to Talend. The file uploads are HTTPS POSTs with logs or resource files.

Table (columns: Cloud, Region, DTS service; clouds listed: AWS, Azure; URL fragment: -west.cloud.talend.com)
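The idea of a one-time, time-limited upload token can be sketched as follows. This is an illustration of the general technique only, not the DTS implementation: the class `OneTimeTokenService`, its in-memory storage, and the 60-second lifetime are all assumptions. A token is accepted at most once, and only before it expires.

```python
import secrets
import time
from typing import Dict, Optional


class OneTimeTokenService:
    """Hypothetical in-memory issuer of single-use, expiring upload tokens."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl_seconds = ttl_seconds  # assumed lifetime, for illustration
        self._expiry_by_token: Dict[str, float] = {}

    def issue(self, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)  # unguessable random value
        self._expiry_by_token[token] = now + self.ttl_seconds
        return token

    def redeem(self, token: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # pop() removes the token, so a second redeem of the same value fails.
        expiry = self._expiry_by_token.pop(token, None)
        return expiry is not None and now <= expiry


service = OneTimeTokenService(ttl_seconds=60.0)
fresh = service.issue(now=1000.0)
first_use = service.redeem(fresh, now=1030.0)    # within lifetime, first use
second_use = service.redeem(fresh, now=1031.0)   # same token presented again
stale = service.issue(now=1000.0)
late_use = service.redeem(stale, now=1061.0)     # presented after expiry
```

Passing `now` explicitly keeps the example deterministic; a real service would rely on its own clock and on shared, durable storage rather than a dictionary.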

Remote Engine service URL

This is the service URL of the Talend token generation service. It is used to create one-time, time-limited tokens to authorize file uploads from the Remote Engine to Talend. The file uploads are HTTPS POSTs with logs or resource files.

Table (columns: Cloud, Region, Remote Engine service; clouds listed: AWS, Azure; region listed: US; URL fragments: m, engine.us-west.cloud.talend.com)

Vault gateway service URL

This service is used with Remote Engine Gen2 to decrypt each tenant’s sensitive data.

Table (columns: Cloud, Region, Vault Gateway service; clouds listed: AWS, Azure; URL fragment: d.talend.com)

Data flows in hybrid deployment between Talend Data Inventory, Talend Data Preparation, Talend Data Stewardship, Talend Data Quality, and Talend Data Fabric

Guiding principle: Talend applications and components initiate HTTPS connections. Talend Data Fabric never initiates any connection to these applications.

Here are the types of data that can be exchanged between these applications and Talend Data Fabric:
a) During user login: The client ID and client secret (as defined in the OIDC specification) of the installed application are used to authorize its communication with Talend Data Fabric.
b) After user login: A JSON Web Token (JWT) that represents the user’s identity, metadata, and claims is transferred back to the application.

Public APIs

In addition to the data flows between Talend applications, Talend exposes public APIs that let developers automate workflows. These APIs are secured with Personal Access Tokens generated with TMC.
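To make concrete what a JSON Web Token carrying a user's identity and claims looks like, here is a small sketch that reads the payload of a JWT using only the Python standard library. The sample token, its claim names (`sub`, `tenant`), and the helper names are hypothetical and are not taken from Talend Data Fabric. The sketch deliberately does not verify the signature, so it is suitable for inspection only; an application must validate the signature and standard claims before trusting a token.

```python
import base64
import json


def b64url_encode(raw: bytes) -> str:
    """Base64url-encode without padding, as JWT segments are."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the stripped padding, then base64url-decode."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def read_unverified_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT checking its signature."""
    _header_b64, payload_b64, _signature_b64 = token.split(".")
    return json.loads(b64url_decode(payload_b64))


# Hypothetical unsigned sample token; claim names and values are illustrative.
sample_header = b64url_encode(json.dumps({"alg": "none", "typ": "JWT"}).encode())
sample_payload = b64url_encode(json.dumps({"sub": "user-123", "tenant": "acme"}).encode())
sample_token = f"{sample_header}.{sample_payload}."

claims = read_unverified_claims(sample_token)
```

A token has three dot-separated segments (header, payload, signature); the sample leaves the signature segment empty, which is why it ends with a trailing dot.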

Security at Talend

Talend’s security organization consists of a dedicated team of security experts distributed across the company who work closely with the Talend CISO. Their mission is to protect Talend and its clients with security best practices. This team supports all aspects of Talend business, including Talend development and operations. The responsibility of Talend security rolls up to the CISO, who also defines Talend security strategy, architecture, and program.

Physical security

Talend maintains security controls to prevent unauthorized physical access to buildings and data centers and to protect its systems and software, and by extension the Talend environment, from damage, interruption, misuse, or theft.

Authorizations are reviewed regularly, and access is monitored continuously.

Security training

All Talend employees are trained on security best practices. All Talend employees involved in the Talend development lifecycle, from creation to deployment and operation, are guided through trainings, reviews, and drills.

Secure software development

Talend’s security organization is involved throughout the creation of any new application, capability, or feature. Our security experts conduct architecture, design, and code reviews.

Software composition analysis (SCA) and static application security testing (SAST) scans are integrated in the software development lifecycle.

Talend implements an Open Web Application Security Project (OWASP) Top 10 awareness program during application development, and schedules regular internal and external audits to assess compliance with OWASP best practices.

Cloud workload protection and monitoring

We use a combination of security services from third-party vendors to protect Talend Data Fabric.

Our security experts use external scanning tools to ensure that systems and containers are hardened, configured, and patched according to Talend guidelines and best practices.

Our deployments leverage the built-in segmentation capabilities of AWS EC2 Security groups or Microsoft Azure Network Security groups to restrict inter-resource communication.

We use web application firewalls to inspect north/south and east/west traffic flows to our applications.

Our SOC monitors all security-relevant events captured in our SIEM.

We leverage the built-in threat detection capabilities of AWS GuardDuty and Azure Advanced Threat Protection to detect malicious activity and unauthorized behavior.

Authentication, authorization, and access control

Standard access

Tenant users are authenticated with their own unique credentials: username plus password.

Talend issues X.509 public key certificates, which must be used to secure and encrypt all communications between user systems and Talend Data Fabric. Talend Data Fabric supports HTTPS over TLS.

The authentication process follows the OpenID Connect standard and uses either the authorization code or the implicit flow. Once connected, a session is managed using either cookies or a JWT.

Talend never accesses users’ credentials. Starting in 2020, Talend is progressively introducing a new identity manager based on the Auth0 platform. Auth0 is a third-party service provider that complies with Talend Security standards and certifications. This migration is part of a global security strategy to better enable Talend to concentrate on our core domain by working with trusted third-party security vendors.

Within each operational region, Talend pairs with dedicated Auth0 private instances, ensuring best performance and compliance with local data sovereignty laws.

Administrative access

Talend Data Fabric administrative access requires management review and approval. Elevated privilege access requires the same level of approval by management.

Access to any management console, Talend Data Fabric, AWS, or Azure requires multifactor authentication (credentials plus secret keys).

Access to the AWS console is restricted to select members of the Talend Site Reliability Engineering (SRE) or Information Security teams. New account creation follows a strict approval process. Accounts are reviewed quarterly.

System access is provided via SSH private keys. Public keys are automatically deployed with the Talend configuration management tool.

Password management

Talend maintains a password management policy that all employees must comply with. It ensures the creation of strong passwords, the protection of those passwords, and a reasonable frequency of password change.

All system-level passwords (e.g., root, enable, application administration accounts, etc.) must be changed on at least a quarterly basis.

All production system-level passwords must be part of the Talend IT administered secrets server.

All user-level passwords (e.g., email, web, desktop computer, etc.) must be changed at least every three months.
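A rotation rule such as "changed at least every three months" can be expressed as a simple age check. The sketch below is illustrative only and does not describe Talend's tooling: the function name `rotation_due` is hypothetical, and approximating three months as 90 days is an assumption.

```python
from datetime import date, timedelta

# Assumption: "three months" is approximated as 90 days for this sketch.
MAX_PASSWORD_AGE = timedelta(days=90)


def rotation_due(last_changed: date, today: date,
                 max_age: timedelta = MAX_PASSWORD_AGE) -> bool:
    """Return True when the password has reached or exceeded the allowed age."""
    return today - last_changed >= max_age


due_on_day_90 = rotation_due(date(2024, 1, 1), date(2024, 3, 31))
due_on_day_89 = rotation_due(date(2024, 1, 1), date(2024, 3, 30))
```

A check like this could run against the last-changed timestamps held in a secrets server to flag accounts that are approaching or past the limit.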

Vulnerability management

All applications are tested by Talend’s security experts (dynamic application security testing (DAST) and penetration tests) at least twice a year.

In addition, Talend leverages internal and third-party security services to perform external penetration tests.

Key management

Currently Talend Key Management is different on AWS and Azure. Soon keys for Talend on AWS will be managed like keys for Talend on Azure.

On AWS

Talend relies on AWS-managed Customer Master Keys (CMK) for encryption. Talend uses its own AWS CMK to generate unique Data Encryption Keys (DEK).

Most DEKs are tenant-specific and are managed (including rotation) by Talend. DEKs that do not need to be tenant-specific are managed via the AWS Encryption SDK.

Front-end TLS endpoints are managed through the AWS Certificate Manager (ACM). The private key is generated by Talend and the associated certificate signed by Talend’s approved Certificate Authority (CA), GoDaddy. The certificates are then published as part of the Certificate Transpare
