Talend Databricks

Transcription

Talend Databricks, June 2020

Databricks Pillars

- Serverless Spark clusters with autoscaling
- Transient or interactive clusters
- Data-scientist-friendly notebooks for SQL, Scala, Spark, Python
- Embedded support for libraries on Spark: TensorFlow, PyTorch, scikit-learn

VALUE OF SERVERLESS BIG DATA
- No server to provision/manage: serverless brings ease of administration
- Scale with usage: get elasticity at runtime, scale up/down on your needs
- Save when idling: never pay for idle, servers get auto-terminated

Databricks Unified Analytics Platform
- Databricks Workspace: collaborative notebooks, production jobs
- Databricks Runtime: ML frameworks, transactions, indexing
- Databricks Cloud Service

Talend Databricks Spark Support
Studio:
- Serverless Spark on AWS & Azure
- Interactive & transient auto-scale clusters
- Pools
- Triggering notebooks with the API
- DBFS components
Pipeline Designer:
- AWS & Azure
- Interactive & transient auto-scale clusters
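Triggering a notebook job from outside the workspace goes through the Databricks Jobs REST API (`run-now` endpoint). A minimal sketch of building such a request with the standard library, assuming a placeholder workspace URL, token, and job ID (the request is only constructed here, not sent):

```python
import json
import urllib.request

def build_run_now_request(host, token, job_id, notebook_params):
    """Build (but do not send) a Databricks Jobs API run-now request."""
    payload = json.dumps({"job_id": job_id,
                          "notebook_params": notebook_params}).encode()
    return urllib.request.Request(
        url=f"{host}/api/2.0/jobs/run-now",
        data=payload,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Placeholder host/token/job_id for illustration only:
req = build_run_now_request("https://example.cloud.databricks.com",
                            "dapi-XXXX", 42, {"run_date": "2020-06-01"})
print(req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would start the job and return its `run_id`.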

A New Standard for Building Data Lakes
- Open format based on Parquet
- With transactions
- Apache Spark APIs

Why Parquet?
- Columnar storage suitable for wide tables and analytics
- Optimized for write-once, read-many workloads
- Saves space: columnar layout compresses better
- Enables better scans: loads only the columns that need to be scanned (advantage: projection & predicate pushdown)
- Schema evolution: partially supported
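The projection advantage above can be illustrated with a toy in-memory model (not Parquet's actual encoding): in a columnar layout, selecting one column touches only that column's values, and same-typed columns compress well, e.g. with delta encoding:

```python
# Row-oriented layout: each record stores every field
rows = [
    {"title": "Heat", "year": 1995, "rating": 8.3},
    {"title": "Ran", "year": 1985, "rating": 8.2},
    {"title": "Up", "year": 2009, "rating": 8.3},
]

# Column-oriented layout: one contiguous list per column
columns = {
    "title": ["Heat", "Ran", "Up"],
    "year": [1995, 1985, 2009],
    "rating": [8.3, 8.2, 8.3],
}

# Projection (SELECT year): the row layout must visit every record...
years_from_rows = [r["year"] for r in rows]
# ...while the columnar layout returns the column directly:
years_from_columns = columns["year"]
assert years_from_rows == years_from_columns

# A single-typed column also encodes cheaply, e.g. delta encoding
# on sorted values stores small differences instead of full values:
sorted_years = sorted(columns["year"])  # [1985, 1995, 2009]
deltas = [sorted_years[0]] + [b - a for a, b in
                              zip(sorted_years, sorted_years[1:])]
print(deltas)  # -> [1985, 10, 14]
```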

Parquet – pros and cons
PROS (vs text):
- Good for distributed processing
- Great for wide tables
- Predicate and projection pushdown (basic behavior: partitioning)
- Good compression
CONS:
- Updates/delta require a full read/write – expensive
- Limited schema evolution (append columns only)
- Historization/versioning requires manual partitioning
- Query optimization depends on folder structure / partitioning / file resizing

Delta Lake ensures data reliability
- Batch and streaming writes of Parquet files, with updates/deletes, yield high-quality & reliable data always ready for analytics
Key features:
- Transactional log, ACID transactions
- Schema enforcement
- Unified batch & streaming
- Time travel / data snapshots
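The transaction-log idea behind versioning and time travel can be sketched with a toy class (illustrative only; Delta's real log records JSON actions over Parquet files, not full snapshots). Every commit gets a version number, and a reader can ask for the table as of any earlier version:

```python
class ToyDeltaTable:
    """Toy sketch of a versioned table with an append-only commit log."""

    def __init__(self):
        self.log = []  # append-only list of committed snapshots

    def commit(self, records):
        # Each commit appends a snapshot; its version is its log index
        self.log.append(list(records))
        return len(self.log) - 1

    def read(self, version_as_of=None):
        # Time travel: read the table as of an earlier version
        if version_as_of is None:
            version_as_of = len(self.log) - 1
        return self.log[version_as_of]

t = ToyDeltaTable()
v0 = t.commit([{"movie": "Heat", "rating": 8.3}])
v1 = t.commit([{"movie": "Heat", "rating": 8.3},
               {"movie": "Ran", "rating": 8.2}])
print(len(t.read()))                   # latest version: 2 rows
print(len(t.read(version_as_of=v0)))   # time travel: 1 row
```

In real Delta Lake the same read is expressed in Spark, e.g. with the `versionAsOf` read option on the `delta` format.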

Delta Lake optimizes performance
- The Databricks optimized engine over Parquet files and the transactional log delivers highly performant queries at scale
Key features:
- Indexing
- Compaction
- Data skipping
- Caching
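Data skipping works by keeping per-file column statistics and pruning files whose value range cannot match the predicate. A minimal sketch, assuming hypothetical file names and min/max year statistics:

```python
# File name -> (min_year, max_year) statistics, as a skipping index
# would record them per data file (toy illustration).
files = {
    "part-000.parquet": (1980, 1989),
    "part-001.parquet": (1990, 1999),
    "part-002.parquet": (2000, 2009),
}

def files_to_scan(predicate_min, predicate_max):
    """Return only files whose [min, max] range overlaps the predicate."""
    return [name for name, (lo, hi) in files.items()
            if hi >= predicate_min and lo <= predicate_max]

# WHERE year BETWEEN 1995 AND 1997 -> only one of three files is read
print(files_to_scan(1995, 1997))  # -> ['part-001.parquet']
```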

DATABRICKS DELTA
- Supports conditional updates (ELT)

Talend Databricks Delta Lake Support
Studio:
- Delta files input & output in Spark batch
- Integration with the time travel capability of Delta Lake
- Partitioning options
- Delta tables: create / append / merge
Stitch Data Loader:
- Destination on AWS

Why Talend & Databricks?
Data Scientist
- Likes: Python, neural networks, NLP
- Wants to: build accurate ML models
- 80% of the time: finding, cleaning, and reorganizing huge amounts of data

Data Engineer
- Talks about: Kafka, Parquet & Kubernetes
- Wants to: build repeatable, durable data pipelines to integrate data and provide data quality

Movies data set demo

Talend & Spark on Databricks
- Get data scientists to insights faster

Movies Data Analytics – Demo Use Case Design
- Sources: movies, ratings, credits on S3 / ADLS / Blob / DBFS
- Steps: cleanse/parse, join/transform, aggregate/filter
- Target: S3 / ADLS / Blob / DBFS
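The pipeline steps above can be sketched on in-memory toy records (hypothetical movie/rating data; the actual demo runs the same steps as Spark jobs designed in Talend and executed on Databricks):

```python
# Toy movie and rating records standing in for the demo's source files
movies = [{"movie_id": 1, "title": " Heat "},
          {"movie_id": 2, "title": "Ran"}]
ratings = [{"movie_id": 1, "rating": 8.0},
           {"movie_id": 1, "rating": 8.6},
           {"movie_id": 2, "rating": 8.2}]

# cleanse/parse: trim whitespace from titles
movies = [{**m, "title": m["title"].strip()} for m in movies]

# join/transform: attach the movie title to each rating
titles = {m["movie_id"]: m["title"] for m in movies}
joined = [{**r, "title": titles[r["movie_id"]]} for r in ratings]

# aggregate/filter: average rating per title, keep averages above 8.1
sums = {}
for r in joined:
    total, count = sums.get(r["title"], (0.0, 0))
    sums[r["title"]] = (total + r["rating"], count + 1)
avg = {title: total / count for title, (total, count) in sums.items()}
top = {title: a for title, a in avg.items() if a > 8.1}
print(top)
```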

Takeaway
- Talend with Databricks helps the cooperation between data engineers and data scientists
- Serverless Spark support for Studio and Pipeline Designer
- Delta Lake integration: table updates, transactions support, time travel, query optimization