Transcription
Talend Databricks, June 2020
Databricks Pillars
- Serverless Spark clusters with autoscaling
- Transient or interactive clusters
- Data-scientist-friendly notebooks for SQL, Scala, Spark, Python
- Embedded support for libraries on Spark: TensorFlow, PyTorch, scikit-learn
Value of Serverless Big Data
- No servers to provision/manage: serverless brings ease of administration
- Scale with usage: get elasticity at runtime, scale up/down as your needs change
- Save when idling: never pay for idle, servers get auto-terminated
Databricks Unified Analytics Platform
- Databricks Workspace: collaborative notebooks, production jobs
- Databricks Runtime: ML frameworks, transactions, indexing
- Databricks Cloud Service
Talend Databricks Spark Support
Studio:
- Serverless Spark on AWS & Azure
- Interactive & transient autoscaling clusters
- Pools
- Triggering notebooks with the API
- DBFS components
Pipeline Designer:
- AWS & Azure
- Interactive & transient autoscaling clusters
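Triggering a notebook with the API boils down to a POST against the Databricks Jobs `runs/submit` endpoint. A minimal stdlib-only sketch follows; the workspace URL, token, cluster ID, and notebook path are placeholders you would replace with your own values.

```python
import json
import urllib.request

# Placeholder workspace URL and token -- substitute your own.
DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
API_TOKEN = "dapi-REDACTED"

def build_notebook_run_payload(notebook_path, cluster_id, params=None):
    """Build the JSON body for the Jobs 'runs/submit' endpoint, which
    triggers a one-time notebook run on an existing cluster."""
    return {
        "run_name": "talend-triggered-run",
        "existing_cluster_id": cluster_id,
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": params or {},
        },
    }

def submit_run(payload):
    """POST the run to the Databricks Jobs API (network call)."""
    req = urllib.request.Request(
        f"{DATABRICKS_HOST}/api/2.0/jobs/runs/submit",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {API_TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # contains the run_id of the new run

payload = build_notebook_run_payload(
    "/Shared/cleanse_movies",            # hypothetical notebook path
    "0923-164208-meows279",              # hypothetical cluster id
    {"input": "dbfs:/raw/movies"},
)
# submit_run(payload) would start the run and return its run_id.
```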
A New Standard for Building Data Lakes
- Open format based on Parquet
- With transactions
- Apache Spark APIs
Why Parquet?
- Columnar storage, suited for wide tables and analytics
- Optimized for write-once, read-many workloads
- Saves space: the columnar layout compresses better
- Enables better scans: loads only the columns that need to be scanned (advantage: projection & predicate pushdown)
- Schema evolution: partially supported
Parquet: pros and cons
Pros (vs. text):
- Good for distributed processing
- Great for wide tables
- Predicate and projection pushdown (basic behavior with partitioning)
- Good compression
Cons:
- Updates/deltas require a full read/write, which is expensive
- Limited schema evolution (append columns only)
- Historization/versioning requires manual partitioning
- Query optimization depends on folder structure, partitioning, and file resizing
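The projection/predicate pushdown advantage can be illustrated with a toy columnar layout. This is plain Python, not Parquet itself, and the movie data is invented; the point is that a columnar reader touches only the columns a query needs, instead of decoding whole rows.

```python
# Row layout: every full row must be read even to fetch one field.
rows = [
    {"title": "Heat", "year": 1995, "rating": 8.3},
    {"title": "Ronin", "year": 1998, "rating": 7.2},
    {"title": "Collateral", "year": 2004, "rating": 7.5},
]

# Columnar layout: one contiguous list per column, as in Parquet.
columns = {
    "title": ["Heat", "Ronin", "Collateral"],
    "year": [1995, 1998, 2004],
    "rating": [8.3, 7.2, 7.5],
}

def scan_columnar(cols, wanted, predicate_col, predicate):
    """Projection pushdown: load only the 'wanted' columns.
    Predicate pushdown: filter on one column before building rows."""
    keep = [i for i, v in enumerate(cols[predicate_col]) if predicate(v)]
    return [{c: cols[c][i] for c in wanted} for i in keep]

# Titles of movies rated above 7.4 -- touches two of three columns.
result = scan_columnar(columns, ["title"], "rating", lambda r: r > 7.4)
```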
Delta Lake ensures data reliability
Batch and streaming writes, including updates/deletes, land in Parquet files governed by a transactional log, yielding high-quality, reliable data that is always ready for analytics.
Key features:
- ACID transactions
- Schema enforcement
- Unified batch & streaming
- Time travel / data snapshots
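A conceptual sketch of how a transactional log enables time travel: each commit records the files added and removed, so any past table version can be reconstructed by replaying the log. This only illustrates the idea; Delta Lake's actual log is a sequence of JSON commit files plus Parquet checkpoints, and the file names here are invented.

```python
log = []  # append-only list of commits (the "transactional log")

def commit(add=(), remove=()):
    """Record one atomic change to the table's file set."""
    log.append({"version": len(log), "add": list(add), "remove": list(remove)})

def files_as_of(version):
    """Replay the log up to 'version' to get that snapshot's live files."""
    live = set()
    for entry in log[: version + 1]:
        live |= set(entry["add"])
        live -= set(entry["remove"])
    return live

commit(add=["part-000.parquet"])   # version 0: initial batch load
commit(add=["part-001.parquet"])   # version 1: streaming append
commit(add=["part-002.parquet"],   # version 2: an update rewrites a file
       remove=["part-000.parquet"])

latest = files_as_of(2)     # current table state
snapshot = files_as_of(0)   # time travel back to version 0
```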
Delta Lake optimizes performance
The Databricks optimized engine reads the Parquet files through the transactional log to deliver highly performant queries at scale.
Key features:
- Indexing
- Compaction
- Data skipping
- Caching
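Data skipping can be sketched in a few lines: per-file min/max statistics let the engine prune files whose value range cannot match the predicate, before reading any data. The file names and statistics below are hypothetical.

```python
# Per-file statistics, as a query planner might keep them.
file_stats = [
    {"file": "part-000.parquet", "min_year": 1980, "max_year": 1989},
    {"file": "part-001.parquet", "min_year": 1990, "max_year": 1999},
    {"file": "part-002.parquet", "min_year": 2000, "max_year": 2009},
]

def files_to_scan(stats, lo, hi):
    """Keep only files whose [min, max] range overlaps [lo, hi];
    everything else is skipped without being opened."""
    return [s["file"] for s in stats
            if s["max_year"] >= lo and s["min_year"] <= hi]

# A query for years 1995-2003 skips the 1980s file entirely.
scanned = files_to_scan(file_stats, 1995, 2003)
```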
Databricks Delta
- Supports conditional updates (ELT)
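The conditional-update semantics Delta supports are those of an upsert: matched keys are updated in place, unmatched source rows are inserted. In Delta itself this is expressed with `MERGE INTO` SQL or the DeltaTable merge API; here plain dicts stand in for tables to show the semantics only.

```python
def merge(target, source, key):
    """Upsert 'source' rows into 'target', matching on 'key':
    matched rows are updated, unmatched rows are inserted."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        merged.setdefault(row[key], {}).update(row)  # update or insert
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "rating": 7.0}, {"id": 2, "rating": 8.1}]
source = [{"id": 2, "rating": 8.4},   # matched  -> update
          {"id": 3, "rating": 6.9}]   # no match -> insert

result = merge(target, source, "id")
```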
Talend Databricks Delta Lake Support
Studio:
- Delta files input & output (Spark batch)
- Integration with the time-travel capability of Delta Lake
- Partitioning options
- Delta tables: create / append / merge
Stitch Data Loader:
- Destination on AWS
Why Talend & Databricks?
Data Scientist
- Likes: Python, neural networks, NLP
- Wants to: build accurate ML models
- Spends 80% of the time finding, cleaning, and reorganizing huge amounts of data
Data Engineer
- Talks about: Kafka, Parquet & Kubernetes
- Wants to: build repeatable, durable data pipelines that integrate data and provide data quality
Movies dataset demo
Talend & Spark on Databricks: Get Data Scientists to Insights Faster
Movies Data Analytics: Demo Use Case Design
Sources on S3 / ADLS / Blob / DBFS (credits, movies, ratings) → cleanse/parse → join/transform → aggregate/filter → targets on S3 / ADLS / Blob / DBFS
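The shape of that pipeline (cleanse/parse, then join/transform, then aggregate/filter) can be sketched in plain Python. The real job runs as Spark on Databricks; the raw lines and ratings below are invented for illustration.

```python
# Raw inputs, as they might arrive from S3 / ADLS / Blob / DBFS.
movies = ["1,Heat,1995", "2,Ronin,1998", "3,,2004"]       # id,title,year
ratings = [(1, 8.0), (1, 8.6), (2, 7.2), (2, 7.2), (3, 7.5)]

# Cleanse/parse: drop rows with a missing title.
parsed = {}
for line in movies:
    mid, title, year = line.split(",")
    if title:
        parsed[int(mid)] = title

# Join/transform + aggregate: average rating per movie title.
totals = {}
for mid, score in ratings:
    if mid in parsed:                      # inner join on movie id
        s, n = totals.get(mid, (0.0, 0))
        totals[mid] = (s + score, n + 1)

averages = {parsed[mid]: round(s / n, 2) for mid, (s, n) in totals.items()}

# Filter: keep only well-rated movies for the output target.
top = {title: r for title, r in averages.items() if r >= 7.5}
```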
Takeaways
- Talend with Databricks helps data engineers and data scientists cooperate
- Serverless Spark support for Studio and Pipeline Designer
- Delta Lake integration: table updates, transaction support, time travel, query optimization