Polybase In Action

Transcription

Polybase In Action
Kevin Feasel
Engineering Manager, Predictive Analytics
ChannelAdvisor

Who Am I? What Am I Doing Here?
Catallaxy Services
Curated SQL
We Speak Linux
@feaselkl

Polybase
Polybase is Microsoft's newest technology for connecting to remote servers. It started by letting you connect to Hadoop and has expanded since then to include Azure Blob Storage. Polybase is also the best method to load data into Azure SQL Data Warehouse.

Polybase Targets
- SQL Server to Hadoop (Hortonworks or Cloudera, on-prem or IaaS)
- SQL Server to Azure Blob Storage
- Azure Blob Storage to Azure SQL Data Warehouse
In all three cases, you can use the T-SQL you know rather than a similar SQL-like language (e.g., HiveQL or SparkSQL) or some completely different language.

Polybase Targets – SQL Server 2019
- SQL Server to SQL Server
- SQL Server to Oracle
- SQL Server to MongoDB
- SQL Server to Teradata
- SQL Server to ODBC (e.g., Spark)

Massively Parallel Processing
Polybase extends the idea of Massively Parallel Processing (MPP) to SQL Server. SQL Server is a classic "scale-up" technology: if you want more power, add more RAM/CPUs/resources to the single server.
Hadoop is a great example of an MPP system: if you want more power, add more servers; the system will coordinate processing.

Why MPP?
It is cheaper to scale out than to scale up: 10 systems with 256 GB of RAM and 8 cores apiece are a lot cheaper than one system with 2.5 TB of RAM and 80 cores.
At the limit, you eventually run out of room to scale up, but scale-out is much more practical: you can scale out to 2 petabytes of RAM, but good luck finding a single server that supports this amount!
There is additional complexity involved, but MPP systems let you move beyond the power of a single server.

Polybase As MPP
MPP requires a head node and one or more compute nodes. Polybase lets you use SQL Servers as the head and compute nodes. Scale-out servers must be on an Active Directory domain. The head node must be Enterprise Edition, though the compute nodes can be Standard Edition.
Polybase lets SQL Server compute nodes talk directly to Hadoop data nodes, perform aggregations, and then return results to the head node. This removes the classic SQL Server single point of contention.

Timeline
- Introduced in the SQL Server Parallel Data Warehouse (PDW) edition, back in 2010.
- Expanded in the SQL Server Analytics Platform System (APS) in 2012.
- Released to the "general public" in SQL Server 2016, with most support being in Enterprise Edition.
- Extended support for additional technologies (like Oracle, MongoDB, etc.) will be available in SQL Server 2019.

Motivation
Today's talk will focus on using Polybase to integrate SQL Server 2016/2017 with Hadoop and Azure Blob Storage.
We will use a couple of smaller data sources to give you an idea of how Polybase works. Despite the size of the demos, Polybase works best with a significant number of compute nodes, and Hadoop works best with a significant number of data nodes.

Installation Pre-Requisites
1. SQL Server 2016 or later, Enterprise or Developer Edition
2. Java Runtime Environment 7 Update 51 or later (get the latest version of 8 or 9; using JRE 9 requires SQL Server 2017 CU4)
3. Machines must be on a domain if you want to use scale-out
4. Polybase may only be installed once per server. If you have multiple instances, choose one. You can enable it on multiple VMs, however.

Installation
Select the “New SQL Server stand-alone installation” link in the SQL Server installer.

Installation
When you get to feature selection, check the “PolyBase Query Service for External Data” box.

Installation
If you get the following error, you didn’t install the Java Runtime Environment. If you have JRE 9, you need SQL Server 2017 CU4 or later for SQL Server to recognize this.

Installation
For standalone installation, select the first radio button. This selection does not require your machine be connected to a domain.

Installation
The Polybase engine and data movement service accounts are NETWORK SERVICE accounts by default. There are no virtual accounts for Polybase.

Installation
After installation is complete, run the following against the SQL Server instance:

EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE;
GO

Set the value to 6 for Cloudera’s Hadoop distribution, or 7 for Hortonworks or Azure Blob Storage.
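You can verify the setting with a quick metadata query (this uses the standard sys.configurations catalog view; note that a change to 'hadoop connectivity' takes effect only after restarting SQL Server, so value_in_use may lag until then):

SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = N'hadoop connectivity';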

Hadoop Configuration
First, we need to make sure our Hadoop and SQL Server configuration settings are in sync. We need to modify the yarn-site.xml and mapred-site.xml configuration files. If you do not do this correctly, then MapReduce jobs will fail!

Hadoop Configuration
You will need to find the Hadoop configuration folder that came as part of the Polybase installation. By default, that is at:

C:\Program Files\Microsoft Hadoop\conf

Inside this folder, there are two files we care about.

Hadoop Configuration
Next, go looking for your Hadoop installation directory. On HDP, you'll find it at:

/usr/hdp/[version]/hadoop/conf/

Note that the Polybase docs use /usr/hdp/current, but this is a bunch of symlinks with the wrong directory structure.

Hadoop Configuration
Modify yarn-site.xml and change the yarn.application.classpath property. For the Hortonworks distribution of Hadoop (HDP), you’ll see a series of values like:

<value>/usr/hdp/2.4.3.0-227/hadoop/*,/usr/hdp/2.4.3.0-227/hadoop/lib/*,...</value>

Replace 2.4.3.0-227 with your HDP version.

Hadoop Configuration
Include the following snippet in your mapred-site.xml file:

<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/user</value>
</property>
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>/mr-history/done</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/mr-history/tmp</value>
</property>

Without this configured, you will be unable to perform MapReduce operations on Hadoop.

Polybase Basics
In this section, we will look at three new constructs that Polybase introduces: external data sources, external file formats, and external tables.

External Data Source
External data sources allow you to point to another system. There are several external data sources, and we will look at two today.

External Data Source

CREATE EXTERNAL DATA SOURCE [HDP] WITH
(
    TYPE = HADOOP,
    LOCATION = N'hdfs://sandbox.hortonworks.com:8020',
    RESOURCE_MANAGER_LOCATION = N'sandbox.hortonworks.com:8050'
);

The LOCATION is the NameNode host and port and is needed for Hadoop filesystem operations. RESOURCE_MANAGER_LOCATION is the YARN host and port and is needed for predicate pushdown.
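As a sanity check, you can confirm that the data source exists and that the resource manager location was registered by querying the standard catalog view:

SELECT name, type_desc, location, resource_manager_location
FROM sys.external_data_sources;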

External File Format
External file formats explain the structure of a data set. There are several file formats available to us.

External File Format: Delimited File
Delimited files are the simplest to understand but tend to be the least efficient.

CREATE EXTERNAL FILE FORMAT file_format_name
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT
    [ , FORMAT_OPTIONS ( <format_options> [ ,...n ] ) ]
    [ , DATA_COMPRESSION = {
          'org.apache.hadoop.io.compress.GzipCodec'
        | 'org.apache.hadoop.io.compress.DefaultCodec'
      } ]
);

External File Format: Delimited File

<format_options> ::=
{
    FIELD_TERMINATOR = field_terminator
  | STRING_DELIMITER = string_delimiter
  | DATE_FORMAT = datetime_format
  | USE_TYPE_DEFAULT = { TRUE | FALSE }
}
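Putting the grammar together, here is a minimal comma-delimited file format of the kind the external table example later in this talk assumes; the name [TextFileFormat] and the option values are illustrative:

CREATE EXTERNAL FILE FORMAT [TextFileFormat]
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = TRUE
    )
);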

External File Format: RCFile
Record Columnar files are an early form of columnar storage.

CREATE EXTERNAL FILE FORMAT file_format_name
WITH
(
    FORMAT_TYPE = RCFILE,
    SERDE_METHOD = {
          'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
        | 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
      }
    [ , DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec' ]
);

External File Format: ORC
Optimized Row Columnar files are strictly superior to RCFile.

CREATE EXTERNAL FILE FORMAT file_format_name
WITH
(
    FORMAT_TYPE = ORC
    [ , DATA_COMPRESSION = {
          'org.apache.hadoop.io.compress.SnappyCodec'
        | 'org.apache.hadoop.io.compress.DefaultCodec'
      } ]
);
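As a concrete instance of the grammar above (the format name is arbitrary, and Snappy is just one of the allowed codecs):

CREATE EXTERNAL FILE FORMAT [OrcSnappyFileFormat]
WITH
(
    FORMAT_TYPE = ORC,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);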

External File Format: Parquet
Parquet files are also columnar. Cloudera prefers Parquet, whereas Hortonworks prefers ORC.

CREATE EXTERNAL FILE FORMAT file_format_name
WITH
(
    FORMAT_TYPE = PARQUET
    [ , DATA_COMPRESSION = {
          'org.apache.hadoop.io.compress.SnappyCodec'
        | 'org.apache.hadoop.io.compress.GzipCodec'
      } ]
);

External File Format: Comparison

Method    | Good            | Bad                                                    | Best Uses
----------|-----------------|--------------------------------------------------------|----------------------------------------------------------
Delimited | Easy to use     | Less efficient, slower performance                     | Easy Mode
RC File   | Columnar        | Strictly superior options exist                        | Don’t use this
ORC       | Great agg perf  | Columnar not always a good fit; slower to write        | Non-nested files with aggregations of subsets of columns
Parquet   | Great agg perf  | Columnar not always a good fit; often larger than ORC  | Nested data

External Tables
External tables use external data sources and external file formats to point to some external resource and visualize it as a table.

External Tables

CREATE EXTERNAL TABLE [dbo].[SecondBasemen]
(
    [FirstName] [VARCHAR](50) NULL,
    [LastName] [VARCHAR](50) NULL,
    [Age] [INT] NULL,
    [Throws] [VARCHAR](5) NULL,
    [Bats] [VARCHAR](5) NULL
)
WITH
(
    DATA_SOURCE = [HDP],
    LOCATION = N'/tmp/ootp/secondbasemen.csv',
    FILE_FORMAT = [TextFileFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);

External Tables
External tables appear to end users just like normal tables: they have a two-part schema and even show up in Management Studio, though in an External Tables folder.

Demo Time

Querying Hadoop
Once we have created an external table, we can write queries against it just like any other table.
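For example, a completely ordinary T-SQL query against the SecondBasemen external table defined above works unchanged (a minimal sketch; the columns come from that table's definition):

SELECT TOP(10)
    sb.FirstName,
    sb.LastName,
    sb.Age,
    sb.Throws,
    sb.Bats
FROM dbo.SecondBasemen sb
ORDER BY sb.Age DESC;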

Demo Time

Querying Hadoop – MapReduce
In order for us to be able to perform a MapReduce operation, we need the external data source to be set up with a resource manager. We also need one of the two:
1. The internal cost must be high enough (based on external table statistics) to run a MapReduce job.
2. We force a MapReduce job by using the OPTION (FORCE EXTERNALPUSHDOWN) query hint, as sketched below.
Note that there is no "cost threshold for MapReduce," so the non-forced decision is entirely under the Polybase engine's control.
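For instance, to force pushdown of an aggregation against SecondBasemen (a sketch using the hint named above; the complementary hint, DISABLE EXTERNALPUSHDOWN, prevents a MapReduce job):

SELECT
    sb.Throws,
    COUNT(*) AS NumberOfPlayers
FROM dbo.SecondBasemen sb
GROUP BY sb.Throws
OPTION (FORCE EXTERNALPUSHDOWN);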

Querying Hadoop – MapReduce
Functionally, MapReduce queries operate the same as basic queries. Aside from the query hint, there is no special syntax for MapReduce operations, and end users don't need to think about it.
WARNING: if you are playing along at home, your Hadoop sandbox should have at least 12 GB of RAM allocated to it. This is because Polybase creates several 1.5 GB containers on top of the memory requirements for other Hadoop services.

Demo Time

Querying Hadoop – Statistics
Although external tables have none of their data stored on SQL Server, the database optimizer can still make smart decisions by using statistics.

Querying Hadoop – Statistics
Important notes regarding statistics:
1. Stats are not auto-created.
2. Stats are not auto-updated.
3. The only way to update stats is to drop and re-create the stats.
4. SQL Server generates stats by bringing the data over, so you must have enough disk space! If you sample, you only need to bring that percent of rows down.
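Creating statistics uses the normal DDL; here is a minimal sketch against the SecondBasemen table from earlier (the statistics name is arbitrary):

CREATE STATISTICS [st_SecondBasemen_LastName]
ON dbo.SecondBasemen ([LastName])
WITH FULLSCAN;

To sample instead of scanning everything, replace WITH FULLSCAN with something like WITH SAMPLE 25 PERCENT.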

Querying Hadoop – Statistics
Statistics are stored in the same location as any other table's statistics, and the optimizer uses them the same way.

Demo Time

Querying Hadoop – Data Insertion
Not only can we select data from Hadoop, we can also write data to Hadoop. We are limited to INSERT operations; we cannot update or delete data using Polybase.
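Writing to Hadoop requires enabling one more instance-level setting ('allow polybase export'); after that, INSERT is ordinary T-SQL. A minimal sketch, with a made-up sample row:

EXEC sp_configure @configname = 'allow polybase export', @configvalue = 1;
GO
RECONFIGURE;
GO

INSERT INTO dbo.SecondBasemen
    (FirstName, LastName, Age, Throws, Bats)
SELECT 'Eddie', 'Collins', 29, 'R', 'L';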

Demo Time

Azure Blob Storage
Hadoop is not the only data source we can integrate with using Polybase. We can also insert and read data in Azure Blob Storage.
The basic constructs of external data source, external file format, and external table are the same, though some of the options are different.

Azure Blob Storage
Create an external data source along with a database scoped credential (for secure access to this blob):

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '{password}';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybase',
SECRET = '{access key}';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://{container}@{storage account}.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);
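An external table over this data source then looks just like the Hadoop version. The columns and LOCATION below are hypothetical stand-ins for the flights data set, and [TextFileFormat] is the delimited file format sketched earlier:

CREATE EXTERNAL TABLE [dbo].[Flights]
(
    [FlightDate] [VARCHAR](10) NULL,
    [Origin] [VARCHAR](5) NULL,
    [Dest] [VARCHAR](5) NULL,
    [ArrDelay] [INT] NULL
)
WITH
(
    DATA_SOURCE = [WASBFlights],
    LOCATION = N'/2008/',
    FILE_FORMAT = [TextFileFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);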

