Tableau Prep Below Decks

Transcription

Welcome

#TC18Tableau Prep: Below DecksDoug ThomaeStaff Software EngineerTableau

GoalThat you understand enough to start investigating the operation of TableauPrep on your own, if you want to.

AgendaAmerican CultureExecuting A Flow (Batch)Measuring CultureInteracting With A FlowOperationalizing and Scaling Tableau Prep

American Culture

What is CultureThe social behavior and norms found in human societiesIt tells us how we should behave and relate to other people and other culturesA “programmed” lens that affects how weinterpret events in our environmentWhat’s dangerous?What’s beneficial?How do we decide?Culture has a large influence on mass behavior but influences individuals in the culture to varying degrees

American “Nations”USA consists of 11 regional culturesBased on history. Outlined in dialect maps, genetic studyComplex mapping from culture to politics, religion and issuesSimilar ideas that came beforeWilber Zelinsky, “Doctrine of First Effective Settlement”Kevin Phillips, Emerging Republican Majority, 1969Joel Garreau, The Nine Nations of North America, 1981David Hackett Fisher, Albion’s Seed, 1989David Hackett Fisher, Champlain’s Dream, 2008Robert Cushing, The Big Sort, 2008 and others

“American Nations” vs. Genome MapHan, Carbonetto, Curtis, Wang, Granka, Byrnes, Noto, Kermany, Myres, Barber, Rand, Song, Roman, Battat,Elyashiv, Guturu, Hong, Chahine, Ball, “Clustering of 770,000 genomes reveals post-colonial populationstructure of North America”, Nature Communications 8, Article number 14238, 07 February 2017

Executing Batch Flows

The Beginning of the Beginning: Filtered Map

Tableau Prep Desktop is Web Client/ServerFront EndElectronTP Front End(Typescript, React,Redux)Back EndHTTPSTP Back End(Spring Collectionof Java Services)Electron’sembedded ChromeAQLTableau QueryPipeline ConnectorPlatformC StackPostgreSQL Server“Customer Database”

Tableau Prep Back End ServicesService NamePurposeCache Analysis ServiceCaching of data for interactive operationConnection ServicePresentation models for connection dialogs.Enables sharing of presentation models/dialogs with rest of Tableau productsFile ServiceStoring and saving .tfl/.tflx documents. Probably should be named “document service”Flow Executor ServiceEntry point for compiling and initiating flows runsFlow Operation ServiceManages binning and brushing during interactive operationFunction Def ServiceRetrieves Tableau function definitions from C stack.Enables sharing of functions/formulas with rest of Tableau products

Tableau Prep Back End ServicesService NamePurposeDesktop Integration ServiceReturns information about installed Tableau Desktop productsLicensing ServiceLicense validation and activationVersioning ServiceDocument versioning (in the “documents from different releases” sense of version)LoomDoc Validator ServiceAnalyzes/validates LoomDoc objectsNode Validator ServiceValidates single nodes by doing a front end compile and returning errors (or not)MRU Flow ServicePersists/retrieves the most recently used documents list

Tableau Prep Back End ServicesService NamePurposeStatus ServiceTracks/returns the status of flows that have been initiated by the Flow Executor Service.Telemetry ServiceGathers/sends telemetry (if the user has chosen that option).

What is a Tableau Prep Flow?Answer 1: It’s the graph displayed in the top paneAnswer 2: It’s a specification defined in loom-lang“loom-lang” is the language that captures flow definitionsIt’s only current textual form is in JSONAnswer 3: It’s a set of specifications for queriesEach node in a flow is a specification for a query (e.g. for a SQL database or in Hyper)When federation is involved, it may be multiple queries

Same Flow Graph, One Level DownContainer NodeInput NodeFilter off HIOutput Node

Flow Document/Loom-Lang{Connection id, unique within a flow“nodes”:{ see next page },“connections”: nnectionType”: “.v1.SqlConnection”,“isPackaged”: falseStandard fields for all connections“name”: “dthomae2.tsi.lan”,“connectionAttributes”: {“server”: “dthomae2.tsi.lan”,“dbname”: “tc18”,“port”: “5432”,“class”: postgres}}}}connectionAttributes differ by connection class

Flow Document/Loom-Lang, continued{Node id, unique within a 4f02bf”:{“nodeType”: “.v1.LoadSql”,“name”: “county to nation map”,“id”: d fields for all nodes“baseType”: “input”,“nextNodes”: ”: “Default”,“nextNodeId”: Namespace”: “Default”}], node type specific fields }},“connections”: { }}

Flow Document/Loom-Lang, 2caa214f02bf”:{“nodeType”: “.v1.LoadSql”,“baseType”: “input”,“nextNodes”: [{“nextNodeId”: “a77 ”,“nextNamespace”: 896686”:{“nodeType”: “.v1.Container”,“baseType”: “container”,“nextNodes”: [{“nextNodeId”: “132 ”,“nextNamespace”: “Default”}],“loomContainer”: {“nodes”: {“120daf25-3ae2-4f11-b83d-b5c87651edfd”: {“nodeType” : “.v1.RangeFilter”,“baseType”: “transform”,“nextNodes” : []}}, }}}

Loom-Lang, continuedAll nodes have:A type, which has a version component and a type nameA nameAn id which is unique within the flowA base type, one of input, output, transform, container, and supernode (another type of container) followed by node type specific fieldsEvery node input and output exists in a namespace:Namespaces are how Tableau Prep keeps duplicate column names straightSingle input/single output nodes use the “Default” namespaceJoin nodes have an incoming “Left” and “Right” namespaceGeneral multi-input nodes (e.g. Unions) generate guids as namespaces

Compilation and QueriesFront end compilerFlow executorServiceLoomengineError infoPrecompilationBuild nodeand type infoDatabase agnostic “logical” queryBack end compilerCreate nodeLogical andphysical modelsBuildexecutionplanAQLRunnerTableau data ndentSQL queryPostgreSQLdatabase

Logical Query for Our Flow logical-query selects field [stcou] /field .other fields /selects projectOp class \"logical-operator\" expressions binding name [stcou] identifierExp identifier \"[stcou]\" class \"logical-expression\"/ /binding other fields projectOp selectOp class \"logical-operator\" predicate funcallExp function \"!\" shape \"scalar\" class \"logical-expression\" funcallExp function \"&&\" shape \"scalar\" class \"logical-expression\" funcallExp function \" \" shape \"scalar\" class \"logical-expression\" identifierExp identifier \"[state]\" class \"logical-expression\"/ literalExp value \""HI"\" datatype \"string\" class \"logical-expression\"/ /funcallExp funcallExp function \"!\" shape \"scalar\" class \"logical-expression\" funcallExp function \"ISNULL\" shape \"scalar\" class \"logical-expression\" identifierExp identifier \"[state]\" class \"logical-expression\"/ /funcallExp /funcallExp /funcallExp /funcallExp /predicate table and field name information /logical-query

PostgreSQL " AS "stcou","e1b673e1-afa9-47aa-bbac-0b12dc"."county" AS "county","e1b673e1-afa9-47aa-bbac-0b12dc"."state" AS "state","e1b673e1-afa9-47aa-bbac-0b12dc"."nation" AS "nation“FROM "public"."county to nation map" "e1b673e1-afa9-47aa-bbac-0b12dc“WHERE (NOT (("e1b673e1-afa9-47aa-bbac-0b12dc"."state" 'HI’)AND (NOT ("e1b673e1-afa9-47aa-bbac-0b12dc"."state" IS NULL))))

We Use Hyper Under the Covers a LotLocal Files (e.g. .csv, .xls) are put into Hyper:Connector creates a table in Hyper and transfers data into itQueries are then generated for Hyper, just as if it was any other databaseHyper is used for federationFederation brings together data in one place to do cross database joinsHyper is the place where the data is brought togetherFederation for Tableau Prep is exactly the same as it is for other Tableau products

Handling Local Files, continuedSame data comes from .csv instead of PostgreSQL, Hyper sees:1) At ingestion time a table is created and data copied into itCREATE TABLE "TableauTemp"."CountyToNationMapUSA#csv" ("STCOU" BIGINT,"County" TEXT COLLATE "en US","State" TEXT COLLATE "en US","Nation\" TEXT COLLATE "en US")COPY "TableauTemp"."CountyToNationMapUSA#csv" ("STCOU", "County", "State", "Nation")FROM STDIN WITH (FORMAT HYPERBINARY, SANITIZE)2) Later, when the query happensSELECT "-1384900078"."STCOU" AS "STCOU“,"-1384900078".\"County\" AS \"County\","-1384900078"."State" AS "State","-1384900078"."Nation" AS "Nation”FROM "TableauTemp"."CountyToNationMapUSA#csv" "-1384900078“WHERE (NOT (("-1384900078"."State" 'HI') AND (NOT ("-1384900078"."State" IS NULL)))

Measuring Culture

Hofstede’s Cultural DimensionsGeert Hofstede devised a set of dimensionsthat can be used to compare cultures:Power Distance—degree of acceptance of unequal powerIndividualism vs. collectivism—degree of integration into groupsUncertainty avoidance—a society’s tolerance for things outside the status quoMasculinity vs. femininity—degree of preference for achievement, heroism, assertivenessLong-term orientation—degree to which a society is able/willing to adapt or changeIndulgence vs. restraint—degree to which behavior is controlled by social norms

Hofestede’s Dimensions Can Be MeasuredHigh power distance:Greater income inequalitySmaller middle classDictatorships or oligarchiesViolence in national politicsPolitical systems changed by revolutionBusiness executives olderInnovations only when supported by hierarchyLow power distance:Smaller income inequalityLarger middle classSeparation of powersPeaceful political conflict resolutionPolitical systems changed by evolutionBusiness executives youngerSpontaneous innovationsNote that these are measures that allowcultures to be compared, not absolute indices

Gini CoefficientThe Gini coefficient measures incomeinequality:

Interacting With Flows

Prepping ACS Gini Data

BinningBinning produces the vertical “bar chart”of values in the profile pane:The Flow Operation service is called to generate the binned valuesIt uses the Flow Executor service to generate the actual queriesBinning uses various “bin strategies”to decide how to do the actual binning:A bin strategy decides how to select the values/ranges that will be shown to the userWhen the user clicks on the node a bin strategy is chosen based on the type of the columnThe final binning operation is a count of something, although continuousranges need to be partitioned first

Walking Through the Gini Flow (Binning)

The Hyper ViewSELECT COUNT(1) AS Measure,Hyper uses PostgreSQL,conventions, this is sameas count(*)t0.Dimension AS DimensionFROM (SELECT hyper.GEO.id AS GEO.id,hyper.HD02 VD01 AS HD02 VD01,hyper.GEO.display-label AS GEO.display-label,hyper.HD01 VD01 AS HD01 VD01,hyper.File Paths AS File Paths,hyper.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82,hyper.GEO.id2 AS GEO.id2,hyper.HD02 VD01 AS DimensionFROM FO7bI -Default hyperLIMIT 1048576) t0WHERE ((NOT (t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 IS NULL))Interactive opswill always limitto default 1M rowsTP issues one queryper columns, the onespecified “as Dimension”changes in each oneThis has to do withpaging data to the UIAND (t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 0)AND ((t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 IS NULL)OR (t0.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 8868)))GROUP BY 2ORDER BY Dimension ASC NULLS FIRSTNull is always sortedto the top

BrushingBrushing is binning with a condition:The user picks the condition by clicking on a valueBrushing and binning are both performed by “analyzers”. Binners are a special case analyzer.There are null values for the Gini coefficient values. Where do they come from?

Walking Through the Gini Flow (Brushing)

The Hyper ViewThe query generated when Gini Coefficient null value was selectedSELECT COUNT(1) AS Measure,t0.Dimension AS DimensionFROM (It’s still the old name –SELECT hyper.GEO.id AS GEO.id,the column name is stillhyper.HD02 VD01 AS HD02 VD01,the same at the db leveland TP knows thathyper.GEO.display-label AS GEO.display-label,hyper.HD01 VD01 AS HD01 VD01,hyper.File Paths AS File Paths,hyper.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82,hyper.GEO.id2 AS GEO.id2,hyper.HD02 VD01 AS DimensionFROM FO7bI -Default hyperLIMIT 1048576It’s the binning query) t0with a condition addedWHERE (t0.HD01 VD01 IS NULL)GROUP BY 2ORDER BY Dimension ASC NULLS FIRST

The Hyper ViewAfter the exclusion of Geography is added to the recipe the binning query looks like:SELECT t0.GEO.id AS GEO.id,t0.File Paths AS File Paths,t0.GEO.id2 AS GEO.id2,t0.Gini Coefficient Error AS Gini Coefficient Error,t0.Gini Coefficient AS Gini Coefficient,The query handlesthe column renamet0.GEO.display-label AS d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82FROM (SELECT hyper.GEO.id AS GEO.id,hyper.HD02 VD01 AS HD02 VD01,hyper.GEO.display-label AS GEO.display-label,hyper.HD01 VD01 AS HD01 VD01,hyper.File Paths AS File Paths,hyper.c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82 AS c0f2a6a2-fc5e-4978-bdaf-87d54d3fba82,hyper.GEO.id2 AS GEO.id2,hyper.HD02 VD01 AS Gini Coefficient Error,hyper.HD01 VD01 AS Gini CoefficientFROM FO7bI -Default hyperLIMIT 1048576) t0WHERE (NOT ((t0.GEO.display-label 'Geography') AND (NOT (t0.GEO.display-label IS NULL))))LIMIT 1000Conditions from therecipe are includedin future binning andbrushing queries

Log File Locations“My Tableau Prep Repository”“Logs” directoryhyperd.log – the operations (including queries) as seen by Hyperlog.txt – the operations (including queries) that are sent by Tableau PrepDownload Tableau Log erExercise For the Interested:What query is sent to get “what’s in/what’s out” data in a join node?Hint: Search for “FULL” in hyperd.log in Tableau Log Viewer or text editor

Getting Started With the Exercise

Walking Through the Gini Flow (Join)

How Do Gini Coefficients Compare?

Operationalizing andScaling Tableau Prep

Tableau Prep Conductor – V1Add Tableau Prep Capabilities to Tableau Server:Use the same Flow Executor Service used by Tableau Prep DesktopUse the same Versioning Service used by Tableau Prep DesktopAdd:Flow Orchestrator Service – to set up connections needed by Flow Executor ServiceFlow Publishing Service – API to publish flowsFlow Service – API for UI to retrieve flow inputs/outputs, decrypt credentials and other functionsExtend/Using Existing Tableau Server Mechanism:Job type to schedule flowsBackgrounder support for running flowsSecure credential storageEnforcement of permissionsExtension of content types (e.g. data sources, workbooks, flows)Extension of administrative views

Tableau Prep Conductor – Post V1Enable Web Authoring:Port most remaining servicesPort existing web UIServer version of Hyper cachingImprove Scheduling and Resource Management:Part of larger data platform effortsTrigger flow runs when inputs are updated

Scaling to Larger DatasetsOutput To Database:Tableau Prep currently outputs to local files or data sources (hyper, csv)For large datasets move computation to data:- Generate a query using existing mechanisms- Wrap it in an upsert, send to database. Tableau systems never handle data at all in batch runs.Augmenting Data Warehouse/Lake With Local Data:Don’t pull down the big dataset to federate with the local dataPush the smaller, local data to a temp table to work with larger datasetIncremental Update and Query:Large Data Warehouse/Lake datasets are built one hourly/daily/weekly/etc. update at a timeParameterized Data PullsIncremental Upserts

Finishing Up

What Should I Remember?Tableau Prep flows are specifications that get turned into queriesTableau Prep Desktop is actually a client/server systemTableau Prep is built on top of the Tableau data platformTableau Prep is architected to scale although many of the mechanisms aren’t built out yet

S E S S I O N R E P E AT STableau Prep Below DecksTue 10/23 2:15 – 3:15 MCCNO – L2 - 297Wed 10/24 12:00 – 1:00 MCCNO – L2 - 263

R E L AT E D S E S S I O N SPreparing Your Data the Tableau Prep WayThu 10/25 12:30 – 1:30 MCCNO – L3 - 388How Aggregate Friends and Influence PivotsWed 10/24 3:30 – 4:30 MCCNO – L2 – New Orleans Theater A

Please complete thesession survey from theSession Details screenin your TC18 app

#TC18Thank you!Douglas Thomaedthomae@tableau.com

.tfl/.tflx FilesBoth are always in zip format:Some older output files from Tableau Alpha were JSON filesThe x on .tflx files is a hint that they contain data files, but has no other significanceYou can open them up with a standard zip utility.They’re not encrypted and will never contain secrets like passwordsThe content is segmented by zip stream:maestroMetadata – other stream names, “document versions”displaySettings – data or config that affects the way things are displayed (e.g. column order)flow – the flow definition in loom-langdata files (streams named using a guid to avoid name collision)

Tableau Prep Desktop is Web Client/Server Electron's embedded Chrome Electron TP Front End (Typescript, React, Redux) TP Back End (Spring Collection of Java Services) Tableau Query Pipeline Connector Platform HTTPS AQL PostgreSQL Server "ustomer Database" Front End Back End C Stack. Tableau Prep Back End Services Service Name Purpose Cache Analysis Service Caching of data for .