How To Analyze JSON With SQL - Snowflake Data Cloud


HOW TO ANALYZE JSON WITH SQL
Schema-on-Read made easy
Author: Kent Graziano

Contents
Semi-Structured Brings New Insights to Business
Schema? No Need!
How Snowflake Solved This Problem
Enough Theory. Let's Get Started.
A More Complex Data Load
How to Handle Arrays of Data
How to Handle Multiple Arrays
Aggregations
Filtering Your Data
Schema-on-Read Is a Reality
About Snowflake

SEMI-STRUCTURED BRINGS NEW INSIGHTS TO BUSINESS

If you're an experienced data architect, data engineer, or data analyst, you've probably been exposed to semi-structured data such as JSON. IoT devices, social media sites, and mobile devices all generate endless streams of JSON log files. Handling JSON data is unavoidable, but it can't be managed the same way as the more familiar structured data. Yet, to thrive in today's world of data, knowing how to manage and derive value from this form of semi-structured data is crucial to delivering valuable insights to your organization. One of the key differentiators in Snowflake Cloud Data Platform is the ability to natively ingest semi-structured data such as JSON, store it efficiently, and then access it quickly using simple extensions to standard SQL. This ebook will give you a modern approach to produce analytics easily and affordably from JSON data using SQL.

SCHEMA? NO NEED!
Load your semi-structured data directly into a relational table

Over the last several years, we have all heard the phrase "Schema-on-Read" to explain the benefit of loading semi-structured data, such as JSON, into a NoSQL platform such as Hadoop. The idea here: data modeling and schema design could be delayed until long after you loaded the data. Delaying these tasks avoids slowing down getting the data into a repository because you had to wait for a data modeler to first design the tables.

Schema-on-Read implies there is a knowable schema. So, even though organizations can quickly load semi-structured data into Hadoop or a NoSQL platform, there is still more work required to actually parse the data into an understandable schema before it can be analyzed with a standard SQL-based tool. Experienced data professionals often have the burden of determining the schema and writing code to extract the data. Unlike structured data in a relational database, this requirement impedes an organization's ability to access and utilize semi-structured data in a timely manner.

INSTANTLY QUERY SEMI-STRUCTURED DATA WITH SNOWFLAKE

With Snowflake, you can load your semi-structured data directly into a relational table. Then, you can query that data with a SQL statement and join it to other structured data, while not fretting about future changes to the "schema" of that data. Snowflake keeps track of the self-describing schema so you don't have to; no ETL or fancy parsing algorithms are required.

The built-in support to load and query semi-structured data, including JSON, XML, and AVRO, is one of the remarkable benefits of Snowflake. With most of today's big data environments and traditional, on-premises and cloud-washed data warehouses, you have to first load this type of data to a Hadoop or NoSQL platform. Then you need to parse it, for example, with MapReduce, in order to load it into columns in a relational database. Then, and only then, can you visualize or transform that data using a BI/analytics tool or a Notebook/data science tool. All of this means more time, money, and headache for you to allow business users to see that data.

HOW SNOWFLAKE SOLVED THIS PROBLEM

It's simple. Snowflake has a new data type called VARIANT that allows semi-structured data to be loaded, as is, into a column in a relational table.

When Snowflake loads semi-structured data, it optimizes how it stores that data internally by automatically discovering the attributes and structure that exist in the data, and using that knowledge to optimize how the data is stored. Snowflake also looks for repeated attributes across records, organizing and storing those repeated attributes separately. This enables better compression and faster access, similar to the way that a columnar database optimizes storage of columns of data.

DATA IN, INSIGHT OUT

But that's only half the equation. Once the data is in, how do you get the insight out? Snowflake has extensions to SQL to reference the internal schema of the data. Because the data is self-describing, you can query the components and join the data to columns in other tables, as if you had already parsed the data into a standard relational format, except no coding, ETL, or other parsing is required to prep the data.

In addition, statistics about the subcolumns are also collected, calculated, and stored in Snowflake's metadata repository. This gives Snowflake's advanced query optimizer metadata about the semi-structured data, to optimize access to it. The collected statistics allow the optimizer to use pruning to minimize the amount of data needed for access, thus speeding the return of data.

DATA WAREHOUSING AND ANALYTICS, REIMAGINED FOR THE CLOUD

No other on-premises or cloud-washed solution offers Snowflake's optimized level of support for processing semi-structured data. Even though some traditional vendors have added features to store and access JSON and XML, those are add-ons to legacy code, using existing data types such as character large objects (CLOBs), and they are not natively optimized. With these solutions, getting any kind of performance optimization requires additional performance tuning by DBAs. For example, in its documentation, one of the newer, cloud-washed data warehouse providers states that customers should not try to use their JSON feature at scale. This is yet another example of how cloud-washed legacy code can't magically solve data problems.

The upshot: No Hadoop or NoSQL is needed in your enterprise data landscape for the sole purpose of holding semi-structured data. The result is a modern data platform that uses SQL, which you and your staff already know how to write. And as the data source evolves and changes over time with new attributes, nesting, or arrays, there's no need to redo ETL or ELT code. The VARIANT data type does not care if the schema varies.
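To make that concrete before the walkthrough begins, here is a minimal sketch of the pattern the rest of this guide builds on: a VARIANT column declared like any other column, with its components addressed directly in SQL. The table and attribute names below (raw_events, device, reading) are illustrative placeholders, not part of the guide's sample data; the real worked example starts in the next section.

-- Illustrative only: a VARIANT column in an ordinary relational table.
create or replace table raw_events (v variant);

-- Subcolumns discovered by Snowflake can be queried, cast, and filtered directly.
select
  v:device::string      as device,
  v:reading.temp::float as temp_celsius
from raw_events
where v:device::string = 'sensor-1';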

ENOUGH THEORY. LET'S GET STARTED.
How you can load semi-structured data directly into Snowflake

1. CREATE A TABLE

I already have a Snowflake account, a database, and a multi-cluster warehouse set up, so just like I would in any other database, I simply issue a DDL statement to create a table:

create or replace table json_demo (v variant);

Now I have a table with one column ("v") with a declared data type of VARIANT.

2. LOAD SOME DATA

Now I load a sample JSON document using an INSERT and Snowflake's PARSE_JSON function. We're not simply loading the document as text but rather storing it as an object in the VARIANT data type, while at the same time converting it to an optimized columnar format (to query later):

insert into json_demo
select
parse_json(
'{
  "fullName": "Johnny Appleseed",
  "age": 42,
  "gender": "Male",
  "phoneNumber": {
    "areaCode": "415",
    "subscriberNumber": "5551234"
  },
  "children": [
    { "name": "Jayden",  "gender": "Male",   "age": "10" },
    { "name": "Emma",    "gender": "Female", "age": "8" },
    { "name": "Madelyn", "gender": "Female", "age": "6" }
  ],
  "citiesLived": [
    { "cityName": "London",
      "yearsLived": [ "1989", "1993", "1998", "2002" ]
    },
    { "cityName": "San Francisco",
      "yearsLived": [ "1990", "1993", "1998", "2008" ]
    },
    { "cityName": "Portland",
      "yearsLived": [ "1993", "1998", "2003", "2005" ]
    },
    { "cityName": "Austin",
      "yearsLived": [ "1973", "1998", "2001", "2005" ]
    }
  ]
}');
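As a quick sanity check (not part of the original walkthrough), you can confirm what landed in the VARIANT column. Snowflake's TYPEOF function reports the JSON type it recorded for the value:

-- Optional check: confirm the document loaded and inspect its type.
select typeof(v) as value_type,   -- returns OBJECT for this document
       v                          -- the full JSON document stored in the VARIANT column
from json_demo;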

While this approach is useful for testing, normally JSON would be loaded into a Snowflake table from your Snowflake staging area using a simple COPY command:

copy into myjsontable
  from @my_json_stage/tutorials/dataloading/contacts.json
  on_error = 'skip_file';

For more details on the many options and features of the COPY command, see Snowflake's data loading tutorial.

3. START PULLING DATA OUT

Now that we've loaded an example, let's work through how we access it. We'll start with just getting the data from the fullName subcolumn:

select v:fullName from json_demo;

1 row produced
ROW#  V:FULLNAME
1     "Johnny Appleseed"

Where:
  v           = the column name in the json_demo table (from our create table command)
  fullName    = the attribute in the JSON schema
  v:fullName  = the notation to indicate which attribute in column "v" we want to select

Similar to the table.column notation all SQL people are familiar with, Snowflake has the ability to effectively specify a column within the column: a subcolumn. However, we cannot use the dot notation for our separator, because SQL syntax has already claimed that. So, the Snowflake team chose the next obvious thing: a colon to reference the JSON subcolumns and navigate that hierarchy. This structural information is dynamically derived based on the schema definition embedded in the JSON string. Snowflake's advanced metadata engine records this information at the time it loads the JSON document into the table.

4. CAST THE DATA

Usually we don't want to see the double quotes around the data in the report output unless we're going to create an extract file. Instead, we can cast it to a string and give it a nicer column alias, similar to what we would do with a normal column:

select v:fullName::string as full_name
from json_demo;

1 row produced
ROW#  FULL_NAME
1     Johnny Appleseed

Next, let's look at a bit more of the data using the same syntax from above:

select
  v:fullName::string as full_name,
  v:age::int         as age,
  v:gender::string   as gender
from json_demo;

1 row produced
ROW#  FULL_NAME         AGE  GENDER
1     Johnny Appleseed  42   Male

Again, we use simple SQL and the output is similar to the results from any table you might have built in a traditional data warehouse.

At this point, you could look at a table in Snowflake with a VARIANT column and quickly start "shredding" the JSON with SQL. You can query semi-structured data without learning a new programming language or using a framework required with Hadoop or NoSQL. Instead, you have a much lower learning curve to get the same result.
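Because the casted subcolumns behave like ordinary columns, you can also join them to structured tables, as mentioned earlier in this guide. The sketch below is illustrative only: the loyalty_members table and its columns are hypothetical and not part of the guide's sample data.

-- Hypothetical relational table to join against the JSON data.
create or replace table loyalty_members (
  full_name varchar,
  tier      varchar
);

insert into loyalty_members values ('Johnny Appleseed', 'Gold');

-- Join a casted subcolumn to an ordinary relational column.
select
  j.v:fullName::string as full_name,
  j.v:age::int         as age,
  m.tier
from json_demo j
join loyalty_members m
  on j.v:fullName::string = m.full_name;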

A MORE COMPLEX DATA LOAD
Nested data and adding new attributes

Yes, those examples are very simple. So let's look at something a bit more complex. Notice that the original sample document contains some nesting of the data:

{
  "fullName": "Johnny Appleseed",
  "age": 42,
  "gender": "Male",
  "phoneNumber": {
    "areaCode": "415",
    "subscriberNumber": "5551234"
  },
  ...

How do we pull that apart? We use a very familiar table.column dot notation:

select
  v:phoneNumber.areaCode::string         as area_code,
  v:phoneNumber.subscriberNumber::string as subscriber_number
from json_demo;

Just as fullName, age, and gender are subcolumns, so too is phoneNumber. And subsequently, areaCode and subscriberNumber are subcolumns of the subcolumn. We can pull apart nested objects like this and easily adapt if the schema changes and we add another subcolumn.

WHAT HAPPENS IF THE STRUCTURE CHANGES?

One of the benefits of storing data in JSON is that the schema can easily change. But imagine if, in a subsequent load, the data provider changed the specification to this:

{
  "fullName": "Johnny Appleseed",
  "age": 42,
  "gender": "Male",
  "phoneNumber": {
    "areaCode": "415",
    "subscriberNumber": "5551234",
    "extensionNumber": "24"
  },
  ...

A new attribute, extensionNumber, was added to phoneNumber! What happens to the load? Nothing. It keeps working because we ingest the string into the VARIANT column in the table.

You may ask, "What about the ETL/ELT code?" What code? There is no code, so there's nothing to break. And what about existing reports? They keep working, too. The previous query will work just fine. If you want to see the new column, the SQL needs to be refactored to account for the change:

select
  v:phoneNumber.areaCode::string         as area_code,
  v:phoneNumber.subscriberNumber::string as subscriber_number,
  v:phoneNumber.extensionNumber::string  as extension_number
from json_demo;

In addition, if the reverse happens and an attribute is dropped, the query will not fail. Instead, it simply returns a NULL value. In this way, Snowflake insulates all the code you write from these types of dynamic changes.
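To see this behavior end to end, here is a small sketch (not from the original guide) that loads a hypothetical second document containing the new extensionNumber attribute into the same json_demo table. The refactored query then returns the extension for the new record and NULL for the original one:

-- Hypothetical second document that includes the new extensionNumber attribute.
insert into json_demo
select parse_json(
'{ "fullName": "Jane Doe",
   "age": 38,
   "gender": "Female",
   "phoneNumber": { "areaCode": "503",
                    "subscriberNumber": "5555678",
                    "extensionNumber": "24" } }');

-- The original record simply returns NULL for the attribute it never had.
select
  v:fullName::string                    as full_name,
  v:phoneNumber.extensionNumber::string as extension_number
from json_demo;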

HOW TO HANDLE ARRAYS OF DATA

One of JSON's many cool features is the ability to specify and embed an array of data within the document. In this example, one such array is children:

"children": [
  { "name": "Jayden",  "gender": "Male",   "age": "10" },
  { "name": "Emma",    "gender": "Female", "age": "8" },
  { "name": "Madelyn", "gender": "Female", "age": "6" }
]

You will notice there are three rows in the array and each row has three subcolumns: name, gender, and age. Each of those rows constitutes the value of that array entry, which includes all the subcolumn labels and data. (Remember this for later.) So how do you know how many rows there are if you don't have access to the raw data? Like this:

select array_size(v:children) from json_demo;

The function ARRAY_SIZE determines it for us. To pull the data for each row in the array, we can use the previous dot notation, but with the added specification for the row number of the array located inside the brackets:

select v:children[0].name from json_demo
union all
select v:children[1].name from json_demo
union all
select v:children[2].name from json_demo;

3 rows produced
1  "Jayden"
2  "Emma"
3  "Madelyn"

Hard-coding the array indexes like this works, but Snowflake's FLATTEN table function is a better option. If another element is added to the array, such as a fourth child, we will not have to change the SQL. FLATTEN allows us to determine the structure and content of the array on the fly. This makes the SQL resilient to changes in the JSON document.

You can now get all the array subcolumns and format them just like a relational table:

select
  f.value:name::string   as child_name,
  f.value:gender::string as child_gender,
  f.value:age::string    as child_age
from json_demo, table(flatten(v:children)) f;

3 rows produced
ROW#  CHILD_NAME  CHILD_GENDER  CHILD_AGE
1     Jayden      Male          10
2     Emma        Female        8
3     Madelyn     Female        6
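As a side note not covered above, Snowflake also accepts FLATTEN written with the LATERAL keyword and a named input argument. This sketch is simply an alternative spelling of the query above and returns the same three rows:

-- Equivalent to "table(flatten(v:children)) f", using LATERAL and input =>.
select
  f.value:name::string   as child_name,
  f.value:gender::string as child_gender,
  f.value:age::string    as child_age
from json_demo,
     lateral flatten(input => v:children) f;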

Putting all this together, you can write a query to get the parent's name and the children's names:

select
  v:fullName::string     as parent_name,
  f.value:name::string   as child_name,
  f.value:gender::string as child_gender,
  f.value:age::string    as child_age
from json_demo, table(flatten(v:children)) f;

3 rows produced
ROW#  PARENT_NAME       CHILD_NAME  CHILD_GENDER  CHILD_AGE
1     Johnny Appleseed  Jayden      Male          10
2     Johnny Appleseed  Emma        Female        8
3     Johnny Appleseed  Madelyn     Female        6

If you just want a quick count of children by parent, you do not need to use FLATTEN but instead you refer back to ARRAY_SIZE:

select
  v:fullName::string      as parent_name,
  array_size(v:children)  as number_of_children
from json_demo;

1 row produced
ROW#  PARENT_NAME       NUMBER_OF_CHILDREN
1     Johnny Appleseed  3

Notice no GROUP BY clause is needed because the nested structure of the JSON has naturally grouped the data for us.

HOW TO HANDLE MULTIPLE ARRAYS
Simplifying an array with an array

You may recall there are multiple arrays in the sample JSON string. You can pull from several arrays at once with no problem:

select
  v:fullName::string        as parent_name,
  array_size(v:citiesLived) as cities_lived_in,
  array_size(v:children)    as number_of_children
from json_demo;

1 row produced
ROW#  PARENT_NAME       CITIES_LIVED_IN  NUMBER_OF_CHILDREN
1     Johnny Appleseed  4                3

What about an array within an array? Snowflake can handle that, too. From the sample data, you can see yearsLived is an array nested inside the array described by citiesLived:

"citiesLived": [
  { "cityName": "London",
    "yearsLived": [ "1989", "1993", "1998", "2002" ]
  },
  { "cityName": "San Francisco",
    "yearsLived": [ "1990", "1993", "1998", "2008" ]
  },
  { "cityName": "Portland",
    "yearsLived": [ "1993", "1998", "2003", "2005" ]
  },
  { "cityName": "Austin",
    "yearsLived": [ "1973", "1998", "2001", "2005" ]
  }
]

To pull that data out, we add a second FLATTEN clause that transforms the yearsLived array within the FLATTENed citiesLived array:

select
  cl.value:cityName::string as city_name,
  yl.value::string          as year_lived
from json_demo,
     table(flatten(v:citiesLived)) cl,
     table(flatten(cl.value:yearsLived)) yl;

In this case the second FLATTEN (alias "yl") transforms, or pivots, the yearsLived array for each value returned from the first FLATTEN of the citiesLived array ("cl").

The resulting output shows the year lived by city name:

16 rows produced
ROW#  CITY_NAME      YEAR_LIVED
1     London         1989
2     London         1993
3     London         1998
4     London         2002
5     San Francisco  1990
6     San Francisco  1993
7     San Francisco  1998
8     San Francisco  2008
9     Portland       1993
10    Portland       1998
11    Portland       2003
12    Portland       2005
13    Austin         1973
14    Austin         1998
15    Austin         2001
16    Austin         2005
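A related trick, not shown in the original guide, is that ARRAY_SIZE also works on an array reached through FLATTEN, so you can count the years recorded per city without a second FLATTEN:

-- Count the entries in the nested yearsLived array for each flattened city.
select
  cl.value:cityName::string       as city_name,
  array_size(cl.value:yearsLived) as years_recorded
from json_demo,
     table(flatten(v:citiesLived)) cl;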

Similar to the previous example, you can augment the year-by-city result by adding the parent's name to show who lived where:

select
  v:fullName::string        as parent_name,
  cl.value:cityName::string as city_name,
  yl.value::string          as year_lived
from json_demo,
     table(flatten(v:citiesLived)) cl,
     table(flatten(cl.value:yearsLived)) yl;

16 rows produced
ROW#  PARENT_NAME       CITY_NAME      YEAR_LIVED
1     Johnny Appleseed  London         1989
2     Johnny Appleseed  London         1993
3     Johnny Appleseed  London         1998
4     Johnny Appleseed  London         2002
5     Johnny Appleseed  San Francisco  1990
6     Johnny Appleseed  San Francisco  1993
7     Johnny Appleseed  San Francisco  1998
8     Johnny Appleseed  San Francisco  2008
9     Johnny Appleseed  Portland       1993
10    Johnny Appleseed  Portland       1998
11    Johnny Appleseed  Portland       2003
12    Johnny Appleseed  Portland       2005
13    Johnny Appleseed  Austin         1973
14    Johnny Appleseed  Austin         1998
15    Johnny Appleseed  Austin         2001
16    Johnny Appleseed  Austin         2005

AGGREGATIONS
How to execute standard SQL aggregations on semi-structured data

To answer the question you're probably thinking: Yes! You can even execute standard SQL aggregations on the semi-structured data. So, just as with ANSI SQL, you can do a COUNT(*) and a GROUP BY:

select
  cl.value:cityName::string as city_name,
  count(*)                  as years_lived
from json_demo,
     table(flatten(v:citiesLived)) cl,
     table(flatten(cl.value:yearsLived)) yl
group by 1;

4 rows produced
ROW#  CITY_NAME      YEARS_LIVED
1     London         4
2     San Francisco  4
3     Portland       4
4     Austin         4

You can also create much more complex analyses using the library of standard SQL aggregation and windowing functions, including LEAD, LAG, RANK, and STDDEV.
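As an illustration of those windowing functions (an extension of the guide's example, not taken from it), this sketch ranks each city by the most recent year in its yearsLived array:

-- Rank cities by their most recent year lived, combining GROUP BY,
-- an aggregate, and a window function over the flattened arrays.
select
  cl.value:cityName::string                       as city_name,
  max(yl.value::int)                              as last_year_lived,
  rank() over (order by max(yl.value::int) desc)  as recency_rank
from json_demo,
     table(flatten(v:citiesLived)) cl,
     table(flatten(cl.value:yearsLived)) yl
group by 1;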

FILTERING YOUR DATA
How to focus your data analytics on only the data you need

What if you don't want to return every row in an array? Similar to standard SQL, you add a WHERE clause:

select
  cl.value:cityName::string as city_name,
  count(*)                  as years_lived
from json_demo,
     table(flatten(v:citiesLived)) cl,
     table(flatten(cl.value:yearsLived)) yl
where city_name = 'Portland'
group by 1;

1 row produced
ROW#  CITY_NAME  YEARS_LIVED
1     Portland   4

To make it easier to read the SQL, notice you can even reference the subcolumn alias city_name in the predicate. You can also use the full subcolumn specification cl.value:cityName.
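You can filter on any subcolumn in the same way, not just on flattened array values. The sketch below is an extra example, not from the original guide, restricting the rows by an attribute at the top level of the document:

-- Filter directly on a top-level attribute of the VARIANT column.
select
  v:fullName::string   as parent_name,
  f.value:name::string as child_name
from json_demo,
     table(flatten(v:children)) f
where v:age::int >= 40;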

SCHEMA-ON-READ IS A REALITY
Get access to all your data with the ease of SQL

The examples we've walked through show how very easy it is to load and analyze semi-structured data with SQL, using Snowflake as both your big data and data warehouse solution. Snowflake invented a new, optimized data type, VARIANT, which lives in a relational table structure in a relational database. VARIANT offers native support for querying JSON without the need to analyze the structure ahead of time or design appropriate database tables and columns, subsequently parsing the data string into that predefined schema.

VARIANT provides the same performance as all the standard relational data types. In the examples, you saw easy-to-learn extensions to ANSI-standard SQL for accessing that data in a very flexible, resilient manner. With Snowflake, you get the bonus of on-demand resource scalability that no traditional or cloud-washed data warehouse solution delivers.

With these features, Snowflake gives you a fast path to the enterprise endgame: the true ability to quickly and easily load semi-structured data into a modern cloud data platform and make it available for immediate analysis.

ABOUT SNOWFLAKE

Snowflake delivers the Data Cloud, a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across multiple public clouds. Snowflake's platform is the engine that powers and provides access to the Data Cloud, creating a solution for data warehousing, data lakes, data engineering, data science, data application development, and data sharing. Join Snowflake customers, partners, and data providers already taking their businesses to new frontiers in the Data Cloud. snowflake.com

About the author
Kent Graziano is a recognized industry expert, keynote speaker, and published author in the areas of data modeling, data warehousing, and agile data. He has over 30 years of experience in information technology, including data modeling, data analysis, and relational database design, as well as large scale data warehouse architecture, design, and implementation.

© 2021 Snowflake Inc. All rights reserved. Snowflake, the Snowflake logo, and all other Snowflake product, feature, and service names mentioned herein are registered trademarks or trademarks of Snowflake Inc. in the United States and other countries. All other brand names or logos mentioned or used herein are for identification purposes only and may be the trademarks of their respective holder(s). Snowflake may not be associated with, or be sponsored or endorsed by, any such holder(s).

snowflake.com  #MobilizeYourData
