Principles Of Data Wrangling - Banking Tech

Transcription

ComplimentsofPrinciples ofData WranglingPRACTICAL TECHNIQUES FOR DATA PREPARATIONTye Rattenbury, Joe Hellerstein,Jeffrey Heer, Sean Kandel & Connor Carreras

Principles of Data WranglingPractical Techniques for Data PreparationTye Rattenbury, Joseph M. Hellerstein, Jeffrey Heer,Sean Kandel, and Connor CarrerasBeijingBoston Farnham SebastopolTokyo

Principles of Data Wranglingby Tye Rattenbury, Joseph M. Hellerstein, Jeffrey Heer, Sean Kandel, and Connor CarrerasCopyright 2017 Trifacta, Inc. All rights reserved.Printed in the United States of America.Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions arealso available for most titles (http://oreilly.com/safari). For more information, contact our corporate/insti‐tutional sales department: 800-998-9938 or corporate@oreilly.com.Editor: Shannon CuttProduction Editor: Kristen BrownCopyeditor: Bob Russell, Octal Publishing, Inc.Proofreader: Christina EdwardsInterior Designer: David FutatoCover Designer: Karen MontgomeryIllustrator: Rebecca DemarestFirst EditionMay 2017:Revision History for the First Edition2017-04-25:First ReleaseThe O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Principles of Data Wrangling, the coverimage, and related trade dress are trademarks of O’Reilly Media, Inc.While the publisher and the authors have used good faith efforts to ensure that the information andinstructions contained in this work are accurate, the publisher and the authors disclaim all responsibilityfor errors or omissions, including without limitation responsibility for damages resulting from the use ofor reliance on this work. Use of the information and instructions contained in this work is at your ownrisk. If any code samples or other technology this work contains or describes is subject to open sourcelicenses or the intellectual property rights of others, it is your responsibility to ensure that your usethereof complies with such licenses and/or rights.978-1-491-98904-3[LSI]

Table of ContentsForeword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix1. Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1Magic Thresholds, PYMK, and User Growth at Facebook32. A Data Workflow Framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7How Data Flows During and Across ProjectsConnecting Analytic Actions to Data Movement: A Holistic WorkflowFramework for Data ProjectsRaw Data Stage Actions: Ingest Data and Create MetadataIngesting Known and Unknown DataCreating MetadataRefined Data Stage Actions: Create Canonical Data and Conduct Ad HocAnalysesDesigning Refined DataRefined Stage Analytical ActionsProduction Data Stage Actions: Create Production Data and Build AutomatedSystemsCreating Optimized DataDesigning Regular Reports and Automated Products/ServicesData Wrangling within the Workflow Framework811121214232426282929303. The Dynamics of Data Wrangling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31Data Wrangling DynamicsAdditional Aspects: Subsetting and SamplingCore Transformation and Profiling ActionsData Wrangling in the Workflow FrameworkIngesting Data3132343636v

Describing DataAssessing Data UtilityDesigning and Building Refined DataAd Hoc ReportingExploratory Modeling and ForecastingBuilding an Optimized DatasetRegular Reporting and Building Data-Driven Products and Services373737383939404. Profiling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43Overview of ProfilingIndividual Value Profiling: Syntactic ProfilingIndividual Value Profiling: Semantic ProfilingSet-Based ProfilingProfiling Individual Values in the Candidate Master FileSyntactic Profiling in the Candidate Master FileSet-Based Profiling in the Candidate Master File434444454647485. Transformation: Structuring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51Overview of StructuringIntrarecord Structuring: Extracting ValuesPositional ExtractionPattern ExtractionComplex Structure ExtractionIntrarecord Structuring: Combining Multiple Record FieldsInterrecord Structuring: Filtering Records and FieldsInterrecord Structuring: Aggregations and PivotsSimple AggregationsColumn-to-Row PivotsRow-to-Column Pivots51525254555657575859596. Transformation: Enriching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61UnionsJoinsInserting MetadataDerivation of ValuesGenericProprietary6162636363647. Using Transformation to Clean Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67Addressing Missing/NULL ValuesAddressing Invalid Valuesvi Table of Contents6767

8. Roles and Responsibilities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69Skills and ResponsibilitiesData EngineerData ArchitectData ScientistAnalystRoles Across the Data Workflow FrameworkOrganizational Best Practices697071717273749. Data Wrangling Tools. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77Data Size and InfrastructureData StructuresExcelSQLTrifacta WranglerTransformation ParadigmsExcelSQLTrifacta WranglerChoosing a Data Wrangling Tool78787979797980808182Table of Contents vii

ForewordThrough the last decades of the twentieth century and into the twenty-first, data waslargely a medium for bottom-line accounting: making sure that the books were bal‐anced, the rules were followed, and the right numbers could be rolled up for execu‐tive decision-making. It was an era focused on a select group of IT staff engineeringthe “golden master” of organizational data; an era in which mantras like “garbage in,garbage out” captured the attitude that only carefully engineered data was useful.Attitudes toward data have changed radically in the past decade, as new people, pro‐cesses, and technologies have come forward to define the hallmarks of a data-drivenorganization. In this context, data is a medium for top-line value generation, provid‐ing evidence and content for the design of new products, new processes, and ever‐more efficient operation. Today’s data-driven organizations have analysts workingbroadly across departments to find methods to use data creatively. It is an era inwhich new mantras like “extracting signal from the noise” capture a different attitudeof agile experimentation and exploitation of large, diverse sources of data.Of course, accounting still needs to get done in the twenty-first century, and the needremains to curate select datasets. But the data sources and processes for accountancyare relatively small and slow to change. The data that drives creative and exploratoryanalyses represents an (exponentially!) growing fraction of the data in most organiza‐tions, driving widespread rethinking of processes for data and computing—includingthe way that IT organizations approach their traditional tasks.The phrase data wrangling, born in the modern context of agile analytics, is meant todescribe the lion’s share of the time people spend working with data. There is a com‐mon misperception that data analysis is mostly a process of running statistical algo‐rithms on high-performance data engines. In practice, this is just the final step of alonger and more complex process; 50 to 80 percent of an analyst’s time is spent wran‐gling data to get it to the point at which this kind of analysis is possible. Not only doesdata wrangling consume most of an analyst’s workday, it also represents much of theanalyst’s professional process: it captures activities like understanding what data isix

available; choosing what data to use and at what level of detail; understanding how tomeaningfully combine multiple sources of data; and deciding how to distill the resultsto a size and shape that can drive downstream analysis. These activities represent thehard work that goes into both traditional data “curation” and modern data analysis.And in the context of agile analytics, these activities also capture the creative and sci‐entific intuition of the analyst, which can dictate different decisions for each use caseand data source.We have been working on these issues with data-centric folks of various stripes—from the IT professionals who fuel data infrastructure in large organizations, to pro‐fessional data analysts, to data-savvy “enthusiasts” in roles from marketing to journal‐ism to science and social causes. Much is changing across the board here. This bookis our effort to wrangle the lessons we have learned in this context into a coherentoverview, with a specific focus on the more recent and quickly growing agile analyticprocesses in data-driven organizations. Hopefully, some of these lessons will help toclarify the importance—and yes, the satisfaction—of data wrangling done well.x Foreword

CHAPTER 1IntroductionLet’s begin with the most important question: why should you read this book? Theanswer is simple: you want more value from your data. To put a little more meat onthat statement, our objective in writing this book is to help the variety of people whomanage the analysis or application of data in their organizations. The data might ormight not be “yours,” in the strict sense of ownership. But the pains in extractingvalue from this data are.We’re focused on two kinds of readers. First are people who manage the analysis andapplication of data indirectly—the managers of teams or directors of data projects.Second are people who work with data directly—the analysts, engineers, architects,statisticians, and scientists.If you’re reading this book, you’re interested in extracting value from data. We cancategorize this value into two types along a temporal dimension: near-term value andlong-term value. In the near term, you likely have a sizable list of questions that youwant to answer using your data. Some of these questions might be vague; for example,“Are people really shifting toward interacting with us through their mobile devices?”Other questions might be more specific: “When will our customers’ interactions pri‐marily originate from mobile devices instead of from desktops or laptops?”What is stopping you from answering these questions? The most common answer wehear is “time.” You know the questions, you know how to answer them, but you justdon’t have enough hours in the day to wrangle your data into the right form.Beyond the list of known questions related to the near-term value of your data is theoptimism that your data has greater potential long-term value. Can you use it to fore‐cast important seasonal changes? What about risks in your supply chain due toweather or geopolitical shifts? Can you understand how the move to mobile is affect‐ing your customers’ purchasing patterns? Organizations generally hire data scientists1

to take on these longer-term, exploratory analyses. But even if you have the requisiteskills to tackle these kinds of analyses, you might still struggle to be allocated suffi‐cient time and resources. After all, exploratory analytics projects can take months,and often contain a nontrivial risk of producing primarily negative or ambiguousresults.As we’ve seen, the primary impediment to realizing both the short-term and longterm value of your data is time: your limited time and your organization’s limitedtime. In this book, we describe how improving your data wrangling efforts can createthe time required to get more near-term and long-term value from your data. InChapters 1-3, we describe a workflow framework that links activities focused on bothkinds of value, and explain how data wrangling factors into those activities and intothe overall workflow framework. We introduce the basic building blocks for a datawrangling project: data flow, data wrangling activities, roles, and responsibilities.These are all elements that you will want to consider, at a high level, when embarkingon a project that involves data wrangling. Our goal is to provide some helpful guid‐ance and tips on how to coordinate your data wrangling efforts, both across multipleprojects by making sure your wrangling efforts are constructive as opposed to redun‐dant or conflicting, and within a single project by taking advantage of some standardlanguage and operations to increase productivity and consistency.There’s more to effective data wrangling than just clearly defined workflows and pro‐cesses; to most effectively wrangle your data, you should also understand whichtransformation actions constitute data wrangling, and, most important, how you canuse those transformations to produce the best datasets for your analytic activities.Those nitty-gritty transformations constitute our discussion in Chapters 4-7. You canthink of those chapters as a rough “how-to” guide for data wrangling. That said, wedo not intend this book to provide a comprehensive tutorial on all possible datawrangling methods. Instead, we want to give you a collection of techniques that youcan use when moving through the stages of the data workflow framework.As we introduce each of the key transformation and profiling activities that comprisedata wrangling, we will walk through a theoretical data project involving a publiclyavailable dataset containing US campaign finance information. You can walk throughthe project along with us in your data wrangling tool of choice.Finally, we end by discussing roles and responsibilities in a data wrangling project inChapter 8, and exploring a selection of data wrangling tools in Chapter 9.Throughout the book, we grou

Je rey Heer, Sean Kandel & Connor Carreras Principles of Data Wrangling PRACTICAL TECHNIQUES FOR DATA PREPARATION f. Tye Rattenbury, Joseph M. Hellerstein, Jeffrey Heer, Sean Kandel, and Connor Carreras Principles of Data Wrangling Practical Techniques for Data Preparation Beijing Boston Farnham Sebastopol Tokyo. 978-1-491-98904-3 [LSI] Principles of Data Wrangling by Tye Rattenbury,