Foundational Methodology for Data Science


IBM Analytics White Paper

In the domain of data science, solving problems and answering questions through data analysis is standard practice. Often, data scientists construct a model to predict outcomes or discover underlying patterns, with the goal of gaining insights. Organizations can then use these insights to take actions that ideally improve future outcomes.

There are numerous rapidly evolving technologies for analyzing data and building models. In a remarkably short time, they have progressed from desktops to massively parallel warehouses with huge data volumes and in-database analytic functionality in relational databases and Apache Hadoop. Text analytics on unstructured or semi-structured data is becoming increasingly important as a way to incorporate sentiment and other useful information from text into predictive models, often leading to significant improvements in model quality and accuracy.

Emerging analytics approaches seek to automate many of the steps in model building and application, making machine learning technology more accessible to those who lack deep quantitative skills. Also, in contrast to the "top-down" approach of first defining the business problem and then analyzing the data to find a solution, some data scientists may use a "bottom-up" approach. With the latter, the data scientist looks into large volumes of data to see what business goal might be suggested by the data and then tackles that problem. Since most problems are addressed in a top-down manner, the methodology in this paper reflects that view.

A 10-stage data science methodology that spans technologies and approaches

As data analytics capabilities become more accessible and prevalent, data scientists need a foundational methodology capable of providing a guiding strategy, regardless of the technologies, data volumes or approaches involved (see Figure 1). This methodology bears some similarities to recognized methodologies [1-5] for data mining, but it emphasizes several of the new practices in data science, such as the use of very large data volumes, the incorporation of text analytics into predictive modeling and the automation of some processes.

The methodology consists of 10 stages that form an iterative process for using data to uncover insights. Each stage plays a vital role in the context of the overall methodology.

What is a methodology?
A methodology is a general strategy that guides the processes and activities within a given domain. Methodology does not depend on particular technologies or tools, nor is it a set of techniques or recipes. Rather, a methodology provides the data scientist with a framework for how to proceed with whatever methods, processes and heuristics will be used to obtain answers or results.

Figure 1. Foundational Methodology for Data Science. (Diagram of the 10-stage iterative cycle; labels legible in the source include Data understanding, Data preparation and Modeling.)

Stage 1: Business understanding
Every project starts with business understanding. The business sponsors who need the analytic solution play the most critical role in this stage by defining the problem, project objectives and solution requirements from a business perspective. This first stage lays the foundation for a successful resolution of the business problem. To help guarantee the project's success, the sponsors should be involved throughout the project to provide domain expertise, review intermediate findings and ensure the work remains on track to generate the intended solution.

Stage 2: Analytic approach
Once the business problem has been clearly stated, the data scientist can define the analytic approach to solving the problem. This stage entails expressing the problem in the context of statistical and machine-learning techniques, so the organization can identify the most suitable ones for the desired outcome. For example, if the goal is to predict a response such as "yes" or "no," then the analytic approach could be defined as building, testing and implementing a classification model.
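To make Stage 2 concrete, here is a minimal sketch (added to this transcription, not part of the original paper) of framing a yes/no prediction as a classification model. It assumes scikit-learn and substitutes synthetic data for a real business data set; every name and parameter is illustrative.

    # Illustrative only: frame a yes/no business question as a classification task.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a real data set: 1,000 records, 10 predictors,
    # and a binary outcome such as "will the customer respond: yes or no?"
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

    # The analytic approach: build and test a classification model.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Holdout accuracy:", model.score(X_test, y_test))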

Stage 3: Data requirements
The chosen analytic approach determines the data requirements. Specifically, the analytic methods to be used require certain data content, formats and representations, guided by domain knowledge.

Stage 4: Data collection
In the initial data collection stage, data scientists identify and gather the available data resources—structured, unstructured and semi-structured—relevant to the problem domain. Typically, they must choose whether to make additional investments to obtain less-accessible data elements. It may be best to defer the investment decision until more is known about the data and the model. If there are gaps in data collection, the data scientist may have to revise the data requirements accordingly and collect new and/or more data.

While data sampling and subsetting are still important, today's high-performance platforms and in-database analytic functionality let data scientists use much larger data sets containing much or even all of the available data. By incorporating more data, predictive models may be better able to represent rare events such as disease incidence or system failure.

Stage 5: Data understanding
After the original data collection, data scientists typically use descriptive statistics and visualization techniques to understand the data content, assess data quality and discover initial insights about the data. Additional data collection may be necessary to fill gaps.

Stage 6: Data preparation
This stage encompasses all activities to construct the data set that will be used in the subsequent modeling stage. Data preparation activities include data cleaning (dealing with missing or invalid values, eliminating duplicates, formatting properly), combining data from multiple sources (files, tables, platforms) and transforming data into more useful variables.

In a process called feature engineering, data scientists can create additional explanatory variables, also referred to as predictors or features, through a combination of domain knowledge and existing structured variables. When text data is available, such as customer call center logs or physicians' notes in unstructured or semi-structured form, text analytics is useful in deriving new structured variables to enrich the set of predictors and improve model accuracy.

Data preparation is usually the most time-consuming step in a data science project. In many domains, some data preparation steps are common across different problems. Automating certain data preparation steps in advance may accelerate the process by minimizing ad hoc preparation time. With today's high-performance, massively parallel systems and analytic functionality residing where the data is stored, data scientists can more easily and rapidly prepare data using very large data sets.
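As an illustration of Stage 6 (again a sketch added for this transcription, not from the paper), the following assumes pandas and two hypothetical source tables; all column names and values are invented for the example.

    import pandas as pd

    # Hypothetical source tables; all names and values are illustrative.
    customers = pd.DataFrame({
        "customer_id": [1, 2, 2, 3],          # note the duplicate record
        "age": [34, None, None, 51],          # note the missing values
        "join_date": ["2013-02-01", "2014-07-15", "2014-07-15", "2012-11-30"],
    })
    calls = pd.DataFrame({
        "customer_id": [1, 1, 2, 3],
        "call_minutes": [12.0, 3.5, 8.0, 20.0],
    })

    # Data cleaning: eliminate duplicates, handle missing values, fix formats.
    customers = customers.drop_duplicates(subset="customer_id")
    customers["age"] = customers["age"].fillna(customers["age"].median())
    customers["join_date"] = pd.to_datetime(customers["join_date"])

    # Combine data from multiple sources: aggregate the call log per customer.
    call_totals = calls.groupby("customer_id", as_index=False)["call_minutes"].sum()
    prepared = customers.merge(call_totals, on="customer_id", how="left")

    # Feature engineering: derive new predictors from existing variables.
    prepared["tenure_days"] = (pd.Timestamp("2015-06-01") - prepared["join_date"]).dt.days
    prepared["minutes_per_day"] = prepared["call_minutes"] / prepared["tenure_days"]
    print(prepared)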

Stage 7: Modeling
Starting with the first version of the prepared data set, the modeling stage focuses on developing predictive or descriptive models according to the previously defined analytic approach. With predictive models, data scientists use a training set (historical data in which the outcome of interest is known) to build the model. The modeling process is typically highly iterative as organizations gain intermediate insights, leading to refinements in data preparation and model specification. For a given technique, data scientists may try multiple algorithms with their respective parameters to find the best model for the available variables.

Stage 8: Evaluation
During model development and before deployment, the data scientist evaluates the model to understand its quality and ensure that it properly and fully addresses the business problem. Model evaluation entails computing various diagnostic measures and other outputs, such as tables and graphs, enabling the data scientist to interpret the model's quality and its efficacy in solving the problem. For a predictive model, data scientists use a testing set, which is independent of the training set but follows the same probability distribution and has a known outcome. The testing set is used to evaluate the model so it can be refined as needed. Sometimes the final model is also applied to a validation set for a final assessment.

In addition, data scientists may apply statistical significance tests to the model as further proof of its quality. This additional proof may be instrumental in justifying model implementation or taking actions when the stakes are high—such as an expensive supplemental medical protocol or a critical airplane flight system.

Stage 9: Deployment
Once a satisfactory model has been developed and is approved by the business sponsors, it is deployed into the production environment or a comparable test environment. Usually it is deployed in a limited way until its performance has been fully evaluated. Deployment may be as simple as generating a report with recommendations, or as involved as embedding the model in a complex workflow and scoring process managed by a custom application. Deploying a model into an operational business process usually involves additional groups, skills and technologies from within the enterprise. For example, a sales group may deploy a response propensity model through a campaign management process created by a development team and administered by a marketing group.

Stage 10: Feedback
By collecting results from the implemented model, the organization gets feedback on the model's performance and its impact on the environment in which it was deployed. For example, feedback could take the form of response rates to a promotional campaign targeting a group of customers identified by the model as high-potential responders. Analyzing this feedback enables data scientists to refine the model to improve its accuracy and usefulness. They can automate some or all of the feedback-gathering and model assessment, refinement and redeployment steps to speed up the process of model refreshing for better outcomes.

Providing ongoing value to the organization
The flow of the methodology illustrates the iterative nature of the problem-solving process. As data scientists learn more about the data and the modeling, they frequently return to a previous stage to make adjustments. Models are not created once, deployed and left in place as is; instead, through feedback, refinement and redeployment, models are continually improved and adapted to evolving conditions. In this way, both the model and the work behind it can provide continuous value to the organization for as long as the solution is needed.
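The sketches below (added for this transcription, not from the paper) show one common way to realize these later stages in code. The first covers Stages 7 and 8 together: trying multiple algorithms with their respective parameters on a training set, then computing diagnostic measures on an independent testing set. It assumes scikit-learn; the data is synthetic and the parameter grids are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, roc_auc_score
    from sklearn.model_selection import GridSearchCV, train_test_split

    # Synthetic data; the testing set is held out and never used for training.
    X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Stage 7: try multiple algorithms, each with its own parameter grid.
    candidates = {
        "logistic_regression": GridSearchCV(
            LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5),
        "random_forest": GridSearchCV(
            RandomForestClassifier(random_state=0),
            {"n_estimators": [100, 300], "max_depth": [None, 10]}, cv=5),
    }

    # Stage 8: evaluate each tuned candidate on the independent testing set.
    for name, search in candidates.items():
        search.fit(X_train, y_train)
        probabilities = search.predict_proba(X_test)[:, 1]
        print(name, "test AUC:", round(roc_auc_score(y_test, probabilities), 3))
        print(classification_report(y_test, search.predict(X_test)))

The second sketch hedges Stages 9 and 10 in their simplest form: batch-scoring records that arrive after deployment and monitoring observed outcomes to decide when the model should be refreshed. The threshold, data and refresh rule are likewise invented for the example.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Stand-in for the approved model from Stages 7 and 8.
    X, y = make_classification(n_samples=1500, n_features=10, random_state=1)
    model = LogisticRegression(max_iter=1000).fit(X[:1000], y[:1000])

    # Stage 9 (simplified deployment): batch-score records that arrive after
    # deployment and report the customers most likely to respond.
    X_new, y_new = X[1000:], y[1000:]
    scores = model.predict_proba(X_new)[:, 1]
    top_targets = np.argsort(scores)[::-1][:20]
    print("Top 20 records to target:", top_targets)

    # Stage 10 (feedback): once actual outcomes are observed, measure live
    # performance and retrain when it falls below an agreed threshold.
    AUC_FLOOR = 0.75  # illustrative threshold set with the business sponsors
    live_auc = roc_auc_score(y_new, scores)
    print("Observed AUC:", round(live_auc, 3))
    if live_auc < AUC_FLOOR:
        model = LogisticRegression(max_iter=1000).fit(X, y)  # refresh on all data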

For more information
A new course on the Foundational Data Science Methodology is available through Big Data University. The free online course is available at: a-science-methodology

For working examples of how this methodology has been implemented in actual use cases, visit:
http://ibm.co/1SUhxFm
http://ibm.co/1IazTvG

Acknowledgements
Thanks to Michael Haide, Michael Wurst, Ph.D., Brandon MacKenzie and Gregory Rodd for their helpful comments, and to Jo A. Ramos for his role in the development of this methodology over our years of collaboration.

About the author
John B. Rollins, Ph.D., is a data scientist in the IBM Analytics organization. His background is in engineering, data mining and econometrics across many industries. He holds seven patents and has authored a best-selling engineering textbook and many technical papers. He holds doctoral degrees in petroleum engineering and economics from Texas A&M University.

© Copyright IBM Corporation 2015
IBM Analytics, Route 100, Somers, NY 10589
Produced in the United States of America, June 2015

IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates. THE INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

1. Brachman, R. & Anand, T., "The process of knowledge discovery in databases," in Fayyad, U. et al., eds., Advances in Knowledge Discovery and Data Mining, AAAI Press, 1996, pp. 37-57.
2. SAS Institute, http://en.wikipedia.org/wiki/SEMMA, www.sas.com/en_us/software/analytics/enterprise-miner.html, www.sas.com/en ning.html
3. Wikipedia, "Cross Industry Standard Process for Data Mining," http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining, http://the-modeling-agency.com/crisp-dm.pdf
4. Ballard, C., Rollins, J., Ramos, J., Perkins, A., Hale, R., Dorneich, A., Milner, E., and Chodagam, J., Dynamic Warehousing: Data Mining Made Easy, IBM Redbook SG24-7418-00, Sep. 2007, pp. 9-26.
5. Gregory Piatetsky, "CRISP-DM, still the top methodology for analytics, data mining, or data science projects," Oct. 28, 2014.

IMW14828-USEN-00
