Machine Learning Engineering For The Real World - Databricks


EBOOK

Machine Learning Engineering for the Real World

A step-by-step guide to take your ML projects from planning to production

This eBook is based on excerpts from “Machine Learning Engineering in Action” by Ben Wilson, published by Manning.

Introduction

In this eBook, you will learn about the key components of machine learning engineering, how to successfully implement large-scale machine learning projects from scoping to deployment, and the key teams and personas involved in executing successful machine learning initiatives at companies.

Contents

SECTION 1: ML engineering as a concept
1.1 Why ML engineering?
1.2 The core tenets of ML engineering
    Planning
    Scoping and research
    Experimentation
    Development
    Deployment
    Evaluation
1.3 The goals of ML engineering
1.4 Summary

SECTION 2: Your data science could use some engineering
2.1 Augmenting a complex profession with processes that lead to greater success in project work
2.2 A foundation of simplicity
2.3 Co-opting principles of agile software engineering
    Communication and cooperation
    Embracing and expecting change
2.4 The foundation of ML engineering
2.5 Summary

SECTION 1: ML engineering as a concept

Machine learning (ML) is exciting. To the layperson, it brings with it the promise of seemingly magical abilities of soothsaying, uncovering mysterious and miraculous answers to difficult problems. ML makes money for companies, it autonomously tackles overwhelmingly large tasks, and it removes the burdensome task of monotonous work. To state the obvious, though, it’s challenging. Using thousands of algorithms and requiring a diverse skill set ranging from data engineering (DE) to advanced statistical analysis and visualization, the work of a professional ML practitioner is complex and truly intimidating.

ML engineering is the concept of applying a system around this staggering level of complexity. It is a set of standards, tools, processes and methodology that aims to minimize time wasted on abandoned, misguided or irrelevant work when solving a business problem or need. It, in essence, is the roadmap to creating ML-based systems that can not only be deployed to production, but also maintained and updated for years into the future, allowing businesses to reap the rewards in efficiency, profitability and accuracy that ML in general has proven to provide (when done correctly).

This eBook is a roadmap to guide you through this system. As shown in Figure 1.1, it entails a proven set of processes about the planning phase of project work, including navigating the difficult and confusing translation of business needs into the language of ML work. From that, it covers a standard methodology of experimentation work, focusing on the tools and coding standards for creating a minimum viable product (MVP) that will be comprehensive and maintainable. Finally, it covers the various tools, techniques and nuances involved in crafting production-grade maintainable code that is both extensible and easy to troubleshoot.

Machine learning engineering: the most critical (and underrated) element of the machine learning lifecycle

However, ML engineering is not exclusively about the path shown in Figure 1.1. It is also about the methodology within each of these stages, which can make or break a project. It is the way in which a data science team talks to a business about a problem, the manner in which research is done, the details of experimentation, the way the code is written, and the multitude of tools and technology that are employed along the roadmapped path that can greatly reduce the worst fate of any project: abandonment.

The end goal of ML work is, after all, about solving a problem. By embracing the concepts of ML engineering and following the road of effective project work, the path to a useful modeling solution can be shorter and far cheaper, with a much higher probability of succeeding, than if you just “wing it” and hope for the best.

Figure 1.1 The ML engineering roadmap: a set of rules and stages to encourage timely and cost-effective ML project work
- Planning: What do you want built? When do you want it built by? Prevention of the dreaded, “This isn’t what we asked for.”
- Scoping: What are we going to build? What are we going to test? Prevention of “Well, we didn’t realize how complicated or expensive this was going to be.”
- Experimentation: Will this work? How does this compare to other approaches? Prevention of “We’ve been working on a solution for months, but still don’t have a plan of how to solve this.”
- Development: Build on the right environment. Build maintainable code. MLOps integration. Prevention of “Well, the job blew up again from an out of memory error. I guess we just have to rerun it, and hope for the best, right?” Also “I have no idea how this works, I’ll have to talk to whomever wrote this to fix it.”
- Deployment: Make it run in production at scale. Monitor performance. Prevention of “Hang on, this worked just fine yesterday, why is it totally broken today?”
- Evaluation and Improvement: Collect the right metrics. Adjust and adapt to changes. “Is this doing what we wanted it to do?”

The ML engineering roadmap shows the proven stages of work involved in creating successful ML solutions. While some projects may require additional steps (particularly if working with additional engineering teams), these are the fundamental steps that should be involved in any ML-based project.

1.1 Why ML engineering?

To put it most simply, ML is hard. It’s even harder to do correctly in the sense of serving relevant predictions, at scale, with reliable frequency. With so many specialties existing in the field (NLP, forecasting, deep learning, traditional linear and tree-based modeling, etc.), an enormous focus on active research and so many algorithms that have been built to solve specific problems, it’s remarkably challenging to learn more than a tiny fraction of all there is to learn. Coupling that complexity with the fact that one can develop a model on everything from a Raspberry Pi to an enormous NVIDIA GPU cluster, the very platform complexities that are out there present an entirely new set of information that no one person could have enough time in their life to learn.

There are also additional realms of competency that a data scientist is expected to be familiar with. They include midlevel data engineering skills (you have to get your data for data science somewhere, right?) as well as skills in software development, project management, visualization and presentation. The list continues to grow, making it rather daunting to gain all the necessary experience. So it shouldn’t be surprising, considering all of this, that attaining all the required skills to create production-grade ML solutions is beyond the reach of most individuals.

The aim of ML engineering is not to iterate through the lists of such skills and require that a data scientist master each of them. Instead, it’s to treat it as a collection of certain aspects of those skills, carefully crafted to be relevant to data scientists, all with the goal of increasing the chances of getting an ML project into production and making sure that it’s not a solution that needs constant maintenance and intervention to keep running.

An ML engineer, after all, doesn’t need to be able to create applications and software frameworks for generic algorithmic use cases. They’re also not likely to be writing their own large-scale streaming ingestion ETL pipelines. Nor do they need to be able to create detailed and animated front-end visualizations in JavaScript.

An ML engineer needs to know just enough software development skills to be able to write modular code and to implement unit tests. They don’t need to know about the intricacies of nonblocking asynchronous messaging brokering. They need just enough data engineering skills to build (and schedule the ETL for) feature data sets for their models, but they don’t need to construct a PB-scale streaming ingestion framework. They need just enough visualization skills to create plots and charts that communicate clearly what their research and models are doing, but they don’t have to develop dynamic web apps with complex UX components. They also need just enough project management experience to know how to properly define, scope and control a project to solve a problem, but they don’t need to go through a PMP certification.
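As a rough illustration of what “just enough” software development skill can look like, here is a minimal sketch of one small, modular feature-preparation function with unit tests around it. The function name `scale_to_unit_range` and the choice of min-max scaling are hypothetical, picked only for the example; they are not taken from the eBook.

```python
import unittest


def scale_to_unit_range(values):
    """Rescale a list of numbers to the range [0, 1] (min-max scaling)."""
    if not values:
        return []
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature has no spread; map every value to 0.0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    span = highest - lowest
    return [(value - lowest) / span for value in values]


class ScaleToUnitRangeTest(unittest.TestCase):
    """Each test pins down one behavior of the function."""

    def test_endpoints_map_to_zero_and_one(self):
        self.assertEqual(scale_to_unit_range([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_does_not_divide_by_zero(self):
        self.assertEqual(scale_to_unit_range([3, 3, 3]), [0.0, 0.0, 0.0])

    def test_empty_input_returns_empty_list(self):
        self.assertEqual(scale_to_unit_range([]), [])
```

Saved as a module, the tests can be run with `python -m unittest`. The point is not the scaling itself but that the logic lives in a small function that can be checked in isolation before it is wired into a larger pipeline.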

Despite many companies going all in on ML, hiring massive teams of highly compensated data scientists and devoting huge amounts of financial and temporal resources, these projects end up failing at incredibly high rates. This eBook covers the six major causes of project failure and why they result in so many projects failing, being abandoned or taking longer than necessary to reach production. In each section, we will show the solutions to these common problems and explain the processes involved in reaching those solutions so you can significantly lower the chances of your project getting derailed.

Figure 1.2 Primary reasons for ML project failures
- Planning: 30%
- Scoping: 25%
- Technology: 15%
- Fragility: 15%
- Cost: 10%
- Hubris: 5%

Figure 1.2 shows some rough estimates of the primary reasons why projects fail. Most commonly, data science teams are either inexperienced with using a large-scale production-grade model to solve a particular need or simply fail to understand what the desired outcome from the business is.

These issues are not the result of malicious intent. Rather, they are due in large part to the fact that most ML projects are incredibly challenging and complex, and are composed of algorithmic software tooling that is hard to explain to a layperson (hence the breakdowns in communication with business units that most projects endure). With such complex solutions in play, so many moving parts and a world of corporations trying to win in this new data-focused arms race and profit from ML as quickly as possible, it’s no wonder that the perilous journey of taking a solution to a point of stability in production fails so frequently.

This book is meant to show how these elements pose a risk to projects and to teach the tools that help minimize the risk of each. By focusing on each of these areas in a conscientious and deliberate manner, many of these risks can be significantly mitigated, if not eliminated entirely.

In Figure 1.3, you’ll see a representation of the path that all of us are moving on when we employ ML to solve a problem. From the outset of a project to its planned successful, long-running and maintainable state, the journey is fraught with detours that can spell the termination of our hard work. However, by focusing on building up our knowledge, skills and utilization of processes and tooling, we can generally avoid these six major problematic areas or, at the very least, address them in a way that won’t cause a complete failure of a project.

ML engineering is designed to address each of the primary failure modes shown in Figure 1.3. Eliminating the chances of failure is at the heart of this methodology. This is done through processes that lead to better decisions, ease communication with internal customers, eliminate rework during the experimentation and development phases, create code bases that can be easily maintained, and bring a best-practices approach to any project work that is heavily influenced by data science work. Just as software engineers decades ago refined their processes from large-scale waterfall implementations to a more flexible and productive agile process, ML engineering seeks to define a new set of practices and tools that will optimize the wholly unique realm of software development for data scientists.

Figure 1.3 Unfortunate detours of ML project work on the road to production
Path: ML project start, planning problems, scoping problems, experimentation issues, development issues, deployment issues, evaluation issues, successful long-running ML solution.
Detours shown along the path: solution doesn’t solve the problem; solution is too complex or expensive; solution takes too long to develop; can’t explain how it works; problem complexity underestimated; insufficient time for skills acquisition; too many approaches tested for too long; unreproducible results; over-engineered complexity; unstable, fragile or non-performant code; late-stage implementation change; scalability (or time) problems; high cost vs. value of solution; failure to meet SLA; drift causing instability; inadequate architecture; can’t explain solution value.

The branching paths of failure in the vast majority of ML projects. Nearly all ML solutions that don’t focus on these six core areas have a much higher chance of being abandoned either before production or shortly after running in production.

1.2 The core tenets of ML engineering

Now that you have a general idea of what defines ML engineering, we can focus on the key elements:

Planning

Neglecting to plan out projects thoroughly is the biggest cause of their failure by far, and it’s one of the most demoralizing ways for them to be canceled. Imagine for a moment that you’re the first data scientist hired by your company. In your first week, an executive from marketing approaches you, explaining (in their terms) a serious business issue that they are facing. They need to figure out an efficient means of communicating to customers through email about upcoming sales they might be interested in. With very little additional detail provided to you, the executive merely says, “I want to see the click and open rates go up.”

If this were the only information supplied and if members of the marketing team answered your repeated queries by simply stating the same end goal of increasing the clicking and opening rates, there would seem to be a limitless number of avenues to pursue. Left to your own devices, do you:
- Focus on content recommendation and craft custom emails for each user?
- Provide predictions with an NLP-backed system that will craft relevant subject lines for each user?
- Attempt to predict a list of products most relevant to the customer base to put on sale each day?

With so many options of varying complexity and approaches, and with very little guidance, it is highly unlikely that the resulting solution will align with the expectations of the executive.

If a proper planning discussion had taken place, the true expectation might be revealed: a prediction for each user about when they would be most inclined to read emails. The executive simply wants to know when someone is most likely to not be at work, commuting or sleeping so that they can send batches of emails throughout the day to different cohorts of customers.

The sad reality is that many ML projects start off this way. There is frequently very little communication with regard to project initiation, and the general expectation is that “the data science team will figure it out.” However, without the proper guidance on what needs to be built, how it needs to function and what the end goal of the predictions is, the project is almost doomed to failure.

After all, what would have happened if an entire content recommendation system had been built for that use case, with months of development and effort put in, when a very simple analytics query based on IP geolocation was what was really needed? The project would not only be canceled, but there would likely be many questions from on high as to why this system was built and why development was so expensive.

If we were to look at a very simplified planning discussion at an initial phase, as shown in Figure 1.4, we can see how just a few careful questions and clear answers can give the one thing that every data scientist should be looking for in this situation: a quick win.

As Figure 1.4 shows, the problem at hand is not at all in the list of original assumptions that were made. There is no talk about the content of the emails, relevancy to the subject line or the items in the email. It’s a simple analytical query to figure out which time zone customers are in and to analyze historic openings in local times for each customer. By taking a few minutes to plan and understand the use case fully, weeks (if not months) of wasted effort, time and money can be saved.

By focusing on what will be built and why it needs to be built, both the data science team and the business are able to guide the discussion more fruitfully. Eschewing a conversation focused on how it will be built keeps the data science members of the group focused on the problem. Ignoring when it will be built helps the business keep their focus aligned on the needs of the project.

Avoiding any discussion of implementation details at this stage allows the data science team to focus on the problem, which is critical. Keeping the esoteric details of algorithms and solution design out of discussions with the larger team allows the business unit members to stay engaged. After all, they really don’t care how many eggs go into the mix, what color the eggs are or even what species laid the eggs. They just want to eat the cake when it’s done.

Figure 1.4 An effective high-level project planning session between the data scientist and the marketing executive (business sponsor)
- Data scientist: “What is the project?” Executive: “We need to increase our opening rates of our marketing emails to drive more people to the site.” Data scientist’s note: OK. Maybe they want more relevant emails? Better subject lines? Maybe custom recommendations?
- Data scientist: “How is it done now, if at all?” Executive: “We send emails every day at 8 AM and 3 PM local time for us.” Note: Seems like they’re more concerned about the time of sending than the content of the email. Perhaps a regression problem?
- Data scientist: “How do you need to use the predictions?” Executive: “We want to know when the best time to send our emails is for each user based on their local time zone and when they’ve opened emails in the past.” Note: Ah. It’s an optimization problem to figure out when to send an email. They’re not concerned with content recommendations.
- Data scientist: “What business need is this solving?” Executive: “It will hopefully drive more users to the site and increase sales.” Note: This seems like a stretch. There are probably going to be too many latent factors influencing this. Ask more questions.
- Data scientist: “What would make you consider this a success?” Executive: “If the opening rates and logins from the email link go up, we would consider it a success.” Note: Here’s the business metric that a solution will be measured against. Increase open rates.
- Data scientist: “When would you be ready to test this?” Executive: “As soon as possible, ideally with results by next quarter so we can know what to focus on next.” Note: Getting clarification on expectations and priorities can help to build trust.
- Data scientist: “Who from your team can I work with?” Executive: “I will make sure that our 3 subject matter experts are available to assist with the project.” Note: This is critical to project success. Get the SMEs on board early.

A simplified planning discussion to get to the root of what an internal customer (in this case, the marketing executive who wants high open rates on their emails) actually needs for a solution.
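The “simple analytical query” from the planning example (convert each historic email open to the customer’s local time, then find the hour with the most opens) could be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the event records, the customer IDs, the per-customer UTC offsets (standing in for whatever IP geolocation would provide) and the function name `best_local_send_hour` are hypothetical, and fixed offsets ignore daylight saving time.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical open events: (customer_id, time the email was opened, in UTC).
OPEN_EVENTS = [
    ("cust_1", datetime(2023, 3, 1, 23, 5, tzinfo=timezone.utc)),
    ("cust_1", datetime(2023, 3, 2, 23, 40, tzinfo=timezone.utc)),
    ("cust_1", datetime(2023, 3, 3, 12, 15, tzinfo=timezone.utc)),
    ("cust_2", datetime(2023, 3, 1, 6, 30, tzinfo=timezone.utc)),
    ("cust_2", datetime(2023, 3, 2, 6, 55, tzinfo=timezone.utc)),
]

# Hypothetical UTC offsets in hours, e.g. derived from IP geolocation.
CUSTOMER_UTC_OFFSET_HOURS = {"cust_1": -5, "cust_2": 9}


def best_local_send_hour(open_events, utc_offsets):
    """Return {customer_id: local hour (0-23) with the most historic opens}."""
    opens_by_hour = defaultdict(Counter)
    for customer_id, opened_at_utc in open_events:
        local_zone = timezone(timedelta(hours=utc_offsets[customer_id]))
        local_hour = opened_at_utc.astimezone(local_zone).hour
        opens_by_hour[customer_id][local_hour] += 1
    # Ties go to the earliest hour so the result is deterministic.
    return {
        customer_id: min(counts, key=lambda hour: (-counts[hour], hour))
        for customer_id, counts in opens_by_hour.items()
    }


best_hours = best_local_send_hour(OPEN_EVENTS, CUSTOMER_UTC_OFFSET_HOURS)
print(best_hours)
```

With this toy data, the first customer mostly opens email in the early evening local time and the second in mid-afternoon, which is the kind of answer the executive was after. No model training is involved, which is the point of the example: a grouped count can be the quick win.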

Scoping and research

The focus of scoping and research needs to be on the two biggest questions that internal customers (the business) have about the project:
- Is this going to solve my problem?
- How long is this going to take?

Let’s look at another potentially familiar scenario to discuss polar opposite ways that this stage of ML project development can go awry. For this example, there are two separate data science teams at a company (Team A in Figure 1.5 and Team B in Figure 1.6), pitted against each other to develop a solution to an escalating incidence of fraud occurring with the company’s billing system.

Team A’s research and scoping process is illustrated in Figure 1.5.

Team A is composed of mostly junior data scientists, all of whom entered the workforce without an extensive period in academia. Upon learning the details of the project and what is expected of them, they immediately go to blog posts. The team searches the internet for “detecting payment fraud” and “fraud algorithms,” finding hundreds of results from consultancy companies, a few extremely high-level blog posts from similar junior data scientists who have likely never put a model into production, and some open source (and very rudimentary) data examples.

Figure 1.5 Research and scoping comparison for a fraud detection problem: Team A, “Applied DS Engineers”
- Search the internet for ideas and examples of how to solve the problem (1 day). Note: The only problem here is the length of research. A day of searching is insufficient.
- Find blog post on fraud detection using XGBoost (same day). Note: Inadequate research. Should have read more in depth on the topic to see all of the hidden “gotchas” in this approach.
- “Should take about 2 weeks to build!” Note: Underestimating ML project complexity and delivery expectations is dangerous. You can always underpromise and overdeliver, but the inverse never works.
- Business response: “Both the false positive and false negative rates are atrocious. This model is useless.” Note: Inadequate research, a rushed implementation, and a failure to understand both the algorithm and the nuances of the problem result in failure.

Research and scoping of a fraud detection problem for a junior team of well-intentioned but inexperienced data scientists.

Team B’s research and scoping, shown in Figure 1.6, stands in contrast.

Team B is filled with a group of Ph.D. academic researchers. With their studious approach to research and the vetting of ideas, their first actions are to dig into published papers on the topic of fraud modeling. Spending several days reading through journals and papers, they are now armed with a large collection of theory encompassing some of the most cutting-edge research being done on detecting fraudulent activity.

If you were to ask either of these teams what the level of effort is to produce a solution, you would get wildly divergent answers. Team A would likely state that it would take about two weeks to build their XGBoost binary classification model (they mistakenly believe that they already have the code, after all, from the blog post that they found).

Team B would tell a vastly different tale. They’d estimate it would take several months to implement, train and evaluate the novel deep learning structure that they found in a highly regarded whitepaper whose proven accuracy for the research was significantly better than any already-implemented algorithm for this use case.

But with their approaches to scoping and research, these two teams, polar opposites, would both see their projects fail, although for two completely different reasons. Team A would have a project failure because the solution to the problem is significantly more complex than the example shown in the blog post they found (the class imbalance issue alone is too challenging a topic to document effectively in the short space of a blog post). Team B, even though their solution would likely be extremely accurate, would never be allocated resources to build such a risky solution as an initial fraud detection service at the company (although it would be a great candidate for a version 2.0 implementation).

Figure 1.6 Research and scoping comparison for a fraud detection problem: Team B, “Academic Researchers”
- Search IEEE, arXiv and trade journals for prior research on the topic (2 weeks). Note: Highly recommended tactic with a thorough research phase.
- Discover highly cited paper on using neural networks with a genetic algorithm for advanced fraud detection. Note: Might not be the most wise decision to focus on cutting-edge research for a business problem.
- “Well, there’s no package out there that offers this algorithm, so we’re going to have to implement the paper from scratch.” Note: Extremely risky, as it involves building and owning not only an ML solution, but an algorithm as well.
- Business response: “We simply do not have 4 months and the required budget for large multi-GPU VMs to make this work.” Note: With the right team, the solution likely would be fantastic. But the cost of novel implementation in the long run far outweighs the potential accuracy gains.

Research and scoping for an academia-focused group of researchers for the fraud detection problem.
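To give a feel for why the class imbalance issue mentioned for Team A is so easy to underestimate, here is a small sketch with made-up numbers. It assumes a hypothetical data set of 1,000 transactions of which 10 are fraudulent, and a “model” that simply labels everything as legitimate. The helper names `confusion_counts` and `summarize` are illustrative only.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean fraud labels."""
    tp = fp = tn = fn = 0
    for is_fraud, flagged in zip(actual, predicted):
        if is_fraud and flagged:
            tp += 1
        elif not is_fraud and flagged:
            fp += 1
        elif is_fraud and not flagged:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(actual, predicted):
    """Return accuracy plus the false positive and false negative rates."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Made-up data: 10 fraudulent transactions out of 1,000 (1% fraud).
actual_labels = [True] * 10 + [False] * 990

# A useless "model" that never flags anything as fraud.
always_legitimate = [False] * 1000

naive_summary = summarize(actual_labels, always_legitimate)
print(naive_summary)
```

On this toy data the do-nothing model is right about 99% of the time while missing every fraudulent transaction. A team that only checks accuracy on a blog-post-sized example would not see the problem until the business looks at the false negative rate.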

Project scoping for ML is incredibly challenging. Even for the most seasoned of ML veterans, making a conjecture about how long a project will take, which approach is going to be most successful and the amount of resources that will need to be involved is a futile and frustrating exercise. The risk associated with making erroneous claims is fairly high, but there are means of structuring proper scoping and solution research that can help minimize the chances of an estimation being wildly off.

Most companies have a mix of the types of people in the hyperbolic scenario previously mentioned. There are academics whose sole goal is to further the advancement of knowledge and research into algorithms, paving the way for future discoveries from within industry. There are also “applications of ML” engineers who just want to use ML as a tool to solve a business problem. It’s very important to embrace and balance both aspects of these philosophies toward ML work, strike a compromise during the research and scoping phase of a project, and know that the middle ground here is the best path to tread to ensure that a project actually makes it to production.

Experimentation

In the experimentation phase, the largest cause of project failure is either experimentation that takes too long (testing too many things or spending too long fine-tuning an approach) or an underdeveloped prototype that is so abysmally bad that the business decides to move on to something else.

An example in section 1.2.2 illustrates how these two approaches might play out at a company that is looking to build an image classifier for detecting products on retail store shelves. The experimentation paths that the two groups take (representing the extreme polar opposites of experimentation) are shown in Figures 1.7 and 1.8.

Figure 1.7 Experimentation for multi-class image classification project: Team A, “Cow(girls/boys)”
- Find blog that shows how to use a pre-trained CNN to classify dogs and cats. Note: Woefully inadequate research leading to a single approach to be tested.
- Take 2 classes of products, unlock the last 25% of the network for re-learning, execute training. Note: Testing on a cherry-picked subset of the data obscures the complexity of this implementation.
- Demo the results. Classification of the two classes is pretty good. Note: Inadequate testing on cherry-picked samples from the training data hides the flaws in this implementation.
- Development phase: Model is trained on full data set. Results indicate that the learned attributes are product color and pattern, rendering it useless. Note: Shortcuts during experimentation and a “rushed approach” simply hid the issues that are seen during full development.
- Rework from scratch or abandon. Note: Either result causes the business to lose confidence in the team.

A rushed experimentation phase by a team of inexperienced data scientists.

Team A in Figure 1.7 is an exceedingly hyperbolic caricature of an exceptionally inexperienced data science team, performing only the most cursory of research. Using the single example blog post that they found regarding image classification tasks, they copy the example code, use the exact pretrained TensorFlow-Keras model cited in the blog, retrain the model on only a few hundred images of just two of the products (out of many thousands), and demonstrate a fairly solid result in classification for the holdout images from these two classes.

[Figure 1.8 (partial): Experimentation for multi-class image classification project, Team B. Spend 2 weeks researching options and vetting them. Notes: “Good approach for a topic as complex as this.” “That’s going to take a long time.”]

But because they didn’t do thorough research, they were unable to understand the
