Cloudera Fast Forward Labs

Transcription

Cloudera Fast Forward LabsFederated Learning

Cloudera Fast Forward LabsFederated Learning

Copyright 2018 by Cloudera Fast Forward Labshttps://www.fastforwardlabs.comNew York, NY

To the future—

Contents1 Introduction132 The Problem and the Solution2.1 User Data on Smartphones2.2 Predictive Maintenance2.3 Medical AI2.4 The Federated Learning Setting2.5 A Federated Learning Algorithm2.6 Applicability2.7 Systems Issues15151719212326283 Prototype3.1 Predictive Maintenance Primer3.2 Why Federated Predictive Maintenance3.3 The CMAPSS Dataset3.4 Modeling the CMAPSS Data3.5 The Federated CMAPSS Model3.6 Product: Turbofan Tycoon33333536384043

4 Landscape4.1 Use Cases4.2 Users4.3 Tools and Vendors494955585 Ethics5.1 Privacy5.2 Consent5.3 Environmental Impact636368706 Future6.1 Reducing Communication Costs6.2 Personalization6.3 Sci-fi Story: Merrily, Merrily, Merrily, Merrily717174767 Conclusion818 Appendix: federated.py83

chapter 1IntroductionTo train a machine learning model you generally need tomove all the data to a single machine or, failing that, to a cluster of machines in a data center. This can be difficult for tworeasons.First, there can be privacy barriers. A smartphone usermay not want to share their baby photos with an applicationdeveloper. A user of industrial equipment may not want toshare sensor data with the manufacturer or a competitor.And healthcare providers are not totally free to share their patients' data with drug companies.Second, there are practical engineering challenges. Ahuge amount of valuable training data is created on hardwareat the edges of slow and unreliable networks, such as smartphones, IoT devices, or equipment in far-flung industrial facilities such as mines and oil rigs. Communication with suchdevices can be slow and expensive.This research report and its associated prototype introducefederated learning, an algorithmic solution to these problems.In federated learning, a server coordinates a network ofnodes, each of which has training data that it cannot or willnot share directly. The nodes each train a model, and it is thatmodel which they share with the server. By not transferringIntroduction 13

SERVERNODENODENODENODEfigure 1.1 .In federated learning, a network of nodes shares models rather than training data with a server.the data itself, federated learning helps to ensure privacy andminimizes communication costs.In this report, we discuss use cases ranging from smartphones to web browsers to healthcare to corporate IT to videoanalytics. Our working prototype focuses in particular on industrial predictive maintenance with IoT data, where trainingdata is a sensitive asset.14 Introduction

chapter 2The Problem and the SolutionIn this chapter, we will describe three hypothetical scenarios that seem very different, but which share characteristicsthat make them a great fit for federated learning. We’ll thenintroduce a specific federated learning algorithm, and explain how it helps. Finally, we’ll address the practical systemsproblems that complicate its use.2.1 User Data on SmartphonesPearSub: Important!Hey! I’ve gotthe ticketsCan’t wait!figure 2.1 Smartphones generate a wealth of data includingpictures, text messages and emails.The Problem and the Solution 15

Pear can help youmake an albumof your child toshare easilywith grandmawrite betteremailsFwd: Important!write bettertextsA N A LY Z I NGI’ve gotHey! Ive getthe ticketsUse ‘i think’lessit just needs access to your data.figure 2.2 Smartphone manufacturers could use that data topersonalize a user’s experience—but they require access to do so.Our first scenario concerns Pear, a company that makes apopular smartphone. Users love it. They use it to take picturesof their kids, email their colleagues, and write quick text messages to friends. All of this activity on millions of Pear phonesgenerates data that, if it were combined, would allow Pear totrain models to make its phones even better.Pear’s phones could learn to spot particularly good babyphotos and proactively offer to share them with friends andfamily. They could make it easier to write emails that are morelikely to receive quick replies. And they could make composing text messages even quicker and easier by accurately suggesting the next phrase, whatever the language.The difficulty is that many users are not comfortablesharing the training data these examples would require(baby photos, work emails, personal text messages) with a16 The Problem and the Solution

multinational corporation. Even among those who are notsensitive to privacy concerns, some will still refuse to sharetheir data because they don’t want to waste their bandwidthuploading data that will primarily benefit a private company.And among those who do choose to share their data, that datais often protected by laws that place significant administrative burdens on companies that wish to use it.Pear understands these concerns, and has earned a reputation as a company that takes privacy more seriously than itscompetitors. How can Pear add new predictive features to itsphones while respecting its users' privacy?2.2 Predictive MaintenanceTurbineCorp sells turbines for installation in power stations. These machines are profitable to run but expensive tomaintain, and very expensive to repair. TurbineCorp wants todifferentiate itself from its competitors by offering customersaccess to a predictive model of turbine failure.This model would use readings from sensors installed oneach turbine as input, and return an estimate of its remaininguseful life. A good model would reduce the likelihood of anexpensive failure by prompting the owner to maintain a turbine before it fails. It would also help avoid the almost equallyexpensive mistake of maintaining a turbine too early and toooften.This model needs training data—but testing lots of turbines until they failed in order to acquire that data would bean expensive endeavor for TurbineCorp. It would be less costly for TurbineCorp if its customers were to send it such data.More importantly, the failures actual customers experienceThe Problem and the Solution 17

figure 2.3 Turbine sensor data could be used to train a predictivemaintenance model.will be more representative of real-world use than those TurbineCorp would see in factory experiments. In short, trainingdata acquired from customers would be both cheaper andbetter.But there are several problems. Some of their customersare reluctant to share details about turbine failures in their facilities. Furthermore, some operate in countries such as China, where power stations are considered strategic assets, andare therefore legally prevented from exporting sensor data.And, as a practical matter, the volume of data generated by thedozens of sensors on each turbine is enormous, which makesstreaming it back to TurbineCorp an engineering challenge.How can TurbineCorp build an accurate predictive maintenance model without direct access to the best and cheapest18 The Problem and the Solution

n NephrodynenANALYZING.figure 2.4 Wearable medical devices collect data that couldsave lives.training data? (This scenario is the focus of our prototype, discussed in 3 Prototype.)2.3 Medical AINephrodyne, a medical device company, wants to offera wearable device that detects kidney problems in users before they become acutely serious. The company knows fromlab trials that its prototype rule-based device performs fairly well, but it does not have the accuracy required for use bynonexperts outside a hospital. Nephrodyne is confident that amachine learning model would perform even better than therule-based model, and its business team has determined thatGermany would be the best place to launch this product.The problem is, as a private company based in the United States, Nephrodyne is having trouble getting access toThe Problem and the Solution 19

DATANODEDATANODE?MODELDATANODEfigure 2.5 Federated learning helps when the data cannotbe moved.the representative training data required to build the model.Data protection regulations in the European Union place significant regulatory burdens on companies working with personal data, especially those outside the EU. Likewise, HIPAAregulations in the US make it difficult to work with US patientdata (and in any case, that American patient data would differsystematically from that of the target customers in Germany).And finally, anywhere in the world, but especially in Germany,patients are typically unwilling to share their personal datawith a private medical device company. Patients do, however,share their data with their healthcare providers.Given these constraints, how can Nephrodyne work withmultiple healthcare providers to train the accurate model itneeds to build its new product?20 The Problem and the Solution

DATANODEDATANODE?MODELDATANODEfigure 2.6 Federated learning helps when the data on each nodeis different.2.4 The Federated Learning SettingThese three scenarios share two common characteristics.These characteristics comprise what machine learning experts call the setting of federated learning.First, the most representative (and cheapest) training datacannot be moved away from its source. This characteristic isthe most important for identifying a problem that might be agood fit for federated learning. The reasons for this constraintcan include privacy concerns, regulatory impediments, andpractical engineering challenges.Second, each source of potential training data is differentfrom every other. Each of these sources will show biases relative to the overall dataset. For example, Pear wants to predictwhich emails will receive replies, but perhaps one user almostalways gets a reply, and one user almost never does. NeitherThe Problem and the Solution 21

user is typical. Or perhaps two of TurbineCorp’s customersstress their turbines in different ways and observe differentfailure modes. And maybe one of Nephrodyne’s users simplydoesn’t have much data. The fact that any one source’s data isbiased and small relative to the total dataset means that it’sdifficult to build a good global model based on data from asingle source.Here we’ll introduce the term node, used in this report torefer to a source of training data. A node might be a physicaldevice, a person, a facility or other geographic location, a legal entity, or even a country. If each node will not or cannotshare its data directly, or if you are concerned that any onenode might produce a biased subset of data, then you have aproblem that is potentially a good fit for federated learning!# Distributed Machine LearningFederated learning is one example of distributedmachine learning, but there exists another variety thatis currently more common and more mature: distributedmachine learning in the datacenter.1 In this setting, mostof the practical constraints of federated learning go away.In particular, we are free to move training data betweennodes, and communication between them is relativelyfast and cheap.Algorithms for distributed machine learning in the1 One of the earliest uses of the term "federated learning" draws thisdistinction in its title: see Jakub Konečný, Brendan McMahan, andDaniel Ramage’s paper "Federated Optimization: Distributed Optimization Beyond the Datacenter" (https://arxiv.org/abs/1511.03575).22 The Problem and the Solution

absence of the constraints of federated learning differin various ways that trade off speed, complexity, andaccuracy. In most circumstances, it’s best to let the toolyou’re using decide these algorithmic details. The clustercomputing framework Apache Spark supports distributedmachine learning out of the box and works well for manyuse cases. Dask, a library for parallel computing in Python,also supports machine learning and may be a good choicefor custom use cases or experimentation. Deep learningpackages like TensorFlow and PyTorch provide distributed implementations and can additionally be layered onmanagement platforms like YARN and Kubernetes2 forrobustness and fault tolerance.2.5 A Federated Learning AlgorithmSo, how does federated learning work?The crucial insight is to realize that the nodes, which arethe sources of training data, are not only data storage devices.They are also computers capable of training a model themselves. The federated solution takes advantage of this by training a model on each node.The server first sends each node an instruction to train amodel of a particular type, such as a linear model, a supportvector machine (SVM), or, in the case of deep learning, a particular network architecture.On receiving this instruction, each node trains the modelon its subset of the training data. Full training of a model would2 E.g., TonY (https://github.com/linkedin/TonY) and Kubeflow(https://www.kubeflow.org/).The Problem and the Solution 23

LocalModel1. Nodes receive model from server and start training.SERVERLocal ModelLocal ModelLocal ModelNODENODENODE2. Nodes send partially trained models to server.SERVERFederated Model3. The server combines those models to make a federated TALocalModelNODE4. The federated model is sent to the nodes. Repeat as necessary.figure 2.7 A federated learning algorithm.24 The Problem and the Solution

require many iterations of an algorithm such as gradient descent, but in federated learning the nodes train their modelsfor only a few iterations. In that sense, each node’s model ispartially trained after following the server’s instruction.The nodes then send their partially trained models back tothe server. Crucially, they do not send their training data back.The server combines the partially trained models to forma federated model. One way to combine the models is to takethe average of each coefficient, weighting by the amount oftraining data available on the corresponding node. This iscalled federate

And healthcare providers are not totally free to share their pa-tients' data with drug companies. Second, there are practical engineering challenges. A huge amount of valuable training data is created on hardware at the edges of slow and unreliable networks, such as smart-phones, IoT devices, or equipment in far-"ung industrial fa-