Privacy-Preserving Machine Learning Algorithms for Big Data Systems


2015 IEEE 35th International Conference on Distributed Computing Systems

Kaihe Xu, Hao Yue, Linke Guo†, Yuanxiong Guo‡, Yuguang Fang
Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA
{xukaihe, hyue}@ufl.edu, fang@ece.ufl.edu
† Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902, USA
lguo@binghamton.edu
‡ School of Electrical and Computer Engineering, Oklahoma State University, Stillwater, OK 74078, USA
richard.guo@okstate.edu

Abstract—Machine learning has played an increasingly important role in big data systems due to its capability of efficiently discovering valuable knowledge and hidden information. Big data systems such as healthcare or financial systems often involve multiple organizations that may have different privacy policies and may not explicitly share their data publicly, even though joint data processing may be a must. Thus, how to share big data among distributed data processing entities while mitigating privacy concerns becomes a challenging problem. Traditional methods rely on cryptographic tools and/or randomization to preserve privacy. Unfortunately, this alone may be inadequate for emerging big data systems because those methods are mainly designed for traditional small-scale data sets. In this paper, we propose a novel framework to achieve privacy-preserving machine learning where the training data are distributed and each shared data portion is of large volume. Specifically, we utilize the data locality property of the Apache Hadoop architecture and apply only a limited number of cryptographic operations in the Reduce() procedures to achieve privacy preservation. We show that the proposed scheme is secure in the semi-honest model and use extensive simulations to demonstrate its scalability and correctness.

This work was partially supported by the US National Science Foundation under grants CNS-1423165 and CNS-1343356.

I. INTRODUCTION

Machine learning has been widely used for scientific research and business purposes to extract useful information. Recent decades have witnessed the accelerating development of new machine learning methods such as clustering, classification, association rule mining, and sequence detection, which have been constantly improved to discover useful knowledge. However, existing algorithms were originally designed as centralized algorithms dealing with small-scale data sets. In today's big data scenarios, things can be very different. A common scenario nowadays is that a group of organizations intend to conduct data mining over their joint data. It could be several banks wishing to conduct credit risk analysis to identify non-profitable customers based on past transaction records, or several medical institutions trying to discover certain correlations between symptoms and diagnoses from patients' records. The former case is known as data mining over vertically partitioned data [26], where the entire table is partitioned by columns and each learner has the same number of records, but the records carry different features. The latter is known as data mining over horizontally partitioned data [17], where the entire table is partitioned by rows and each learner has a certain number of rows with the same set of features. In either case, the handled data is sensitive and usually very large in volume. In what follows, we will use data mining as the typical machine learning problem to articulate our proposed algorithms whenever needed.
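To make the vertically and horizontally partitioned cases above concrete, here is a small illustrative sketch (not from the paper; the matrix size and split points are hypothetical) showing how a joint table is divided by columns in one case and by rows in the other:

```python
import numpy as np

# Toy "joint table": 6 records (rows) with 4 features (columns).
# Hypothetical data, for illustration only.
X = np.arange(24).reshape(6, 4)

# Vertically partitioned data: split by columns, so each learner holds
# all 6 records but only a subset of the features.
vertical_parts = np.hsplit(X, [2])          # learner 1: features 0-1, learner 2: features 2-3

# Horizontally partitioned data: split by rows, so each learner holds
# the full feature set but only a subset of the records.
horizontal_parts = np.vsplit(X, [3])        # learner 1: records 0-2, learner 2: records 3-5

print([p.shape for p in vertical_parts])    # [(6, 2), (6, 2)]
print([p.shape for p in horizontal_parts])  # [(3, 4), (3, 4)]
```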

The problem of privacy-preserving data mining has been studied for decades, and numerous privacy-preserving data mining protocols have been proposed. Generally speaking, they can be classified into two categories:

- Perturbation and randomization-based approaches sanitize the samples prior to their release in order to mitigate the threat of inadvertent disclosure or theft [13][7]. Approaches in this category may provide only limited privacy preservation and usually trade off utility against privacy.
- Secure multiparty computation (SMC)-based approaches employ cryptographic tools and focus on protocol development to protect privacy among the involved learners [9][19]. Approaches in this category may provide a certain level of provable security under the standard semi-honest model, but may introduce a huge extra computation overhead [11]. This is especially true in data mining tasks that handle enormous amounts of data.

Existing approaches are designed mostly for a centralized solver dealing with small-scale data sets, and they may be inadequate for the emerging big data scenarios. Consider the case where a group of organizations are trying to do data mining over their joint data and each share of the data is very large in its volume. A desired scheme for this task should have two properties. First, it should be able to fit into existing Data Parallel Systems (e.g., Apache Hadoop and/or Twister) to handle the big data. Second, cryptographic operations should be minimized to guarantee training efficiency.

Data Parallel Systems are crucial to the success of big data. In Data Parallel Systems such as the Google File System (GFS) and MapReduce, clusters are built with commodity hardware and each node takes the roles of both computation and storage, which is known as data locality. Data locality is a significant advantage of data parallel systems over traditional High Performance Computing (HPC) systems because moving a computation result is much cheaper than moving data. We observe that another potentially big advantage of data locality is its ability to protect local private data. If machine learning could be done by distributed local learners independently and a consensus learning result could somehow be deduced from the local learning results, there would be no privacy concerns about the local training data, because they are controlled by the local learners and never leave the local domain. For example, in the MapReduce framework, local training can be regarded as the Map() procedures and the consensus-forming process as the Reduce() procedures. With data locality, the security bottleneck of the whole system lies in Reduce(), because Map() operates independently on local data. Notice that Reduce() is usually much simpler than Map(), so a few simple and efficient cryptographic operations are sufficient. In the rest of this paper we use Map() and Mapper interchangeably to denote the local training procedure, and Reduce() and Reducer interchangeably to denote the consensus procedure.

In this paper, we propose a framework based on MapReduce, and data locality is also used to conceal the local training data. Our system structure is illustrated in Fig. 1. Each learner is treated as a data node of HDFS (Hadoop Distributed File System). Each data node has a Mapper which performs machine learning over the local data to obtain a local training result. The local training result is then sent to the Reducer. A secure protocol is implemented on the Reducer such that the local training results can be summarized securely. Notice that it may require a few back-and-forth negotiation rounds for the local Mappers to reach a consensus. Hence, there should exist a feedback channel from the Reducer to the Mappers to feed back the current negotiated results. We focus on support vector machines and propose two schemes for horizontally and vertically partitioned training sets, respectively.

Fig. 1: System structure (each learner runs Map() on its local data; Reduce() combines the local results and feeds the negotiated result back to the Mappers, repeating until Reduce() converges).
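As a rough mock-up of the loop in Fig. 1 (plain Python rather than the paper's Hadoop implementation; the ridge-regression-style local fit, the simple averaging consensus, and all function names are assumptions made for illustration), each Mapper trains only on its local share, and only the resulting model vectors reach the Reducer, which feeds the consensus back until it stops changing:

```python
import numpy as np

def map_local_train(local_X, local_y, consensus_w, rho=1.0):
    """Mapper: fit a regularized least-squares model on local data only,
    pulled toward the current consensus (raw records never leave this node)."""
    d = local_X.shape[1]
    A = local_X.T @ local_X + rho * np.eye(d)
    b = local_X.T @ local_y + rho * consensus_w
    return np.linalg.solve(A, b)

def reduce_consensus(local_models):
    """Reducer: combine only the local model vectors (here, a simple average).
    In the paper, this step is additionally protected by a secure protocol."""
    return np.mean(local_models, axis=0)

# Hypothetical horizontally partitioned data: each learner holds some rows.
rng = np.random.default_rng(0)
shares = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(3)]

w = np.zeros(3)
for _ in range(20):                       # repeat until Reduce() converges
    local = [map_local_train(X, y, w) for X, y in shares]
    new_w = reduce_consensus(local)       # feedback channel: consensus returned to Mappers
    if np.linalg.norm(new_w - w) < 1e-6:
        break
    w = new_w
```
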
The rest of this paper is organized as follows. In Section II we discuss the existing approaches and compare them with our approach. In Section III, we briefly discuss support vector machines (SVMs), kernel-based nonlinear SVMs, and the alternating direction method of multipliers (ADMM). In Section IV, we present the designed schemes for the horizontally partitioned case and the vertically partitioned case, respectively. In Section V, security analysis is conducted. In Section VI, the performance of our scheme is tested against three popular data sets, and finally, in Section VII, we conclude this paper.

II. RELATED WORKS

Privacy-preserving data mining falls into two major categories: randomization-based approaches and SMC-based approaches. Randomization-based approaches protect the private data by adding noise to the original data. In [21] and [22], both the horizontally and vertically partitioned data scenarios are studied for SVMs. In their schemes, a randomly generated matrix known as a "random kernel" is directly multiplied with the original kernel matrix. It is argued that random kernels hide the private data while guaranteeing accurate learning results. This method works because certain properties are preserved by the matrix multiplication, which is also known as the restricted isometry property (RIP) in compressive sensing research. The drawback of the random kernel approaches is that the random kernels must be shared among the learners as a common key, and the method only works in the client-server scenario. In [7], Chaudhuri and Monteleoni show that one can either add noise to the final data mining results or directly work on a perturbed objective function to guarantee that the original training data is ε-differentially private. The differential privacy model limits the amount of information an adversary can gain about a particular private value by observing a function learned from a database containing that value, even if she knows every other value in the database. However, this model has a restrictive requirement: the objective function and loss function must be convex and differentiable everywhere with continuous derivatives. In [1], Agrawal and Srikant show that adding random noise to the training data still preserves some statistical properties; as a result, a Naive Bayes classifier can still be obtained from the sanitized data. In [14], Fong and Weber-Jahnke propose a method to generate fake training data sets from the real ones, while useful knowledge can still be discovered from the fake training sets. It is claimed that privacy and utility are achieved at the same time; however, the generation of the fake training sets incurs a huge extra cost.

The SMC-based approaches usually rely on some specific design for a particular learning method. For example, in [30], Yuan and Yu use homomorphic encryption to calculate the delta function in back-propagation training; in [28] and [31], Jiang and Zhan propose to use secure dot product protocols and secure sum protocols to obtain the kernel matrix in SVM learning; in [20], Lindell and Pinkas propose a secure protocol to compute the result of (v1 + v2) log(v1 + v2) without revealing v1 and v2 to each other, which is the key step of the ID3 algorithm in building a decision tree; in [18], Kantarcioglu and Clifton use commutative encryption and secure sum protocols to find association rules over horizontally partitioned data; and in [27], Vaidya and Clifton apply secure dot product protocols to find association rules over vertically partitioned data. Generally speaking, SMC-based approaches use secure protocols on a few key steps of a particular machine learning algorithm and send the SMC results to a centralized node to perform learning.
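Several of the approaches above build on a secure sum primitive. The following toy sketch illustrates the general idea with additive masking over a finite ring (an illustration only, not the specific protocols of [18], [28], or the scheme proposed in this paper):

```python
import secrets

# Toy additive-masking secure sum over Z_M: each learner splits its private
# value into random shares, so any single party only sees values that look
# uniformly random, yet all shares together add up to the true total.
M = 2**61 - 1  # public modulus, large enough to avoid wrap-around here

def share(value, n_parties):
    """Split a private value into n_parties additive shares mod M."""
    shares = [secrets.randbelow(M) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % M)
    return shares

private_values = [17, 42, 8]                 # hypothetical per-learner inputs
n = len(private_values)

# Learner i sends its j-th share to party j; party j adds up what it receives.
all_shares = [share(v, n) for v in private_values]
partial_sums = [sum(col) % M for col in zip(*all_shares)]

secure_total = sum(partial_sums) % M
assert secure_total == sum(private_values)   # 67, recovered without exposing any input
```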

III. PRELIMINARIES

A. Linear Support Vector Machines (SVMs)

SVMs were first introduced in [10]. Basically, SVMs work towards an optimal separating hyper-plane between two classes of data. The two classes are optimally separated when the margin between them is maximized, where the margin is defined as the distance from the closest point of one class to the boundary of the other class. The optimization problem for the SVM can be formulated as

  \min_{w,b} \; \frac{1}{2} w^T w + C \|\xi\|_1
  \text{s.t. } y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; \forall i \qquad (1)

where each x_i is one training sample of size k x 1 and y_i is the corresponding class label. \xi is an N x 1 slack-variable vector whose entries are all non-negative; the slack variables can be used to reject outliers. When the two classes are not linearly separable, we can use the parameter C to make a trade-off between the maximum margin and the tolerance of outliers. Usually, the support vector machine problem (1) is approached through its KKT conditions and its Wolfe dual, which is of the following form:

  \min_{\lambda} \; \frac{1}{2} \lambda^T H \lambda - \mathbf{1}^T \lambda
  \text{s.t. } 0 \le \lambda \le C\mathbf{1} \qquad (2)

B. Nonlinear SVM with Kernel Tricks

The general task of machine learning is to find and study the relationships between data points. Kernel methods provide a valuable way to assess similarity in a high-dimensional implicit feature space without ever computing the coordinates in that feature space, which greatly reduces the computation. Specifically, for x_i, x_j \in R^k, let \phi(\cdot): R^k \to R^p denote the mapping of x from the low-dimensional R^k to a high-dimensional Reproducing Kernel Hilbert Space (RKHS) R^p. The explicit form of a kernel function K: R^k \times R^k \to R is defined as

  K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle \qquad (3)

The three most popular kernels, i.e., the polynomial, radial basis function, and sigmoid kernels, are listed as follows:

  1) Polynomial kernel: K(x_i, x_j) = (a \langle x_i, x_j \rangle + b)^d
  2) Radial basis function kernel: K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}
  3) Sigmoid kernel: K(x_i, x_j) = \tanh(\kappa \langle x_i, x_j \rangle + c)

A nonlinear SVM utilizes the kernel trick and writes its discriminant function as f(x) = w^T \phi(x) + b. However, w lives in the unknown high-dimensional space and is hard to find; for example, the RBF kernel leads to an infinite-dimensional w. In that sense, the primal problem of the SVM is not available. However, if we follow the standard Wolfe-dual transformation to find the dual problem, we can use the kernel trick to get rid of the high-dimensional vectors. As shown in [5], nonlinear SVMs can still be solved from problem (2); the only difference is that H is obtained by H_{ij} = y_i K(x_i, x_j) y_j. Then, with the dual variables \lambda, the discriminant function is written as f(x) = \sum_{i \in SV} \lambda_i y_i K(x_i, x).

C. ADMM

The alternating direction method of multipliers (ADMM) is a variant
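To make the kernel construction in Section III-B concrete, the following numpy sketch (illustrative only: the sample points, the value of \gamma, and the dual variables \lambda are made up rather than obtained by solving problem (2)) builds an RBF kernel matrix, forms H_{ij} = y_i K(x_i, x_j) y_j, and evaluates the discriminant f(x) = \sum_{i \in SV} \lambda_i y_i K(x_i, x):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix K with K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

# Hypothetical training set: 4 samples in R^2 with labels in {-1, +1}.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

K = rbf_kernel(X, X)                 # kernel (Gram) matrix: eq. (3) evaluated pairwise
H = (y[:, None] * y[None, :]) * K    # H_ij = y_i K(x_i, x_j) y_j, as used in problem (2)

# Suppose the dual variables lambda were obtained by solving problem (2);
# here they are made-up values just to demonstrate the decision function.
lam = np.array([0.7, 0.7, 0.7, 0.7])

def f(x_new):
    """Discriminant f(x) = sum_i lambda_i * y_i * K(x_i, x)."""
    k = rbf_kernel(X, x_new[None, :])[:, 0]
    return float((lam * y * k).sum())

print(f(np.array([0.1, 0.1])))       # positive value -> predicted class +1
```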
