
Service-Oriented Architecture for High-Dimensional Private Data Mashup

Benjamin C. M. Fung, Member, IEEE, Thomas Trojer, Patrick C. K. Hung, Member, IEEE, Li Xiong, Member, IEEE, Khalil Al-Hussaeni, and Rachida Dssouli, Member, IEEE

Abstract—Mashup is a web technology that allows different service providers to flexibly integrate their expertise and to deliver highly customizable services to their customers. Data mashup is a special type of mashup application that aims at integrating data from multiple data providers depending on the user's request. However, integrating data from multiple sources brings about three challenges: (1) Simply joining multiple private data sets together would reveal sensitive information to the other data providers. (2) The integrated (mashup) data could potentially sharpen the identification of individuals and, therefore, reveal their person-specific sensitive information that was not available before the mashup. (3) The mashup data from multiple sources often contain many data attributes. When enforcing a traditional privacy model, such as K-anonymity, the high-dimensional data would suffer from the problem known as the curse of high dimensionality, resulting in useless data for further data analysis. In this paper, we study and resolve a privacy problem in a real-life mashup application for the online advertising industry in social networks, and propose a service-oriented architecture along with a privacy-preserving data mashup algorithm to address the aforementioned challenges. Experiments on real-life data suggest that our proposed architecture and algorithm are effective for simultaneously preserving both privacy and information utility on the mashup data. To the best of our knowledge, this is the first work that integrates high-dimensional data for mashup service.

Index Terms—Privacy protection, anonymity, data mashup, data integration, service-oriented architecture, high dimensionality

• B. C. M. Fung and K. Al-Hussaeni are with CIISE, Concordia University, Canada. E-mail: {fung, k_alhus}@ciise.concordia.ca
• T. Trojer is with the University of Innsbruck, Austria. E-mail: thomas.trojer@uibk.ac.at
• P. C. K. Hung is with the University of Ontario Institute of Technology, Canada. E-mail: Patrick.Hung@uoit.ca
• L. Xiong is with Emory University, USA. E-mail: lxiong@emory.edu
• R. Dssouli is with CIISE, Concordia University, Canada, and Research Cluster Hire, FIT, United Arab Emirates University. E-mail: dssouli@ciise.concordia.ca

1 INTRODUCTION

Mashup service is a web technology that combines information from multiple sources into a single web application. An example of a successful mashup application is the integration of real estate information into Google Maps [1], which allows users to browse on the map for properties that satisfy their specified requirements. In this paper, we focus on data mashup, a special type of mashup application that aims at integrating data from multiple data providers depending on the service request from a user (a data recipient). An information service request could be a general count statistic task or a sophisticated data mining task such as classification analysis. Upon receiving a service request, the data mashup web application (mashup coordinator) dynamically determines the data providers, collects information from them through their web service interfaces, and then integrates the collected information to fulfill the service request.
Further computation and visualization can be performed at the user's site or on the web application server. This is very different from a traditional web portal that simply divides a web page or a website into independent sections for displaying information from different sources.

A data mashup application can help ordinary users explore new knowledge; it could also be misused by adversaries to reveal sensitive information that was not available before the mashup. In this paper, we study the privacy threats caused by data mashup and propose a service-oriented architecture and a privacy-preserving data mashup algorithm to securely integrate person-specific sensitive data from different data providers, wherein the integrated data still retains the essential information for supporting general data exploration or a specific data mining task.

1.1 The Challenges

The research problem presented in this paper was discovered in a collaborative project with a social network company, which focuses on the gay and lesbian community in North America. The problem can be generalized as follows: social network companies A and B observe different sets of attributes about the same set of individuals (members) identified by the common User ID, e.g., TA(UID, Gender, Salary) and TB(UID, Job, Age). Every time a social network member visits another member's webpage, an advertisement is chosen to be displayed.

[Table 1: Raw data — contents not recoverable from this extraction]
[Table 2: Anonymous mashup data (L = 2, K = 2, C = 0.5) — contents not recoverable from this extraction]

Companies A and B want to implement a data mashup application that integrates their membership data, with the goal of improving their advertisement selection strategy. The analysis includes gathering general count statistics and building classification models [2]. In addition to companies A and B, other partnered advertising companies need access to the final mashup data. The solution presented in this paper is not limited to the social networks sector but is also applicable to other similar data mashup scenarios. The challenges of developing the data mashup application are summarized as follows:

Challenge#1: Privacy concerns. The members are willing to submit their personal data to a social network company because they consider the company and its developed system to be trustworthy. Yet, trust in one party does not necessarily extend to a third party. Many agencies and companies believe that privacy protection means simply removing explicit identifying information from the released data, such as name, social security number, address, and telephone number. However, many previous works [3], [4] show that removing explicit identifying information is insufficient. An individual can be re-identified by matching other attributes called quasi-identifiers (QID). The following example illustrates potential privacy threats.

Example 1: Consider the membership data in Table 1. Data Provider A and Data Provider B own data tables TA(UID, Class, Sensitive, Gender) and TB(UID, Class, Job, Age), respectively. Each row (record) represents a member's information. The two parties want to develop a data mashup service to integrate their membership data in order to perform classification analysis on the shared Class attribute with two class labels Y and N, representing whether or not the member has previously bought any items after following the advertisements on the social network websites. Let QID = {Job, Gender, Age}. After integrating the two tables (by matching the shared UID field), there are two types of privacy threats:

Record linkage: If a record in the table is so specific that not many members match it, releasing the data may lead to linking the member's record to his/her sensitive value. Let s1 be a sensitive value in Table 1. Suppose that the adversary knows the target member is a Mover and his age is 34. Then record #3, together with his sensitive value (s1 in this case), can be uniquely identified, since he is the only Mover who is 34 years old.

Attribute linkage: If a sensitive value occurs frequently along with some QID attributes, then the sensitive information can be inferred from such attributes, even though the exact record of the member cannot be identified. Suppose the adversary knows that the member is a male (M) of age 34. In this case, even though there are two such records (#1 and #3), the adversary can infer that the member has sensitive value s1 with 100% confidence, since both records contain s1.
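To make the two linkage attacks concrete, here is a minimal sketch (ours, not the paper's code) that reproduces the arithmetic of Example 1. Since Table 1's rows are not fully recoverable, the records below are illustrative stand-ins consistent with the example.

```python
# Minimal sketch of the two linkage attacks in Example 1.
# Records are illustrative stand-ins: (UID, Job, Gender, Age, Sensitive).
records = [
    (1, "Janitor",   "M", 34, "s1"),
    (2, "Lawyer",    "F", 58, "s2"),
    (3, "Mover",     "M", 34, "s1"),
    (4, "Carpenter", "M", 24, "s2"),
]

def matching_group(records, qid):
    """T[qid]: all records matching the adversary's prior knowledge qid,
    given as a mapping from tuple position to known value."""
    return [r for r in records if all(r[i] == v for i, v in qid.items())]

def confidence(records, qid, s):
    """P(s|qid): fraction of T[qid] carrying sensitive value s."""
    group = matching_group(records, qid)
    return sum(1 for r in group if r[4] == s) / len(group)

# Record linkage: qid = <Mover, 34> isolates record #3, i.e., |T[qid]| = 1.
print(len(matching_group(records, {1: "Mover", 3: 34})))  # 1
# Attribute linkage: qid = <M, 34> matches records #1 and #3, both with s1,
# so P(s1|qid) = 2/2 = 100%.
print(confidence(records, {2: "M", 3: 34}, "s1"))          # 1.0
```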
Many privacy models, such as K-anonymity [3], [4], ℓ-diversity [5], and confidence bounding [6], have been proposed to thwart privacy threats caused by record and attribute linkages in the context of relational databases owned by a single data provider. The traditional approach is to generalize the records into equivalence groups so that each group contains at least K records sharing the same qid value on the QID, and so that each group contains sensitive values that are diversified enough to disorient confident inferences. These privacy models can be achieved by generalizing domain values into higher-level and, therefore, more abstract concepts.

The data mashup problem further complicates the privacy issue because the data is owned by multiple parties. In addition to satisfying a given privacy requirement in the final mashup data, at no time during the process of generalization should any data provider learn more detailed information about another data provider than what appears in the final mashup table. In other words, the generalization process must not leak information more specific than the final mashup data. For example, if the final table discloses that a member is a Professional, then no other data provider should learn whether she is a Lawyer or an Accountant.

There are two obvious yet incorrect approaches. The first one is mashup-then-generalize: first integrate the two tables and then generalize the mashup table using some single-table anonymization method [7], [8], [9], [10]. This approach does not preserve privacy in the studied scenario because any data provider holding the mashup table will immediately know all private information of both data providers. The second approach is generalize-then-mashup: first generalize each table locally and then integrate the generalized tables. This approach fails to guarantee privacy for a quasi-identifier that spans multiple tables. In the above example, K-anonymity on (Gender, Job) cannot be achieved by K-anonymity on each of Gender and Job separately.

Challenge#2: High dimensionality. The mashup data from multiple data providers usually contain many attributes. Enforcing traditional privacy models on high-dimensional data would result in significant information loss. As the number of attributes increases, more generalization is required in order to achieve K-anonymity even if K is small, thereby rendering the data useless for further analysis. This challenge, known as the curse of high dimensionality on K-anonymity, is confirmed by [8], [11], [12]. To overcome this bottleneck, we exploit one of the limitations of the adversary: in real-life privacy attacks, it is very difficult for an adversary to acquire all the information of a target victim because it requires non-trivial effort to gather each piece. Thus, it is reasonable to assume that the adversary's prior knowledge is bounded by at most L values of the QID attributes. Based on this assumption, in this paper we extend the privacy model called LKC-privacy [13], originally proposed for a single-party scenario, to a multiparty data mashup scenario.

The general intuition of LKC-privacy is to ensure that every combination of values in QIDj ⊆ QID with maximum length L in the data table T is shared by at least K records, and that the confidence of inferring any sensitive values in S is not greater than C, where L, K, and C are thresholds and S is a set of sensitive values specified by the data provider. LKC-privacy limits the probability of a successful record linkage to 1/K and the probability of a successful attribute linkage to C, provided that the adversary's prior knowledge does not exceed L.

Table 2 shows an example of an anonymous table that satisfies (2, 2, 50%)-privacy by generalizing the values from Table 1 according to the taxonomies in Figure 1. (The dashed curve can be ignored for now.) Every possible value of QIDj with maximum length 2 in Table 2 (namely, QID1 = {Job, Gender}, QID2 = {Job, Age}, and QID3 = {Gender, Age}) is shared by at least 2 records, and the confidence of inferring the sensitive value s1 is not greater than 50%. In contrast, enforcing traditional 2-anonymity with respect to QID = {Gender, Job, Age} would require further generalization. For example, in order to make ⟨Professional, M, [30-60)⟩ satisfy traditional 2-anonymity, we may further generalize [1-30) and [30-60) to [1-60), resulting in a much higher loss of information utility. A taxonomy-based generalization of this kind is sketched below.
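As a concrete illustration of the generalization step that produces Table 2 from Table 1, here is a minimal sketch. The taxonomy labels (e.g., Non-Technical, Blue-Collar) and the age intervals are assumptions patterned after Figure 1, not an exact transcription of it.

```python
# Sketch of taxonomy-based generalization: categorical values climb to an
# ancestor on the current cut; numerical values map to intervals.
JOB_PARENT = {  # child -> parent; labels assumed, patterned after Figure 1
    "Janitor": "Non-Technical", "Mover": "Non-Technical",
    "Carpenter": "Technical", "Electrician": "Technical",
    "Accountant": "Professional", "Lawyer": "Professional",
    "Non-Technical": "Blue-Collar", "Technical": "Blue-Collar",
    "Blue-Collar": "ANY", "Professional": "ANY",
}

def generalize_job(value, cut):
    """Replace a Job value by its ancestor on the current taxonomy cut."""
    while value not in cut:
        value = JOB_PARENT[value]
    return value

def generalize_age(age, intervals=((1, 30), (30, 60), (60, 99))):
    """Map an Age to its interval on the dynamically grown Age taxonomy."""
    for lo, hi in intervals:
        if lo <= age < hi:
            return f"[{lo}-{hi})"

cut = {"Non-Technical", "Technical", "Professional"}  # a solution cut
print(generalize_job("Mover", cut))  # Non-Technical
print(generalize_age(34))            # [30-60)
```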
Challenge#3: Information requirements. The data recipients want to obtain general count statistics from the mashup membership information. They also want to use the mashup data as training data for building a classification model on the Class attribute, with the goal of predicting the behavior of future members. One frequently raised question is: to avoid privacy concerns, why doesn't the data provider release only the statistical data or a classifier to the data recipients? In many real-life scenarios, releasing data is preferable to releasing statistics for several reasons. First, the data providers may not have in-house experts to perform data mining; they simply want to share the data with their partners. Second, having access to the data, the data recipients are free to perform whatever data analysis they require. It is impractical to continuously request data providers to produce different types of statistical information or to fine-tune the data mining results for the data recipients' research purposes.

1.2 Contributions

This paper is the first work that addresses all the aforementioned challenges in the context of mashup service. The contributions are summarized as follows.

Contribution#1: We identify a new privacy problem through a collaboration with the social networks industry and generalize the industry's requirements to formulate the privacy-preserving high-dimensional data mashup problem (Section 3). The problem is to dynamically integrate data from different sources for joint data analysis in the presence of privacy concerns.

Contribution#2: We present a service-oriented architecture (Section 4) for privacy-preserving data mashup in order to securely integrate private data from multiple parties. The generalized data has to be as useful as possible for data analysis. Generally speaking, the privacy goal requires anonymizing identifying information that is specific enough to pinpoint individuals, whereas the data analysis goal requires extracting general trends and patterns. If generalization is carefully performed, it is possible to anonymize identifying information while preserving useful patterns.

Contribution#3: Data mashup often involves a large volume of data from multiple data sources; thus, scalability plays a key role in a data mashup system. After receiving a request from a data recipient, the system dynamically identifies the data providers and performs the data mashup. Experimental results (Section 5) on real-life data suggest that our method can effectively achieve a privacy requirement without compromising the information utility, and that the proposed architecture is scalable to large data sets.

[Fig. 1. Taxonomy trees and QIDs — figure not recoverable from this extraction]

2 RELATED WORK

Information integration has been an active area of database research [15], [16]. This literature typically assumes that all information in each database can be freely shared [17]. Secure multiparty computation (SMC) [18], [19], [20], on the other hand, allows sharing of the computed result (e.g., a classifier), but completely prohibits sharing of data. An example is the secure multiparty computation of classifiers [21], [22], [23]. In contrast, the privacy-preserving data mashup problem studied in this paper allows data providers to share data, not only the data mining results. In many applications, data sharing gives greater flexibility than result sharing because the data recipients can perform their required analysis and data exploration [8].

Samarati and Sweeney [24] propose the notion of K-anonymity. The Datafly system [4] and the µ-Argus system [25] use generalization to achieve K-anonymity. Preserving classification information in K-anonymous data is studied in [8], [10]. Mohammed et al. [13] extend this work to address the problem of high-dimensional anonymization for the healthcare sector using LKC-privacy. All these works consider a single data source; therefore, data mashup is not an issue. Joining all private databases from multiple sources and applying a single-table anonymization method fails to guarantee privacy if a QID spans multiple private tables. Recently, Mohammed et al. [14] propose an algorithm to address the horizontal integration problem, while our paper addresses the vertical integration problem.

Jiang and Clifton [26], [27] propose a cryptographic approach, and Mohammed et al. [28] propose a top-down specialization algorithm, to securely integrate two vertically partitioned distributed data tables into a K-anonymous table; the participation of malicious parties is further considered in [29]. Trojer et al. [30] present a service-oriented architecture for achieving K-anonymity in the privacy-preserving data mashup scenario. Our paper differs from these previous works [26], [27], [28], [29], [30] in two aspects. First, our LKC-privacy model provides a stronger privacy guarantee than K-anonymity because K-anonymity does not address the privacy attacks caused by attribute linkages, as discussed in Section 1. Second, our method can better preserve information utility in high-dimensional mashup data. High dimensionality is a critical obstacle to achieving effective data mashup because the integrated data from multiple parties usually contain many attributes. Enforcing traditional K-anonymity on high-dimensional data will result in significant information loss. Our privacy model resolves the problem of high dimensionality. This claim is also supported by our experimental results.

Yang et al. [23] develop a cryptographic approach to learn classification rules from a large number of data providers while sensitive attributes are protected. Their problem can be viewed as a horizontally partitioned data table in which each transaction is owned by a different data provider. The output of their method is a classifier, whereas the output of our method is anonymous mashup data that supports general data analysis or classification analysis. Jurczyk and Xiong [31], [32] present a privacy-preserving distributed data publishing approach for horizontally partitioned databases.
The mashup model studied in this paper can be viewed as a vertically partitioned data table, which is very different from the models studied in [23], [31], [32]. Jackson and Wang [33] present a secure communication mechanism that enables cross-domain network requests and client-side communication, with the goal of protecting the mashup controller from malicious code through web services. In contrast, this paper aims to preserve the privacy and information utility of the mashup data.

3 PROBLEM DEFINITION

We first define the LKC-privacy model [13] and the information utility measure on a single data table, then extend them for privacy-preserving high-dimensional data mashup from multiple parties.

3.1 Privacy Measure

Consider a relational data table T(UID, D1, ..., Dm, S1, ..., Se, Class) (e.g., Table 1). UID is an explicit identifier, such as User ID or SSN. In practice, it should be replaced by a pseudo-identifier, such as a record ID, before publication; we use UID only to ease the discussion. Each Di is either a categorical or numerical attribute. Each Sj is a categorical sensitive attribute. A record has the form ⟨v1, ..., vm, s1, ..., se, cls⟩, where vi is a domain value of Di, sj is a sensitive value of Sj, and cls is a class

value of Class. The data provider wants to protect against linking an individual to a record or to some sensitive value in T through some subset of attributes called a quasi-identifier QID ⊆ {D1, ..., Dm}.

One data recipient, who is an adversary, seeks to identify the record or sensitive values of some target victim V in T. As explained in Section 1, we assume that the adversary knows at most L values of the QID attributes of the victim. We use qid to denote such prior known values, where |qid| ≤ L. Based on the prior knowledge qid, the adversary can identify a group of records, denoted by T[qid], that contains qid; |T[qid]| denotes the number of records in T[qid]. The adversary can launch two types of privacy attacks based on T[qid]:

Record linkage: Given prior knowledge qid, T[qid] is a set of candidate records that contains the victim V's record. If the group size of T[qid], denoted by |T[qid]|, is small, then the adversary may identify V's record from T[qid] and, therefore, V's sensitive value. For example, if qid = ⟨Mover, 34⟩ in Table 1, then T[qid] = {UID#3} and |T[qid]| = 1. Thus, the adversary can easily infer that V has sensitive value s1.

Attribute linkage: Given prior knowledge qid, the adversary can identify T[qid] and infer that V has sensitive value s with confidence P(s|qid) = |T[qid ∧ s]| / |T[qid]|, where T[qid ∧ s] denotes the set of records containing both qid and s. P(s|qid) is the percentage of the records in T[qid] containing s. The privacy of V is at risk if P(s|qid) is high. For example, given qid = ⟨M, 34⟩ in Table 1, T[qid ∧ s1] = {UID#1, UID#3} and T[qid] = {UID#1, UID#3}; hence P(s1|qid) = 2/2 = 100%.

To thwart record and attribute linkages on any individual in the table T, we require every qid with maximum length L in the anonymous table to be shared by at least a certain number of records, and the percentage of any sensitive value in every group to be bounded. The privacy model, LKC-privacy [13], reflects this intuition.

Definition 3.1 (LKC-privacy): Let L be the maximum number of values of the adversary's prior knowledge. Let S ⊆ ∪Sj be a set of sensitive values. A data table T satisfies LKC-privacy if and only if for any qid with |qid| ≤ L,
1) |T[qid]| ≥ K, where K > 0 is an integer representing the anonymity threshold, and
2) P(s|qid) ≤ C for any s ∈ S, where 0 < C ≤ 1 is a real number representing the confidence threshold.

The data provider specifies the thresholds L, K, and C. The maximum length L reflects the assumption of the adversary's power. LKC-privacy guarantees that the probability of a successful record linkage is at most 1/K and the probability of a successful attribute linkage is at most C. Sometimes, we write C as a percentage. LKC-privacy has several nice properties that make it suitable for anonymizing high-dimensional data. First, it only requires a subset of QID attributes to be shared by at least K records. This is a major relaxation of traditional K-anonymity, based on a very reasonable assumption that the adversary has limited power. Second, LKC-privacy generalizes several traditional privacy models. K-anonymity [3], [4] is a special case of LKC-privacy with L = |QID| and C = 100%, where |QID| is the number of QID attributes in the data table. Confidence bounding [6] is also a special case of LKC-privacy with L = |QID| and K = 1. (α, k)-anonymity [34] is a special case of LKC-privacy with L = |QID|, K = k, and C = α. One instantiation of ℓ-diversity is also a special case of LKC-privacy with L = |QID|, K = 1, and C = 1/ℓ. Thus, the data provider can still achieve the traditional models.
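A brute-force check of Definition 3.1 can be sketched as follows; the record layout (dicts keyed by attribute name) is our assumption, and a practical implementation would avoid enumerating every group from scratch.

```python
from itertools import combinations

def satisfies_lkc(records, qid_attrs, sens_attr, S, L, K, C):
    """Check Definition 3.1: for every qid over at most L QID attributes,
    |T[qid]| >= K and P(s|qid) <= C for every sensitive value s in S."""
    for size in range(1, L + 1):
        for attrs in combinations(qid_attrs, size):
            groups = {}                       # qid value combination -> T[qid]
            for r in records:
                groups.setdefault(tuple(r[a] for a in attrs), []).append(r)
            for group in groups.values():
                if len(group) < K:            # record linkage threat
                    return False
                for s in S:                   # attribute linkage threat
                    if sum(r[sens_attr] == s for r in group) / len(group) > C:
                        return False
    return True

# e.g., satisfies_lkc(table2, ["Job", "Gender", "Age"], "Sensitive",
#                     {"s1"}, L=2, K=2, C=0.5) should hold for Table 2.
```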
3.2 Utility Measure

The measure of information utility varies depending on the user's specified information service request and the data analysis task to be performed on the mashup data. Based on the information requirements specified by the social network data providers, we define two utility measures. First, we aim at preserving the maximal information for classification analysis. Second, we aim at minimizing the overall data distortion when the data analysis task is unknown.

In this paper, the general idea in anonymizing a table is to perform a sequence of specializations starting from the topmost general state, in which each attribute has the topmost value of its taxonomy tree. We assume that a taxonomy tree is specified for each categorical attribute in QID. A leaf node represents a domain value and a parent node represents a less specific value. For numerical attributes in QID, taxonomy trees can be grown at runtime, where each node represents an interval, and each non-leaf node has two child nodes representing some optimal binary split of the parent interval [2]. Figure 1 shows a dynamically grown taxonomy tree for Age.

A specialization, written v → child(v), where child(v) denotes the set of child values of v, replaces the parent value v with the child value that generalizes the domain value in a record. A specialization is valid if the resulting table still satisfies the specified LKC-privacy requirement after the specialization; a specialization is performed only if it is valid. The specialization process can be viewed as pushing the "cut" of each taxonomy tree downwards. A cut of the taxonomy tree for an attribute Di, denoted by Cuti, contains exactly one value on each root-to-leaf path. Figure 1 shows a solution cut, indicated by the dashed curve, representing the LKC-privacy preserved Table 2. The specialization procedure starts from the topmost cut and iteratively pushes down the cut by specializing some value in the current cut, stopping just before the LKC-privacy requirement would be violated. In other words, the specialization process pushes the cut downwards until no valid specialization is possible.
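The greedy specialization loop just described can be sketched as follows. The helpers is_valid (the LKC-privacy test on the table that would result from specializing v) and score (Section 3.2) are assumed interfaces; the paper's actual algorithm additionally coordinates each step across data providers (Section 4).

```python
def top_down_specialize(cut, child, is_valid, score):
    """Push the cut downwards until no valid specialization remains.
    `cut` is a set of taxonomy values; `child(v)` lists v's child values."""
    while True:
        # Candidate specializations: non-leaf values on the cut that keep
        # the table LKC-private after being specialized.
        valid = [v for v in cut if child(v) and is_valid(v)]
        if not valid:
            return cut                 # no valid specialization is possible
        best = max(valid, key=score)   # greedy choice by Score(v)
        cut.remove(best)
        cut.update(child(best))        # replace v by child(v) on the cut
```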

Each specialization tends to increase information utility and decrease privacy because records become more distinguishable by specific values. We define two utility measures (scores) to evaluate the "goodness" of a specialization, depending on the information service request.

3.2.1 Utility Measure for Classification Analysis

For the requirement of classification analysis, we use information gain [2] to measure the goodness of a specialization. Let T[x] denote the set of records in table T generalized to the value x. Let |T[x ∧ cls]| denote the number of records in T[x] having the class cls. Note that |T[v]| = Σc |T[c]|, where c ∈ child(v). Our selection criterion, Score(v), is to favor the specialization v → child(v) that has the maximum information gain:

Score(v) = E(T[v]) − Σ_{c ∈ child(v)} (|T[c]| / |T[v]|) E(T[c]),   (1)

where E(T[x]) is the entropy [35] of T[x]:

E(T[x]) = − Σ_{cls} (|T[x ∧ cls]| / |T[x]|) log2 (|T[x ∧ cls]| / |T[x]|).   (2)

Intuitively, E(T[x]) measures the mix of classes for the records in T[x], and the information gain of v (that is, Score(v)) is the reduction of the mix achieved by specializing v into c ∈ child(v). For a numerical attribute, the specialization of an interval refers to the optimal binary split that maximizes information gain on the Class attribute; see [2] for details.

3.2.2 Utility Measure for General Data Analysis

Sometimes, the mashup data is shared without a specific task. In this case of general data analysis, we use discernibility cost [36] to measure the data distortion in the anonymous data. The discernibility cost charges a penalty to each record for being indistinguishable from other records. For each record in an equivalence group qid, the penalty is |T[qid]|; thus, the penalty on a group is |T[qid]|². To minimize the discernibility cost, we choose the specialization v → child(v) that maximizes the value of

Score(v) = Σ_{qid_v} |T[qid_v]|²,   (3)

where qid_v ranges over the qid groups containing the value v. (A sketch of both score functions is given at the end of this section.)

3.3 Privacy-Preserving High-Dimensional Data Mashup

Consider n data providers, where each data provider y owns a private table Ty(UID, QIDy, Sy, Class) over the same set of records. UID and Class are shared attributes among all data providers. QIDy is a set of quasi-identifying attributes and Sy is a set of sensitive values owned by provider y, with QIDy ∩ QIDz = ∅ and Sy ∩ Sz = ∅ for any 1 ≤ y, z ≤ n. These providers agree to release "minimal information" to form a mashup table T (by matching the UID) for conducting general data analysis or a joint classification analysis. The notion of minimal information is specified by an LKC-privacy requirement on the mashup table. A QIDj is local if all attributes in QIDj are owned by one provider; otherwise, it is global.

Definition 3.2 (Privacy-Preserving High-Dimensional Data Mashup): Given multiple private tables T1, ..., Tn, an LKC-privacy requirement, QID = ∪QIDy, S = ∪Sy, and a taxonomy tree for each categorical attribute in QID, the problem of privacy-preserving high-dimensional data mashup is to efficiently produce a generalized integrated (mashup) table T such that (1) T satisfies the LKC-privacy requirement, (2) T contains as much information as possible for general data analysis or classification analysis, and (3) each data provider learns nothing about the other providers' data more specific than what is in the final mashup table T. We assume that the data providers are semi-honest [18], [19], [20], meaning that they will follow the algorithm but may attempt to derive sensitive information from the received data.
We use an example to explain condition (3). Suppose a record in the final table T has values F and Professional on Gender and Job, respectively. Condition (3) is violated if Provider A can learn that Professional in this record comes from Lawyer. Our privacy model ensures privacy protection in the final mashup table as well as in any intermediate tables. To ease the explanation, we present our solution in a scenario of two parties (n = 2); a discussion of the extension to multiple parties is given in Section 4.3.
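To close this section, here is a minimal sketch of the two score functions of Section 3.2 (Eqs. (1)-(3)), computed over pre-materialized groups; the helper names are ours, not the paper's.

```python
from math import log2

def entropy(class_labels):                 # E(T[x]), Eq. (2)
    n = len(class_labels)
    return -sum(
        (class_labels.count(cls) / n) * log2(class_labels.count(cls) / n)
        for cls in set(class_labels))

def info_gain_score(parent_labels, child_label_lists):  # Score(v), Eq. (1)
    n = len(parent_labels)
    return entropy(parent_labels) - sum(
        len(c) / n * entropy(c) for c in child_label_lists)

def discernibility_score(group_sizes):     # Score(v), Eq. (3)
    return sum(size ** 2 for size in group_sizes)

# Specializing a value that splits a perfectly mixed class column into two
# pure children yields the maximum gain of 1 bit:
print(info_gain_score(["Y", "Y", "N", "N"], [["Y", "Y"], ["N", "N"]]))  # 1.0
```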
