Bootstrapping Pay-As-You-Go Data Integration Systems PDF Free Download

1y ago

46 Views

1 Downloads

808.06 KB

14 Pages

Report/dmca

Download PDF

Transcription

Bootstrapping Pay-As-You-Go Data Integration Systems Anish Das Sarma†Xin Dong†Alon HalevyStanford UniversityCalifornia, USAAT&T Labs–ResearchNew Jersey, USAGoogle Inc.California, USAanish@cs.stanford.edulunadong@research.att.com halevy@google.comABSTRACT1. INTRODUCTIONData integration systems offer a uniform interface to a set of datasources. Despite recent progress, setting up and maintaining adata integration application still requires significant upfront effortof creating a mediated schema and semantic mappings from thedata sources to the mediated schema. Many application contextsinvolving multiple data sources (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services, motivating a pay-as-you-goapproach to integration. With that approach, a system starts withvery few (or inaccurate) semantic mappings and these mappingsare improved over time as deemed necessary.This paper describes the first completely self-configuring dataintegration system. The goal of our work is to investigate how advanced of a starting point we can provide a pay-as-you-go system.Our system is based on the new concept of a probabilistic mediated schema that is automatically created from the data sources.We automatically create probabilistic schema mappings betweenthe sources and the mediated schema. We describe experimentsin multiple domains, including 50-800 data sources, and show thatour system is able to produce high-quality answers with no humanintervention.Data integration systems offer a single-point interface to a setof data sources. A data integration application is typically built bycreating a mediated schema for the domain at hand, and creating semantic mappings between the schemas of the data sources and themediated schema. The user (or application) poses queries using theterminology of the mediated schema, and the query is reformulatedonto the sources using the semantic mappings.Despite recent progress in the field, setting up and maintaininga data integration application still requires significant upfront andongoing effort. Hence, reducing the effort required to set up a dataintegration application, often referred to as on-the-fly integration,has been a recurring challenge for the field. In fact, as pointed outin [12], many application contexts (e.g., the web, personal information management, enterprise intranets) do not require full integration in order to provide useful services. This observation ledto proposing a pay-as-you-go approach to integration, where thesystem starts with very few (or inaccurate) semantic mappings andthese mappings are improved over time as deemed necessary.This paper describes the first completely self-configuring dataintegration system. The goal of our work is to investigate how advanced of a starting point we can provide a pay-as-you-go system,and how well a completely automated system can perform. Weevaluate our system on several domains, each consisting of 50-800heterogeneous tables obtained from the Web. The key contributionof the paper is that we can obtain very good query precision andrecall compared to the alternatives of (1) treating all the sources astext, or (2) performing full manual integration.To completely automate data integration, we need to automatically create a mediated schema from the sources and automatically create semantic mappings between the sources and the mediated schema. Automatic creation of schema mappings has receivedconsiderable attention [5, 7, 8, 9, 13, 14, 18, 21, 23, 26, 28]. Recently [10] introduced the notion of probabilistic schema mappingswhich provides a foundation for answering queries in a data integration system with uncertainty about semi-automatically createdmappings. To complete the puzzle, we show how to automaticallycreate a mediated schema from a set of data sources.The specific contributions of the paper are the following. First,we show how to automatically create a mediated schema from a setof data sources. In doing so, we introduce the concept of a probabilistic mediated schema, which is a set of mediated schemas withprobabilities attached to each. We show that probabilistic mediatedschemas offer benefits in modeling uncertainty about the semanticsof attributes in the sources. We describe how to create a deterministic mediated schema from the probabilistic one, which is theschema exposed to the user.Our second contribution is an algorithm for creating probabilis-Categories and Subject DescriptorsH [Information Systems]General TermsAlgorithms, Design, ExperimentationKeywordsdata integration, pay-as-you-go, mediated schema, schema mapping This work was supported by a Microsoft Graduate Fellowship andby NSF under grants IIS-0415175, IIS-0324431 and IIS-0414762.†Work was partially done while these authors were at Google.Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.SIGMOD’08, June 9–12, 2008, Vancouver, BC, Canada.Copyright 2008 ACM 978-1-60558-102-6/08/06 . 5.00.

tic schema mappings from the data sources to the mediated schema.Since a mapping is constructed from a set of weighted attribute correspondences between a source schema and the mediated schema,and such weighted correspondences do not uniquely determine asemantic mapping [10], we construct a probabilistic mapping thatis consistent with the correspondences and obtains the maximal entropy.As our final contribution, we describe a set of experimental results establishing the efficacy of our algorithms. We compare theprecision and recall of our system with several alternative approachesincluding: (1) a perfect integration where the mediated schema andmappings are created manually, (2) a document-centric approachwhere we perform keyword search on the sources, and (3) posing queries directly on the sources without a mediated schema. Weshow that our automatic methods achieve F-measure of around 0.92compared to (1) and significantly outperform (2) and (3) and several variants of our algorithms. Hence, we believe that our approachcan substantially reduce the amount of time taken to create a dataintegration application.The paper is organized as follows. Section 2 gives an overview ofour approach. Section 3 formally introduces the notion of a probabilistic mediated schema and presents some basic results aboutthem. Sections 4–6 present algorithms for the various steps insetting up a data integration system: constructing the probabilistic mediated schema, generating probabilistic mappings for eachsource, and consolidating the schema and mappings, respectively.Section 7 provides an experimental evaluation of our system. Section 8 presents related work and we conclude in Section 9.2.OVERVIEWWe begin with an overview of our approach and point out thetechnical challenges we address in the paper. Creating a data integration application involves two activities that require significanthuman effort: creating the mediated schema and creating semantic mappings between the data sources and the mediated schema.Both activities require knowledge of the domain as well as an understanding of the queries that can be frequently asked.Our goal is to create a data integration application without anyhuman involvement. The resulting integration should give besteffort answers, and should let the administrator improve the systemin a pay-as-you-go fashion. To accomplish this goal, we need to automatically create a mediated schema and semantic mappings fromthe sources to that schema.To support best-effort answers and improvement over time, webuild our system on a probabilistic data model. Recent work has introduced probabilistic schema mappings [10], which enable a dataintegration system to model its uncertainty on which schema mapping is correct. While we define probabilistic schema mappingsformally in the next section, intuitively, a probabilistic schema mapping consists of a set of mappings with a probability attached toeach mapping. Previous research has also considered the problemof automatically creating (non-probabilistic) schema mappings.The mediated schema in a data integration application consists ofthe set of relations and attributes that we wish to expose to users ofthe system. The mediated schema need not include all the attributesthat appear in the sources, nor does it include only the intersectionof the attributes that appear in all of the sources.To build a mediated schema automatically, a natural strategy isto start from attributes in the source schemas, group those that havethe same meaning, and consider each result group as an attribute inthe mediated schema. Because of the heterogeneity of the sources,we are typically unsure of the semantics of the source attributes andin turn of the clustering results. Furthermore, since attributes canhave ambiguous meanings and some attributes can overlap in theirmeaning, this approach to creating a mediated schema results in asignificant amount of uncertainty.Our approach is based on constructing a probabilistic mediatedschema. The following example illustrates the advantages of aprobabilistic mediated schema in our setting.E XAMPLE 2.1. Consider two source schemas both describingpeople:S1(name, hPhone, hAddr, oPhone, oAddr)S2(name, phone, address)In S2, the attribute phone can either be a home phone numberor be an office phone number. Similarly, address can either be ahome address or be an office address.Suppose we cluster the attributes of S1 and S2. There are multiple ways to cluster the attributes and they correspond to differentmediated schemas. Below we list a few (in the mediated schemaswe abbreviate hPhone as hP, oPhone as oP, hAddr as hA, andoAddr as oA):M1({name}, {phone, hP, oP}, {address, hA, oA})M2({name}, {phone, hP}, {oP}, {address, oA}, {hA})M3({name}, {phone, hP}, {oP}, {address, hA}, {oA})M4({name}, {phone, oP}, {hP}, {address, oA}, {hA})M5({name}, {phone}, {hP}, {oP}, {address}, {hA}, {oA})None of the listed mediated schemas is perfect. Schema M1groups multiple attributes from S1. M2 seems inconsistent becausephone is grouped with hPhone while address is grouped withoAddress. Schemas M3 , M4 and M5 are partially correct butnone of them captures the fact that phone and address can beeither home phone and home address, or office phone and officeaddress.Even if we introduce probabilistic schema mappings, none of thelisted mediated schemas will return ideal answers. For example,using M1 prohibits returning correct answers for queries that contain both hPhone and oPhone because they are taken to be thesame attribute. As another example, consider a query that containsphone and address. Using M3 or M4 as the mediated schemawill unnecessarily favor home address and phone over office address and phone or vice versa. A system with M2 will incorrectlyfavor answers that return a person’s home address together withoffice phone number. A system with M5 will also return a person’shome address together with office phone, and does not distinguishsuch answers from answers with correct correlations.A probabilistic mediated schema will avoid this problem. Consider a probabilistic mediated schema M that includes M3 andM4 , each with probability .5. For each of them and each sourceschema, we generate a probabilistic mapping (Section 5). For example, the set of probabilistic mappings pM for S1 is shown inFigure 1(a) and (b).Now consider an instance of S1 with a tuple(’Alice’, ’123-4567’, ’123, A Ave.’,’765-4321’, ’456, B Ave.’)and a querySELECT name, phone, addressFROM PeopleThe answer generated by our system with respect to M and pMis shown in Figure 1(c). (As we describe in detail in Section 3, we

Possible Mapping{(name, name), (hP, hPP), (oP, oP),(hA, hAA), (oA, oA)}{(name, name), (hP, hPP), (oP, oP),(oA, hAA), (hA, oA)}{(name, name), (oP, hPP), (hP, oP),(hA, hAA), (oA, oA)}{(name, name), (oP, hPP), (hP, oP),(oA, hAA), (hA, oA)}(a)Possible Mapping{(name, name), (oP, oPP), (hP, hP),(oA, oAA), (hA, hA)}{(name, name), (oP, oPP), (hP, hP),(hA, oAA), (oA, hA)}{(name, name), (hP, oPP), (oP, hP),(oA, oAA), (hA, hA)}{(name, name), (hP, oPP), (oP, hP),(hA, oAA), (oA, hA)}(b)Answer(’Alice’, ’123-4567’, ’123, A Ave.’)(’Alice’, ’765-4321’, ’456, B Ave.’)(’Alice’, ’765-4321’, ’123, A Ave.’)(’Alice’, ’123-4567’, ’456, B Ave.’)(c)ProbabilityUser Interface0.640.16Med-Schema & 16Figure 1: The motivating example: (a) p-mapping for S1 andM3 , (b) p-mapping for S1 and M4 , and (c) query answers w.r.t.M and pM. Here we denote {phone, hP} by hPP, {phone,oP} by oPP, {address, hA} by hAA, and {address, oA} byoAA.allow users to compose queries using any attribute in the source.)Compared with using one of M2 to M5 as a mediated schema, ourmethod generates better query results in that (1) it treats answerswith home address and home phone and answers with office address and office phone equally, and (2) it favors answers with thecorrect correlation between address and phone number.2Building on the concept of a probabilistic mediated schema, ourapproach consists of three steps:Construct a probabilistic mediated schema: We begin by constructing a set of schemas with a probability associated with eachone (Section 4).Find best probabilistic schema mappings: Given the probabilistic mediated schema, we need to construct the appropriate semanticmappings (Section 5). The key challenge in this step is that the toolsfor automatic schema mapping typically produce weighted correspondences between pairs of attributes (one from each schema).But such correspondences neither uniquely determine a specificschema mapping, nor uniquely determine a distribution of possible schema mappings. Therefore, we need to choose one distribution that seems to best capture the automatically generated attributecorrespondences.Create a single mediated schema to expose to the user: In thisstep we create a single mediated schema for the user and createsemantic mappings to it by adjusting the mappings created in theprevious step (Section 6). The consolidated mediated schema issuch that it returns the same answers as we would have obtainedover the probabilistic mediated schema.S2S3SnS4Data SourcesFigure 2: Architecture of our automatic-setup data integrationsystem.This step is not strictly necessary. For example, in some situations we may prefer to present the user with the set of mediatedschemas and have her choose one that best suits the application’sneeds. We also show that under certain conditions, a probabilisticmediated schema actually adds expressive power to the system.Figure 2 depicts the architecture of our system. At set-up time,we start with attribute matching, based on which we generate theprobabilistic mediated schema and mappings. We then consolidatethem and generate the final mediated schema and mappings. Atquery-answering time, for each data source we rewrite a queryaccording to the mappings and answer the rewritten queries onthe source data. We then combine the results from different datasources by taking the disjunction of the probabilities of each answer tuple; that is, if answer t has probability pi , i [1, n], for thei-th data source, the final probability of t is 1 Πni 1 (1 pi ). Herewe assume that the different data sources are independent. Dealing with data sources where some may be derived from others isbeyond the scope of this paper.3. PROBABILISTIC MEDIATED SCHEMASIn this section we formally define probabilistic mediated schemasand the semantics of queries posed over them. We also show precisely the relationship between probabilistic mediated schemas anddeterministic (i.e., non-probabilistic) mediated schemas.In our discussion, we consider a set of source schemas {S1 , . . . ,Sn } that are assumed to be roughly from the same domain. Weconsider the case where each schema contains a single table with aset of attributes. We denote the attributes in schema Si , i [1, n],by attr(Si ), and the set of all source attributes as A. That is,A attr(S1 ) · · · attr(Sn ). We focus on this case because it already exposes many interesting problems and is a common case inpractice (e.g., integration on the web); we describe the challengesin integrating multiple-table sources in future work (Section 9).We begin with deterministic mediated schemas. We denote a mediated schema for a set of sources {S1 , . . . , Sn } by M {A1 , . . . ,

Am }, where each of the Ai ’s is called a mediated attribute. Themediated attributes are sets of attributes from the sources, i.e., Ai A; for each i, j [1, m], i 6 j Ai Aj .Note that whereas in a traditional mediated schema an attributehas a name, we do not deal with naming of an attribute in ourmediated schema and allow users to use any source attribute intheir queries. (In practice, we can use the most frequent sourceattribute to represent a mediated attribute when exposing the mediated schema to users.) If a query contains an attribute a Ai , i [1, m], then when answering the query we replace a everywherewith Ai .A probabilistic mediated schema consists of a set of mediatedschemas, each with a probability indicating the likelihood that theschema correctly describes the domain of the sources. We formallydefine probabilistic mediated schemas as follows.D EFINITION 3.1 (P ROBABILISTIC M EDIATED S CHEMA ). Let{S1 , . . . , Sn } be a set of schemas. A probabilistic mediated schema(p-med-schema) for {S1 , . . . , Sn } is a setM {(M1 , P r(M1 )), . . . , (Ml , P r(Ml ))}where for each i [1, l], Mi is a mediated schema for S1 , . . . , Sn ,and for each i, j [1, l], i 6 j, Mi and Mj correspond todifferent clusterings of the source attributes; P r(Mi ) (0, 1], and Σli 1 P r(Mi ) 1.2Probabilistic schema mappings: Before we can define the semantics of answers posed over mediated schemas, we review thedefinition of probabilistic schema mappings, originally introducedin [10]. In this paper we mostly consider one-to-one schema mappings. Given a mediated schema M and a data source S, a schemamapping consists of a set of attribute correspondences, where eachcorrespondence matches a source attribute in S to an attribute inthe mediated schema M . The mapping is one-to-one if each of theattributes of the source or the mediated schema is involved in atmost one attribute correspondence.A probabilistic schema mapping describes a probabilistic distribution of possible mappings between a source and a mediatedschema. Formally, they are defined as follows:D EFINITION 3.2 (P ROBABILISTIC M APPING ). Let S be asource schema and M be a mediated schema. A probabilistic schemamapping (p-mapping) between S and M is a setpM {(m1 , P r(m1 )), . . . , (ml , P r(ml ))}such that for each i [1, l], mi is a schema mapping between S andM , and for every i, j [1, l], i 6 j mi 6 mj ; P r(mi ) (0, 1], and Σli 1 P r(mi ) 1.2We focus on one-to-one mappings because they are common inpractice and it is more feasible to generate such mappings thanmore complex mappings. As we show later, our algorithm actually produces one-to-many schema mappings when it consolidatesa probabilistic mediated schema into a deterministic one. A one-tomany mapping maps a source attribute to a set (e.g., concatenation)of attributes in the mediated schema.Semantics of queries: We measure the quality of the p-med-schemaand the p-mappings we generate by the accuracy of query answering results. Our goal is to return all correct answers possibly withwrong answers, but rank correct answers higher. That is, we wantto obtain high precision, recall and high Top-k precision.However, before we can answer queries in this setting, we needto define the semantics of queries. We define the semantics of ap-med-schema by defining query answering with respect to a pmed-schema and a set of p-mappings. Our definition is the naturalextension of query answering with respect to p-mappings [10].We consider select-project-join (SPJ) queries, a core set of SQLqueries. Answering queries with respect to p-mappings returns aset of answer tuples, each with a probability indicating the likelihood that the tuple occurs as an answer. In this paper we consider by-table semantics, which assumes there is one single possible mapping that is correct and it applies to all tuples in the sourcetable. Given a query Q, we compute answers by first answering Qwith respect to each possible mapping, and then for each answertuple t summing up the probabilities of the mappings with respectto which t is generated.We now extend this notion for query answering that takes p-medschema into consideration. Intuitively, we compute query answersby first answering the query with respect to each possible mediatedschema, and then for each answer tuple taking the sum of its probabilities weighted by the probabilities of the mediated schemas.D EFINITION 3.3 (Q UERY A NSWER ). Let S be a source schemaand M {(M1 , P r(M1 )), . . . , (Ml , P r(Ml ))} be a p-med-schema.Let pM {pM (M1 ), . . . , pM (Ml )} be a set of p-mappingswhere pM (Mi ) is the p-mapping between S and Mi . Let D bean instance of S and Q be a query.Let t be a tuple. Let P r(t Mi ), i [1, l], be the probabilityof t in the answer of Q with respect to Mi and pM (Mi ). Letp Σli 1 P r(t Mi ) P r(Mi ). If p 0, then we say (t, p) is aby-table answer with respect to M and pM.We denote all by-table answers by QM,pM (D).2We say that query answers A1 and A2 are equal (denoted A1 A2 ) if A1 and A2 contain exactly the same set of tuples with thesame probability assignments.Expressive power: A natural question to ask at this point is whetherprobabilistic mediated schemas provide any added expressive powercompared to deterministic ones. Theorem 3.4 shows that if we consider one-to-many schema mappings, where one source attributecan be mapped to multiple mediated attributes, then any combination of a p-med-schema and p-mappings can be equivalently represented using a deterministic mediated schema with p-mappings,but may not be represented using a p-med-schema with deterministic schema mappings. Note that we can easily extend the definition of query answers to one-to-many mappings as one mediatedattribute can correspond to no more than one source attribute. (Tomaintain the flow of the paper, we provide only proof sketches forsome theorems in the body of the paper, and defer complete proofsto the appendix.)T HEOREM 3.4 (S UBSUMPTION ).1. Given a source schemaS, a p-med-schema M, and a set of p-mappings pM between S and possible mediated schemas in M, there exists a deterministic mediated schema T and a p-mappingpM between S and T , such that D, Q : QM,pM (D) QT,pM (D).2. There exists a source schema S, a mediated schema T , a pmapping pM between S and T , and an instance D of S, suchthat for any p-med-schema M and any set m of deterministicmappings between S and possible mediated schemas in M,there exists a query Q such that QM,m (D) 6 QT,pM (D).2

Proof sketch: To prove (1), we show that we can create a singlenew mediated schema T , and rewrite each original schema mapping in pM between S and a mediated schema in M to a corresponding schema mapping between S and T . For the second part,we give an example S, T , and a p-mapping between them such thatno p-med-schema with deterministic mappings can represent it. 2In contrast, Theorem 3.5 shows that if we restrict our attention toone-to-one mappings, then a probabilistic mediated schema doesadd expressive power.T HEOREM 3.5. There exists a source schema S, a p-med-schemaM, a set of one-to-one p-mappings pM between S and possible mediated schemas in M, and an instance D of S, such thatfor any deterministic mediated schema T and any one-to-one pmapping pM between S and T , there exists a query Q such that,QM,pM (D) 6 QT,pM (D).2Proof sketch: We prove the theorem by constructing a p-medschema M {M1 , M2 } and showing that for any single mediated schema T and any p-mapping pM , a query Q referring to anattribute that is clustered differently in M1 and M2 would missanswers from those generated with respect to one of M1 and M2when posed over T .2Constructing one-to-many p-mappings in practice is much harderthan constructing one-to-one p-mappings. And, when we are restricted to one-to-one p-mappings, p-med-schemas grant us moreexpressive power while keeping the process of mapping generationfeasible.4.MEDIATED SCHEMA GENERATIONThis section describes how we create the probabilistic mediatedschema. We begin by showing how to create a single mediatedschema, and then we extend the algorithm to create multiple mediated schemas with probabilities attached to each.4.1 Creating a single mediated schemaConsider a set of source table schemas S1 , . . . , Sn . We are interested in creating a mediated schema M which best represents thedomain the tables are about. Our strategy is to create M by clustering attributes in source tables. We want M to contain all “important” attributes from source tables, and we want to ensure thatsemantically similar attributes from different tables are combinedinto one cluster. For example, if two source tables have attributesphone-no and phone, we would like to put them in the samemediated attribute.Our mediated-schema generation algorithm assumes there is somepairwise attribute similarity measure, s. The similarity s(ai , aj )between two source attributes ai and aj depicts how closely thetwo attributes represent the same real-world concept. There hasbeen a significant amount of work in designing pairwise similarityfunctions [26]. Improving on these techniques is not the focus ofour work. Instead, our algorithm is designed so it can leverage anyexisting technique.We create a mediated schema in three steps. First, we removeinfrequent attributes from the set of all source attributes; that is, attribute names that do not appear in a large fraction of source tables.This step ensures that our mediated schema contains only information that is relevant and central to the domain. In the second stepwe construct a weighted graph whose nodes are the attributes thatsurvived the filter of the first step. An edge in the graph is labeledwith the pairwise similarity between the two nodes it connects. Weinclude an edge in the graph only if its weight is above a certainthreshold τ . Finally, we cluster the nodes in the resulting weighted0: Input: Source schemas S1 , . . . , Sn .Output: A set of possible mediated schemas.1: Compute A {a1 , . . . , am }, the set of all source attributes;2: for each (j [1, m]) {i [1,n] aj Si } Compute frequency f (aj ) ;n3: Set A {aj j [1, m], f (aj ) θ}; //θ is a threshold4: Construct a weighted graph G(V, E), where (1) V A, and(2) for each aj , ak A, s(aj , ak ) τ ǫ, there is an edge(aj , ak ) with weight s(aj , ak );5: Mark all edges with weight less than τ ǫ as uncertain;6: for each (uncertain edge e (a1 , a2 ) E)Remove e from E if (1) a1 and a2 are connected by apath with only certain edges, or (2) there exists a3 V , suchthat a2 and a3 are connected by a path with only certain edgesand there is an uncertain edge (a1 , a3 );7: for each (subset of uncertain edges)Omit the edges in the subset and compute a mediatedschema where each connected component in the graph corresponds to an attribute in the schema;8: return distinct mediated schemas.Algorithm 1: Generate all possible mediated schemas.graph to obtain the mediated schema. A cluster is defined to be aconnected component of the graph.4.2 Creating a p-med-schemaWe now show how to extend the algorithm we just described tocreate a probabilistic mediated schema M. Given source tablesS1 , . . . , Sn , we first construct the multiple schemas M1 , . . . , Mpin M, and then assign each of them a probability.We exploit two pieces of information available in the source tables: (1) pairwise similarity of source attributes; and (2) statistical co-occurrence properties of source attributes. Whereas the firstpiece of information tells us when two attributes are likely to besimilar, the second tells us when two attributes are likely to be different. Consider for example, source table schemasS1: (name,address,email-address)S2: (name,home-address)Pairwise string similarity would indicate that attribute address canbe similar to both email-address and home-address. However,since the first source table contains address and email-addresstogether, they cannot refer to the same concept. Hence, the firsttable suggests address is different from email-address, makingit more likely that address refers to home-address.Algorithm 1 describes how we create the multiple mediatedschemas in M from S1 , . . . , Sn and a pairwise similarity function s. Steps 1–3 of the algorithm find the attributes that occurfrequently in the sources. Steps 4 and 5 construct the graph ofthese high-frequency attributes. Unlike the graph constructed inSection 4.1, we allow an error ǫ on the threshold τ for edge weights.We thus have two kinds of edges: certain edges, having weight atleast τ ǫ, and uncertain edges, having weight between τ ǫ andτ ǫ.Steps 6-8 describe the process of obtaining multiple mediatedschemas. Specifically, a mediated schema in M is created for every subset of the uncertain edges. For every subset, we considerthe graph resulting from omitting that subset from the graph. Themediated schema includes a mediated attribute for each connectedcompone

Data integration systems offer a single-point interface toa set of data sources. A data integration application is typically built by creating a mediated schema for the domain at hand, and creating se-mantic mappings between the schemas of the data sources and the mediated schema. The user (or application) poses queries using the