Tracing Nested Data With Structural Provenance For Big Data Analytics

Transcription

Tracing nested data with structural provenance for big data analytics
Ralf Diestelkämper, Melanie Herschel
IPVS - University of Stuttgart, Stuttgart, Germany
ralf.diestelkaemper

…compared to solutions that are fully integrated into the big data analytics or data-intensive, scalable computing (DISC) system. We, therefore, present a DISC-system-integrated provenance solution for nested data that is as efficient and scalable as solutions of the first category while, at the same time, at least as accurate as solutions of the second category [9]. Our solution leverages our newly defined structural provenance to provide both features.

Structural provenance records identifiers for top-level data items only. For attributes and nested data items, it captures paths on a schema level. To provide accurate provenance when queried, it employs these identifiers and paths to trace back individual nested items at attribute level. Capturing paths instead of identifier annotations for nested data further allows us to distinguish between paths that are used for access (e.g., during filtering) or manipulation (e.g., during flattening). We can thereby differentiate contributing attributes, i.e., attributes needed to reproduce a result item, and influencing attributes that are accessed during data processing but not required to reproduce a result. This distinction, which is unique compared to existing data provenance models, qualifies structural provenance for use-cases beyond debugging, such as auditing or determining data-usage patterns for partitioning, data compression, and workload optimization.

Auditing. Auditing aims at identifying and analyzing data breaches. These breaches commonly stem from attacks by company insiders who extract sensitive data by querying data and leaking the query result. Auditing solutions are designed to identify both these insiders and the customers whose data are leaked [19]. To address the latter challenge, the solutions typically leverage some sort of data provenance.
It serves to identify those input tuples that are exposed in a leaked query result. However, since the European Union introduced the General Data Protection Regulation (GDPR) [26], European companies are required to identify not only the customers (tuples) whose data are leaked, but also which of their data are leaked (i.e., attributes such as name, address, or payment details). Structural provenance precisely provides the attributes and items in nested collections that contribute to a query result. Unlike existing data provenance solutions, it further reveals which attributes are not exposed in the result but have influenced it, creating awareness for reconstruction attacks.

Data-usage patterns. Data-usage patterns reveal frequently used subsets of the input data over a query workload. These patterns serve to optimize data layout and compression or to improve query performance [25]. State-of-the-art scalable provenance solutions for DISC systems can identify subsets of the input data that are frequently used. This knowledge allows for horizontal (or row-based) data partitioning and distribution. Structural provenance further provides all the information needed for vertical (or column-based) partitioning since it reveals which attributes and nested items are accessed or manipulated. It even provides insights on attribute combinations that are frequently used together for data layout optimizations.

ABSTRACT

Big data analytics systems such as Apache Spark natively support nested data formats since they offer operators to manipulate nested lists and complex types. Compared to flat data, nested data introduces further complexity and sources of error, e.g., when developing data processing pipelines, performing auditing tasks, or performance tuning. To ease such tasks, we propose a provenance-based solution tailored to nested data processing in big data analytics systems.
Unlike previous solutions, it combines (i) tracing provenance of nested data with (ii) efficient and scalable provenance processing, leveraging a newly proposed structural provenance that traces structural manipulations through data processing pipelines in addition to data. We provide a formal definition of structural provenance, as well as methods to efficiently capture and succinctly backtrace it. We implement them in our Pebble system in Apache Spark and validate its performance and usefulness on up to 500GB of real-world data.

1 MOTIVATION

Big data analytics systems such as Apache Spark or Flink are frequently the means of choice to build data processing pipelines that process large quantities of nested data. These pipelines transform nested lists and complex types stored in nested data formats like JSON, protocol buffers, or Parquet. Provenance solutions that capture meta-data about the data processing [14] have proven to be useful for analyzing the internals of data processing pipelines, e.g., for debugging purposes. These solutions typically have two phases: a provenance capture phase to collect the meta-data and a provenance query phase to analyze the meta-data. For big data analytics systems, we distinguish two categories of provenance solutions: (i) efficient and scalable solutions that track individual, flat data items (i.e., tuples) from the input to the output over each execution step [15-17, 22]; they capture so-called lineage or why-provenance [7]; and (ii) system prototypes that compute provenance polynomials of nested data [2, 28]; they capture how-provenance, which provides both the input items contributing to the result and the data combination process an item undergoes.

Solutions of the first category fail to track nested items accurately. Solutions of the second category do not efficiently scale to big data processing pipelines.
To capture the how-provenance, these solutions propagate the growing provenance polynomial through the entire pipeline or require annotation of each nested element, which imposes a very high and practically unacceptable overhead [16, 17]. Further, these solutions have to offload the provenance to external tools to query the captured provenance. Thereby, they miss potential performance and usability benefits.

© 2020 Copyright held by the owner/author(s). Published in Proceedings of the 23rd International Conference on Extending Database Technology (EDBT), March 30-April 2, 2020, ISBN 978-3-89318-083-7 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0. Series ISSN: 2367-2005. DOI: 10.5441/002/edbt.2020.23

Capturing provenance imposes runtime and space overhead during pipeline execution. The mentioned use-cases are performed infrequently. Thus, keeping the overhead low during pipeline execution is essential to ensure efficiency and scalability. During the provenance capture phase, a system can typically opt for computing and storing the provenance of all processed data (eager approach) or decide to capture it on demand when users query the provenance (lazy approach). Consequently, during the provenance query phase, retrieving the desired provenance is more or less time-consuming. We consider the provenance capture and provenance query phases holistically. To this end, we devise a meet-in-the-middle approach that eagerly collects the necessary "pebbles" (i.e., identifiers and paths on schema level) during pipeline execution to later reconstruct or backtrace attribute-level provenance of nested data at query time. Our evaluation shows that capturing structural provenance introduces overhead comparable to state-of-the-art lineage solutions in DISC systems [17], while providing attribute-level precision.

This paper also presents the first provenance solution for nested data that seamlessly integrates into a big data analytics system (Apache Spark in our implementation). Existing solutions [2, 28] require offloading captured provenance for querying to separate, non-distributed applications. This has three drawbacks: (i) it prevents adopting a holistic provenance capture and querying approach to keep capture and query overhead reasonable; (ii) it forces users to leave their familiar environment; and (iii) it prevents scalable provenance querying.

Contributions and structure. To summarize, this paper presents research on processing structural provenance in big data analytics systems to accurately trace nested data in an efficient, scalable, and integrated way. This approach enables novel use-cases that arise in the context of big data processing. After discussing a running example in Sec. 2 and related work (Sec. 3), this paper covers the following contributions:

- Structural provenance (Sec. 4). We present a novel provenance model for nested data that tracks structural manipulations in addition to data dependencies, and distinguishes between data access and manipulation to support use-cases beyond debugging.
- Lightweight structural provenance capture (Sec. 5). We discuss how to capture structural provenance in big data analytics programs composed of filter, select, map, join, union, flatten, grouping, nesting, and aggregation operations. The capture is devised to incur a minimal overhead compared to the capture of flat provenance in DISC systems.
- Backtracing for provenance query processing (Sec. 6). We formalize the backtracing algorithm used at provenance query time. As input, users provide a tree-pattern that, upon its scalable execution, identifies data items for which provenance is requested. The backtracing algorithm computes provenance for these items based on the previously captured information.
- Implementation and evaluation (Sec. 7). We implement our contributions in Pebble [9], our system for integrated provenance capturing and querying within Apache Spark. We conduct a quantitative evaluation of runtime and space overhead incurred by our solution on two large real-world data sets, validating the scalability of our solution. In comparison to the state-of-the-art lineage solution Titian [17], Pebble has comparable runtime and space overhead. However, as our workload shows, Pebble provides sufficient insight to support the above use-cases, unlike other solutions.

2 RUNNING EXAMPLE

To distinguish our research from related work and for illustration, we use a running example based on Twitter data. Among its roughly 1000 attributes, we focus on the tweeted text, the user tweeting, the user mentions in the tweet, and the retweet cnt. The input data is nested as shown in Tab. 1 (ignore colors and number annotations for now).

[Table 1: Example input data — nested tweet records with attributes text, user (id str, name), user mentions (a nested list of id str, name pairs), and retweet cnt; superscript numbers 1-35 denote the value-level annotations referenced in the discussion of Lipstick below.]

This sample data is processed in the big data processing pipeline shown in Fig. 1. It results in a list of distinct users associated with tweets that they authored or were mentioned in, as shown in Tab. 2. The upper branch of the pipeline describes how authoring users become part of the result. Their tweets require a retweet cnt of 0 before the pipeline reduces them to the text, id str, and name. The lower branch processes tweets mentioning users. First, it flattens the user mentions attribute to select the tweeted text, id str, and name of each mentioned user. Then the pipeline unifies the results of both branches and groups by the user to aggregate the tweeted texts into a nested list.

[Figure 1: Example processing pipeline — upper branch: (1) read tweets.json, (2) filter retweet cnt = 0, (3) select text, user.id str, user.name; lower branch: (4) read tweets.json, (5) flatten user mentions → m user, (6) select text, m user.id str, m user.name; both branches feed a union (7), followed by a select renaming text → tweet and a grouping that nests the texts into tweets.]

In the result, a duplicate Hello World text occurs in the nested tweets of user Lisa Paul, short lp. To find out how this potential data quality issue occurred, we debug the pipeline by tracing back the duplicate texts in the context of user lp, which are highlighted in dark-green in Tab. 2. The solution presented in this paper returns the dark- and medium-green items in Tab. 1. The dark-green items are contributing data. They suffice to reproduce the dark-green items in the result. The medium-green items reveal which attributes potentially influence the result of the pipeline.

If trivially extended to nested data, scalable lineage solutions [15-17, 22] provide all input tweets that contain the user lp. They are highlighted in light-grey in the input. In reality, a user typically authors more than a handful of tweets and is potentially mentioned in more than a million tweets. These tweets would all be in the provenance returned by the lineage solutions. They mask the actual two tweets causing the duplicate text.

PROVision [28] supports the unnesting of data but does not explicitly support the nesting of data. Extending it with nesting requires a list collection UDF cl, which yields the following provenance polynomial for the entire result item 102 in Tab. 2:

(p1 + p12 + p17 + (p29 · Pflatten(p29 · [0])))
· Pcl((p1 + p12 + p17 + (p29 · Pflatten(p29 · [0]))), (⟨p1⟩ + ⟨p12⟩ + ⟨p17⟩ + ⟨(p29 · Pflatten(p29 · [0]))⟩))
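To make the running example concrete, the following is a minimal sketch (in plain Python, deliberately not Spark, and not part of the paper's artifacts) of the two-branch pipeline of Fig. 1 on simplified records; attribute names are underscored variants of the paper's (e.g., `retweet_cnt` for retweet cnt), and the sample data is abridged.

```python
# Sketch of the Fig. 1 pipeline: upper branch keeps authored tweets with
# retweet_cnt == 0; lower branch flattens user_mentions; the union is then
# grouped by user, nesting the tweet texts into a list.

def run_pipeline(tweets):
    # Upper branch: tweets authored by a user, restricted to retweet_cnt == 0.
    upper = [
        {"text": t["text"], "id_str": t["user"]["id_str"], "name": t["user"]["name"]}
        for t in tweets if t["retweet_cnt"] == 0
    ]
    # Lower branch: flatten user_mentions, yielding one record per mentioned user.
    lower = [
        {"text": t["text"], "id_str": m["id_str"], "name": m["name"]}
        for t in tweets for m in t["user_mentions"]
    ]
    # Union both branches, then group by user and nest the tweet texts.
    result = {}
    for r in upper + lower:
        user = (r["id_str"], r["name"])
        result.setdefault(user, []).append(r["text"])
    return result

tweets = [
    {"text": "Hello World", "retweet_cnt": 0,
     "user": {"id_str": "lp", "name": "Lisa Paul"}, "user_mentions": []},
    {"text": "Hello World", "retweet_cnt": 1,
     "user": {"id_str": "jm", "name": "John Miller"},
     "user_mentions": [{"id_str": "lp", "name": "Lisa Paul"}]},
]

# Lisa Paul ends up with a duplicate "Hello World": once as an author
# (upper branch) and once as a mentioned user (lower branch).
print(run_pipeline(tweets)[("lp", "Lisa Paul")])
```

This reproduces, in miniature, the duplicate-text symptom that the paper's debugging scenario traces back: the duplicate arises from two distinct input tweets flowing through different branches.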

[Table 2: Example result data — three nested result items:
101: user (id str: ls, name: Lauren Smith); tweets: Hello @ls @jm @ls, Hello @ls @jm @ls
102: user (id str: lp, name: Lisa Paul); tweets: Hello @ls @jm @ls, Hello World, Hello World, Hello @lp
103: user (id str: jm, name: John Miller); tweets: Hello @ls @jm @ls, This is me @jm, This is me @jm]

Essentially, the first line tells us that the result item is based on the source tuples annotated with 1, 12, and 17, denoted as p1, p12, and p17 (all these are processed by the upper branch of the pipeline in Fig. 1), and p29, with some of its data flattened out during processing (corresponding to the lower part of the pipeline). The second line makes use of our extension and describes how data is combined by the remainder of the pipeline, where the tuples mentioned above are grouped and aggregated. The example shows that the provenance is very verbose while not precisely tracing the dark-green data items of the user question. This is the case since it collects tuple-based provenance polynomials only.

Lipstick [2] traces provenance polynomials for each nested item. This allows pinpointing the dark-green nested values Hello World and lp correctly. However, Lipstick requires annotating all values, not just the tuples, e.g., 35 rather than 5 annotations, as indicated by the superscript italic numbers in Tab. 1. This entails a significant runtime and space overhead, rendering the solution impractical when needing to scale to large volumes of data.

We also differentiate structural provenance from where-provenance [4], which determines where a (nested) result value is copied from. In our example, the where-provenance (extended to the processing pipelines we consider) would include, for the value lp, the "cells" with superscript annotations 14, 19, and 33 of Tab. 1. This is combined with the where-provenance of the Hello World result values via product. The result is not sufficiently accurate because it cannot capture that the dark-green values of the output need to be traced within their common context.

No existing solution allows recognizing (i) that the user attribute is unnested and nested again, (ii) that the id str, lp, and the text attribute Hello World are subject to different, independent structural manipulations, and (iii) that the medium-green retweet cnt and name values in Tab. 1 are accessed for filtering and grouping, respectively. Even though these values are not needed to reproduce the queried result, they are influencing the result, which is valuable information in certain use-cases. Structural provenance captures all this information since it captures not only data dependencies but also path dependencies.

To get an understanding of querying structural provenance, consider the right tree in Fig. 2. The string labels of tree nodes denote attribute names, whereas numbers refer to provenance ids (e.g., 102) or positions in nested collections (e.g., 2 and 3). The displayed tree represents the structure associated with our sample user query. It encodes the path to user lp and the duplicate Hello World items in the context of top-level data item 102. Note that name is absent from this tree since it is not pertinent to the user query. Backtracing this tree yields the two trees on the left of Fig. 2. These distinguish between data items that contribute to the result (dark-green) and data items that influence it (medium-green). The nodes match the green items in Tab. 1. A closer look at the medium-green name node reveals that this node influences the queried result since it is accessed for grouping (light-blue 9).

[Figure 2: Example provenance trees. Tracing the tree on the right back to the input yields the trees on the left. Legend: access by operator, manipulation by operator, provenance IDs/positions, attribute names; dark-green nodes mark contributing attributes, medium-green nodes influencing attributes.]
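As a toy illustration of such a query tree (our own simplification, not Pebble's tree-pattern language), a pattern of attribute names and positions can be matched against a nested result item to select the items whose provenance is requested. Note that the paper's paths are 1-based, while Python lists are 0-based, so the duplicate Hello World texts at positions 2 and 3 appear at indices 1 and 2 here.

```python
# Match a tree-pattern (nested dict; leaves are None) against a nested data
# item, returning the access paths of all matching nodes.

def match(item, pattern, path=""):
    """Recursively match a tree-pattern against a nested item."""
    if pattern is None:
        return [path]
    paths = []
    for key, sub in pattern.items():
        if isinstance(key, int):           # positional access, e.g. tweets[1]
            paths += match(item[key], sub, f"{path}[{key}]")
        elif key in item:                  # attribute access, e.g. .user
            paths += match(item[key], sub, f"{path}.{key}")
    return paths

# Result item 102 from Tab. 2 (abridged) and a pattern in the spirit of the
# query tree of Fig. 2: user lp plus the two duplicate "Hello World" texts.
item_102 = {"user": {"id_str": "lp"},
            "tweets": [{"text": "Hello @ls @jm @ls"}, {"text": "Hello World"},
                       {"text": "Hello World"}, {"text": "Hello @lp"}]}
pattern = {"user": {"id_str": None},
           "tweets": {1: {"text": None}, 2: {"text": None}}}

print(match(item_102, pattern))
```

Backtracing in Pebble then maps the matched paths through the captured identifiers and schema-level paths; this sketch only covers the pattern-matching step that selects the queried items.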
Similarly, the retweet cnt influences the result since it is accessed for filtering. Further, the name node undergoes structural manipulations at operators 3 and 8 (dark-blue).

3 RELATED WORK

This section generalizes the discussion of existing approaches that we provided along with the running example. We divide our discussion into research on data provenance in DISC systems and provenance models for nested data, summarized in Tab. 3.

[Table 3: Feature overview of related work — compares Titian (Spark RDDs), RAMP and Newt (Hadoop), Lipstick (PigLatin), PROVision (multiple DISC systems), the graph-based models of Kwasnikowska and Acar et al., program slicing (Haskell), and structural provenance (Spark Datasets) along the features: data provenance for nested data, provenance of access and manipulation, provenance of data item structure, eager/lazy provenance computation, implementation-independent provenance query formalism, DISC system compatibility/integration, and evaluated for scalability.]

3.1 Data provenance in DISC systems

Data provenance has been studied for various applications [14]. While the majority of approaches has focused on relational data processed by relational queries, first solutions have emerged for tracing provenance in DISC systems, such as Titian [12, 16, 17] for Spark, Lipstick for PigLatin (Hadoop) [2], as well as RAMP [15], Newt [22], and PROVision [28] for multiple DISC systems.

Titian, RAMP, and Newt trace lineage of data items, i.e., they determine which top-level data items contribute to which output item. These solutions scale well but do not trivially extend to nested data. PROVision extends the provenance model for top-level (or flat) data items to also capture provenance of data items in nested collections. It lacks information on attribute-level access. Lipstick is the only solution that supports provenance capture for nested data at attribute level. However, it requires annotations for each data value, not only the top-level data items. Structural provenance provides provenance on attribute level but requires annotations on top-level items only since it records access to attributes and nested data using paths. These paths are recorded on a schema level, saving space and runtime overhead.

All the above solutions except for Titian require offloading the provenance to an external tool. Titian integrates provenance querying directly into the DISC system. Thus, provenance queries can be integrated into a big data processing pipeline just like any other query. Our system extends Titian's integrated querying means with tree-patterns [13, 23] to address combinations of nested data items. Further, we present the first solution that tracks access and manipulation of attributes.

3.2 Provenance models for nested data

Focusing on nested data, at least three major directions to formalize provenance models have been researched: (i) models for why-, how-, and where-provenance, (ii) graph-based provenance models, and (iii) program slicing models.

For unions of conjunctive queries, Buneman et al. [4] define a why- and where-provenance model for nested data. This model does not extend to the programs defining data analytics pipelines in DISC systems, like the one shown in Fig. 1, since they may include map or reduce functions or any other higher-order functions in general. To model the how-provenance of nested data, a semiring-model for a subset of XQuery has been proposed [10, 18]. However, this model does not include complex operations over nested data, such as aggregations. The only how-provenance model supporting aggregations that we are aware of applies to relational data only [3].

Lipstick [2] makes use of a graph model to describe the how-provenance. This model only applies semiring annotations where possible. It provides no formal model definition for aggregations, nesting, and flattening of nested data. Kwasnikowska and Acar et al. [1, 20] also employ graph-based provenance models to track nested data items. These solutions are essentially limited to the operations defined in the Nested Relational Calculus (NRC) [5], which do not include aggregations or joins. Also, the reported implementation and evaluation (if any) indicate that they neither integrate nor scale sufficiently to apply to DISC systems.

All provenance solutions mentioned so far do not distinguish between access and manipulation, as they focus on tracing data values. In that respect, the work closest to our structural provenance model is the program slicing model [6], which tracks provenance traces for NRC operators over nested data. To provide formal guarantees, the model is limited to a small set of semantically fully specified NRC operators. In practice, it is infeasible to provide semantics for all higher-order functions such as map operations, which allow for user-defined functions. Via trace slicing it is possible to query provenance for individual nested items. However, like the other described models, this model is designed to trace data values and manipulations of them rather than structural manipulations. It is not expressive enough to faithfully capture and query structural manipulations. Its implementation is not designed or evaluated for efficiency or scalability.

The final column of Tab. 3 summarizes the capabilities of our system, which we have highlighted previously. These capabilities are based on processing structural provenance, discussed next.

4 STRUCTURAL PROVENANCE

This section formalizes structural provenance. We first present the data model and the execution model to define the corresponding structural provenance model afterwards.

4.1 Data model

DISC systems process collections of typed nested data items, which we refer to as (nested) datasets. These datasets support positional access and, thus, the handling of ordered datasets.

Definition 4.1. (Nested dataset) A nested dataset D comprises constants, data items, bags, and sets, denoted and typed as shown in Tab. 4. D is a list of data items d1, . . . , dn with or without duplicates (ordered bag vs. set), i.e., D = B | S. Each data item d is a list of ai : vi pairs. Attribute names a1, . . . , an are unique labels within each data item. Values v1, . . . , vn may be bags, sets, data items, or constants, i.e., v = B | S | c | d.

The type of D is defined recursively based on the type of its building blocks as described in Tab. 4, where τ(·) returns the type of its parameter. Bags and sets are restricted to containing elements of the same type.

Table 4: Notation and types for nested collections

Name      | Notation                        | Type τ(·)
Constant  | c                               | Int | Double | String | . . .
Data item | d = ⟨a1: v1, . . . , an: vn⟩    | ⟨a1: τ(v1), . . . , an: τ(vn)⟩
Bag       | B = {{d1, . . . , dn}}          | {{τ(d)}}, ∀d, d′ ∈ B: τ(d) = τ(d′)
Set       | S = {d1, . . . , dn}, di ≠ dj   | {τ(d)}, ∀d, d′ ∈ S: τ(d) = τ(d′)

Example 4.2. All data shown in our running example conform to the above definition. The result data of Tab. 2 has type:

{{⟨user: ⟨id str: String, name: String⟩, tweets: {{⟨text: String⟩}}⟩}}

To access the different components defined by the data model, we define access paths, inspired by XPath expressions [24] to navigate XML data. Provided a context data item d, an access path navigates to "deeper" data in the nested model. Given that the data model ensures the order of data items in lists, we also model positional accesses in paths.

Definition 4.3. (Access path w.r.t. d) In the context of a data item d, we define an access path p by p = d.p′, p′ = x | x.p′, x = a | a[i]. Here, p′ is the path accessing x either directly or recursively. The accessed x is either an attribute a in the schema of the context data item, evaluating to its value, or the i-th component of a, denoted a[i], evaluating to the item at the i-th position of a's value. For the recursive definition of p′, the context data item is updated to the item referred to by x.

For simplification, we denote a path p with context data item d by pd when the context is not clear. We refer to the enumeration of all paths that exist in a context d as path set PSd.

Example 4.4. Considering the data item d102 in Fig. 2, the path d102.tweets evaluates to a list of four data items. Path d102.tweets[2].text points to the first Hello World in that list.

4.2 Execution model

The execution model defines the processing semantics of data analytics programs like the one in Fig. 1. These programs process data complying with our data model. We model a program as a directed acyclic graph (DAG) of individual operators, such as filter, flatten, join, etc. Each operator has its own execution semantics.

Definition 4.5. (Operator) An operator O takes a set of datasets I = {I1, . . . , Ik} as input and returns a single result dataset R. Inference rules describe the execution semantics of an operator O. O has a unique identifier, a type, and its arguments.

Definition 4.6. (Program execution model) Let G(V, E) be a DAG. V = {O1, . . . , On} is the set of algebraic operators and E the set
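Definition 4.3 and Example 4.4 can be sketched with a small path evaluator over Python dicts and lists (our own illustration, not part of the formal model); positional accesses a[i] are 1-based here to mirror the paper's notation, and attribute names are underscored (`id_str` for id str).

```python
# Evaluate an access path such as "tweets[2].text" in the context of a data
# item d, following Definition 4.3: each step is an attribute, optionally with
# a 1-based position into the attribute's nested list.
import re

def eval_path(d, path):
    """Navigate a nested dict/list structure along an access path string."""
    for step in path.split("."):
        m = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", step)
        attr, pos = m.group(1), m.group(2)
        d = d[attr]                       # attribute access updates the context
        if pos is not None:
            d = d[int(pos) - 1]           # 1-based positional access a[i]
    return d

# Data item 102 from the running example (abridged).
d_102 = {"user": {"id_str": "lp", "name": "Lisa Paul"},
         "tweets": [{"text": "Hello @ls @jm @ls"}, {"text": "Hello World"},
                    {"text": "Hello World"}, {"text": "Hello @lp"}]}

print(len(eval_path(d_102, "tweets")))      # a list of four data items
print(eval_path(d_102, "tweets[2].text"))   # the first "Hello World"
```

This mirrors Example 4.4: d102.tweets evaluates to a list of four items, and d102.tweets[2].text points to the first Hello World in that list.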

of directed edges that model the data flow. If and only if the result set Ri of Oi is in the set Ij of input datasets of operator Oj, an edge (Oi, Oj) ∈ E directed from Oi to Oj exists. While G may contain multiple source nodes (in-degree of 0), G has only one sink node (out-degree of 0), which outputs the final result dataset.

Example 4.7. The graph in Fig. 1 represents the execution model of our running example. The type and parameters of each operator are displayed inside the operator nodes. Further, each node is labeled with its identifier in the top right corner.

We abstract from a particular language to define the semantics of individual operators by extending the inference rules from [11] to describe the filter, select, map, join, union, flatten, grouping and aggregation operators with their semantics.

Provenance capture model

…of an operator (which is the previously mentioned lineage); it also records accesses and manipulations in M and A while transforming input items to result data items.

Example 4.11. To illustrate our model for structural provenance, we focus on the flatten operator labeled 5 in Fig. 1. It unnests the nested items of attribute user mentions. An excerpt from its input and output data is given at the top of Fig. 3. Black headers encode bag data types, light gray headers identify complex data items as nested data type. At the bottom left, Fig. 3 shows the provenance for the result items 42 and 43. The flatten operator derives item 42 from input item 1. It accesses path user mentions[1] as recorded in A. Further, it copies the first user of the user mentions list (and implicitly, all paths in the path set PS user mentions[1]) to the new item m user as recorded in the mapping M. Ignore the bottom right for now.

[Figure 3: Provenance model and lightweight provenance capture applied on the flatten operator from the example — input items with text and user mentions (e.g., item 1: Hello @ls @jm @ls with mentions ls/Lauren Smith and jm/John Miller), the flattened outputs 42 and 43 with m user, and the captured provenance components I, M, and A.]

5 LIGHTWEIGHT PROVENANCE CAPTURE MODEL

Based on the provenance model, we introduce inference rules describing the provenance capture of our supported operators. When the rules in Tab. 5 are annotated with a ⋆, we extend an existing inference rule from [11]. Otherwise, we formalize the inference rule with and without provenance extension. Due to space constraints, we only show the complete set of inference rules with provenance. After explaining the map, flatten and aggregation rules, we show how we capture the structural provenance obtained from these rules efficiently.

5.0.1 Map. For the map operator, we assume that the function λ(i) over input data item i returns a result of type data item, denoted as τ(λ(i)) = ⟨. . .⟩. Given this precondition, without provenance capture, map returns the result of applying λ(i) for each i in the input dataset I1. The inference rule for the map operator in Tab. 5 additionally produces the provenance for each data item i. More formally, map, parameterized by a function λ, produces the result provenance R, which is a bag of data items. Each data item extends the "normal" result of map, i.e., λ(i), with two additional attributes: the input provenance I and mapping M. For I, the only input data item participating in producing an output data item λ(i) is i, which originates from the single input dataset I1. Because we generally do not know the internals of an arbitrary function λ, the set A is set to undefined, denoted by ⊥. Thus, I = {{⟨i, I1, ⊥⟩}}. The structural mapping M is also undefined because
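The flatten capture of Example 4.11 can be sketched as follows. This is a hedged simplification (attribute names such as `_prov`, `input_id`, `A`, and `M` are ours, not Pebble's), showing the idea that each output record carries its input item's identifier, the accessed path A, and the structural mapping M.

```python
# Flatten with structural provenance capture: unnest a nested-list attribute,
# recording for each output which input item it came from, which path was
# accessed (A), and which path was copied/renamed (M); positions are 1-based.

def flatten_with_provenance(dataset, attr, out_attr):
    result = []
    for item in dataset:
        for i, nested in enumerate(item[attr], start=1):
            out = {k: v for k, v in item.items() if k != attr}
            out[out_attr] = nested
            out["_prov"] = {
                "input_id": item["_id"],
                "A": [f"{attr}[{i}]"],                # path accessed
                "M": {f"{attr}[{i}]": out_attr},      # path copied to new item
            }
            result.append(out)
    return result

# Input item 1 of the running example (abridged), as in Fig. 3.
tweets = [{"_id": 1, "text": "Hello @ls @jm @ls",
           "user_mentions": [{"id_str": "ls", "name": "Lauren Smith"},
                             {"id_str": "jm", "name": "John Miller"}]}]

flat = flatten_with_provenance(tweets, "user_mentions", "m_user")
print(flat[0]["_prov"])
```

The first output corresponds to result item 42 in Fig. 3: it is derived from input item 1, records the accessed path user_mentions[1] in A, and maps user_mentions[1] to m_user in M.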
