DATA VAULT MODELING GUIDE - WordPress

Transcription

DATA VAULT MODELING GUIDEIntroductory Guide to Data Vault ModelingGENESEE ACADEMY, LLC2012Authored by: Hans Hultgren

DATA VAULT MODELING GUIDEIntroductory Guide to Data Vault ModelingF o rwardData Vault modeling is most compelling when applied to an enterprise data warehouseprogram (EDW). Several key decisions concerning the type of program, related projects, andthe scope of the broader initiative are then answered by this designation. In short, theorganization contemplating this initiative is committing to an integrated, non-volatile, timevariant and business key driven data warehouse program.The data vault principles are specifically well suited for such a program and – when appliedconsistently – can provide the organization with some very compelling benefits. Theseinclude auditability, agility, adaptability, alignment with the business, and support foroperational data warehousing initiatives.DATA VAULT MODELING GUIDE 5/15/2012To gain these benefits however, the organization will need to commit to both EDW programlevel factors as well as specific data vault modeling patterns, rules and methods. This guidepresents data vault modeling in the context of the EDW.1

I nde xDATA VAULT MODELING GUIDE 5/15/2012FORWARD.1INDEX .2THE EDW P ROGRAM .3THE DATA VAULT FUNDAMENTALS.4MODELING WITH THE DATA VAULT.6THINK DIFFERENTLY.7THE BUSINESS KEY .9BUSINESS KEY ALIGNMENT. 10ARCHITECTURE . 12SAMPLE DATA VAULT MODEL. 13HYBRID TABLES . 14APPLYING THE DATA VAULT . 14FINAL NOTE . 152

The EDW P ro gramThe Enterprise Data Warehousing (EDW) Program represents the ongoing data warehousingactivities of the organization. These activities will include the maintenance functions of thedata warehouse in addition to the continuous flow of incremental projects related to theenterprise data warehouse. These incremental projects are comprised ofa) Adapting to new data sources from internal new systems, external integrations,and from acquisitions, andb) Absorbing changes to existing sources including new tables, new attributes, newdomain values, new formats and new rules, andc) Adapting to new business rules concerning the alignment, grain, cardinality anddomain values of business keys as well as changes to the relationships betweenthem, andd) Accommodating new downstream delivery requirements including new subjectareas, new business rules, additional regulatory and other compliance reportingand changes to operational latency requirements.DATA VAULT MODELING GUIDE 5/15/2012For this reason, the EDW itself is not designated a “project” (there is no discernablebeginning and end, and no pre-determined set of specific goals).3In a broader sense, this program can be defined as the BI Function or BI Program within anorganization. To be clear however, this is not simply the group that owns the OLAP tools.This is the higher level view of all data warehousing and business intelligence (DWBI) withinthe organization which includes the business intelligence competency center or BICC, theEDW or CDW team, the related governance components and the environment both technicaland organizational. The success of a DWBI program depends on an organizationalcommitment and a corporate BI culture.It is precisely in this context where the data vault approach is the most valuable. So theData Vault EDW is defined first and foremost by the enterprise wide, long term DWBIprogram – from a technical architecture perspective and from an organizational culturalalignment perspective as well.

The Dat a Va ult F undamentalsThe data vault consists of three core components, the Hub, Link and Satellite. Above allother DV Program rules and factors, the commitment to the consistency and integrity ofthese constructs is paramount to a successful DV Program.The Hub represents a Core Business Concept such as Customer, Vendor, Sale or Product.The Hub table is formed around the Business Key of this concept and is established the firsttime a new instance of that business key is introduced to the EDW. It may require a multiplepart key to assure an enterprise wide unique key however the cardinality of the Hub must be1:1 with a single instance of the business concept. The Hub contains no descriptiveinformation and contains no FKs. The Hub consists of the business key only, with awarehouse machine sequence id, a load date/time stamp and a record source.Fig. 1 Hu bFig. 2 Li n kNotice the similarity between the Hub and the Link. Both represent the first time a corebusiness concept (Hub) or natural business relationship (Link) is introduced to the DW.DATA VAULT MODELING GUIDE 5/15/2012A Link represents a natural business relationships between business keys and isestablished the first time this new unique association is presented to the EDW. It canrepresent an association between several Hubs and sometimes other Links. It doesmaintain a 1:1 relationship with the unique and specific business defined associationbetween that set of keys. Just like the Hub, it contains no descriptive information. The Linkconsists of the sequence ids from the Hubs and Links that it is relating only, with awarehouse machine sequence id, a load date/time stamp and a record source.4

The Satellite contains the descriptive information (context) for a business key. There canbe several Satellites used to describe a single business key (or association of keys) howevera Satellite can only describe one key (Hub or a Link). There is a good amount of flexibilityafforded the modelers in how they design and build Satellites. Common approaches includeusing the subject area, rate of change, source system, or type of data to split out contextand design the Satellites. The Satellite is keyed by the sequence id from the Hub or Link towhich it is attached plus the date/time stamp to form a two part key.Note that the Satellite then is the only construct that manages time slice data (datawarehouse historical tracking of values over time).Fig. 3 Sat el l i t eA Satellite does not have a Sequence ID of its own and in fact cannot have a different keythan the Hub or Link sequence to which it is attached. Further, a Satellite does not have anyforeign key constraints (no snow-flaking, branching or bridging).DATA VAULT MODELING GUIDE 5/15/2012These three constructs are the building blocks for the DV EDW. Together they can be usedto represent all integrated data from the organization. The Hubs are the business keys, theLinks represent all relationships and the Satellites provide all the context and changes overtime.5Fig. 4 D at a Vau l t M odel

When we look at the Hub and Link together, they form the backbone or “Skeletal Structure”of the model. This backbone model represents a 1:1 relationship with core BusinessConcepts and their natural business relationships.Fig. 5 Bac kbon e or Skel et al St ru c t u reNote that all context (descriptive information) and all history are found in the Satellites.Mo de ling w it h t he Da ta Va ultThe process of modeling with the Data Vault is closely aligned with business analysis. Thefirst step is to identify the Hubs for the given subject area. Once the Hubs are defined wenext model the natural business relationships between these Hubs. Then we design andadd the Satellites to provide context to these constructs.TASK1.1Identify Business Concepts1.2Establish EWBK for Hubs1.3Model Hubs2.1Identify Natural Business Relationships2.2Analyze Relationships Unit of Work2.3Model Links3.1Gather Context Attributes to Define Keys3.2Establish Criteria & Design Satellites3.3Model SatellitesFig. 6 St eps t o model i n g wi t h D at a Vau l tThis process is not concerned with separating facts from dimensions, or from separatingmaster entities from events or transactions. The focus is squarely on core businessDATA VAULT MODELING GUIDE 5/15/2012STEP6

concepts – and their unique business keys. In that regard, all of the above are candidatesfor Hubs. For example events including transactions are modeled as Hubs.Think Dif ferentlyModeling with Data Vault requires us to think differently. Most of us first learned 3NFmodeling for operational databases. To manage third normal form, all attributes in an Entitymust depend directly on the key of that Entity. So the context attributesthat describe a customer (last name, first name, address, city, state,postal code, home phone, mobile phone, etc.) must be placed in theCustomer Entity where the key uniquely identifies an instance of acustomer. If we included attributes that do not depend on the key of thatentity then we would not be in 3rd normal form. Likewise if we placedsome of the attributes that depend on that key into another entity thenagain we would no longer be in 3rd normal form.At some point we may have also learned how to model using dimensionalmodeling techniques. Though different modeling constructs and otherrules for modeling, the concept of including context attributes inside atable with a key for those attributes remains the same. A ConformedDimension requires that context attributes depend on the key of thatDimension. Again if we move out attributes depending on a dimension keyto some other construct then we no longer have a conformed dimension.DATA VAULT MODELING GUIDE 5/15/2012Shown here is a Customer Entity in 3NF where we can see the BusinessKey (Customer Code), the relationship (Customer Class SID) and all theFig. 7 3NFcontext in the form of all remaining attributes in the table. Notice that this is one tableincluding all of this these components.7Fig. 8 D at a Vault CustomerCu s t omerWith Data Vault modeling we separatethe business keys from the relationshipsfrom the context. All of the businesskeys are modeled as Hubs, allrelationships and associations aremodeled as Links, and all context andhistory is provided for through theSatellites. Shown here we can see thatthe Business Key (Customer Code) is ina Hub (H Customer), the relationship(Customer Class ID) is in a Link(L Customer Cust Class), and thecontext is modeled in several Satellites.

Look back to the 3NF model and now consider that all of the same information (the samecomponents of data) about “Customer” are represented fully in both models. Interestinglyboth models represent a dependency on a single business key. Actually if we draw circlesaround each of these models we can see that what is inside each circle is a representationof the same single business key, the same set of attributes and the same relationship.Notice here that the reaching out from the “Customer”to the Customer Class is modeled through arelationship with a FK inside the 3NF circle.The same is true for the DVcircle in that reaching out from“Customer” to Customer Class ismodeled though a relationship(Link) with a FK in that Link andon the perimeter of the circle.Fig. 9 3NF M odelThis means that a) all things in eithercircle are dependent on a singlebusiness key, b) relationships passthrough the circle directly from tablewith the BK, c) the only grain shift ineither circle is based on Date/Timestamp for the purpose of trackinghistory.Fig. 10 D at a Vau l t M odelHINT: As you progress with Data Vault Modeling, this view of thinking differently willbecome more and more important. We tend to see tables the same way we havealways see them. For this reason, we tend to re-combine keys with relationships withcontext. But as soon as you do, you actually stop vaulting and return to other formsof modeling. So before you change the grain of a satellite, include a relationship FKin a Satellite or Hub, please consider the above circles analysis and reconsider.DATA VAULT MODELING GUIDE 5/15/2012It is important to think of the DVcircle in the same way as the 3NFcircle.8

The Bus iness K e yAt the core of the Data Vault is the Hub which we refer to as the business key. Perhaps themost important initial step in modeling a DV EDW is to identify and thoughtfully design thesebusiness keys. To begin with, a Business Key is representative of the core business entitylike “customer” or “product” for example. In addition, the BK also represents event basedkeys such as “sale” or “transfer”. In this way, the design process for the Data Vault does notconcern itself with the differences between the person/place/thing type entities and theevent type entities. To put this another way, we are not concerned with differentiatingDimensions from Facts but rather are focusing on identifying Business Keys which canrepresent either.This approach is then different from traditional approaches for modeling operationalsystems or

data warehouse in addition to the continuous flow of incremental projects related to the enterprise data warehouse. These incremental projects are comprised of a) Adapting to new data sources from internal new systems, external integrations, and from acquisitions, and b) Absorbing changes to existing sources including new tables, new attributes, new domain values, new formats and new rules .