The Data Warehouse Life Cycle Toolkit - Health Research Web

Transcription

i

The Data Warehouse Lifecycle ToolkitTable of ContentsChapter 1- The Chess PiecesSection 1 - Project Management and RequirementsChapter 2Chapter 3Chapter 4- The Business Dimensional Lifecycle- Project Planning and Management- Collecting the RequirementsSection 2 - Data DesignChapter 5Chapter 6Chapter 7- A First Course on Dimensional Modeling- A Graduate Course on Dimensional Modeling- Building Dimensional ModelsSection 3 - ArchitectureChapter 8Chapter 9Chapter 10Chapter 11Chapter 12Chapter 13- Introducing Data Warehouse Architecture- Back Room Technical Architecture- Architecture for the Front Room- Infrastructure and Metadata- A Graduate Course on the Internet and Security- Creating the Architecture Plan and Selecting ProductsSection 4 – ImplementationChapter 14Chapter 15Chapter 16Chapter 17- A Graduate Course on Aggregates- Completing the Physical Design- Data Staging- Building End User ApplicationsSection 5 - Deployment and GrowthChapter 18 - Planning the DeploymentChapter 19 - Maintaining and Growing the Data Warehouseii

The Purpose of Each Chapter1. The Chess Pieces. As of the writing of this book, a lot of vague terminology wasbeing tossed around in the data warehouse marketplace. Even the term datawarehouse has lost its precision. Some people are even trying to define the datawarehouse as a nonqueryable data resource! We are not foolish enough to think wecan settle all the terminology disputes in these pages, but within this book we willstick to a very specific set of meanings. This chapter briefly defines all the importantterms used in data warehousing in a consistent way. Perhaps this is something likestudying all the chess pieces and what they can do before attempting to play a chessgame. We think we are pretty close to the mainstream with these definitions.Section 1: Project Management and Requirements2. The Business Dimensional Lifecycle. We define the complete BusinessDimensional Lifecycle from 50,000 feet. We briefly discuss each step and giveperspective on the lifecycle as a whole.3. Project Planning and Management. In this chapter, we define the project and talkabout setting its scope within your environment. We talk extensively about thevarious project roles and responsibilities. You won’t necessarily need a full headcountequivalent for each of these roles, but you will need to fill them in almost anyimaginable project. This is a chapter for managers.4. Collecting the Requirements. Collecting the business and data requirements is thefoundation of the entire data warehouse effort—or at least it should be. Collecting therequirements is an art form, and it is one of the least natural activities for an ISorganization. We give you techniques to make this job easier and hope to impressupon you the necessity of spending quality time on this step.Section 2: Data Design5. A First Course on Dimensional Modeling. We start with an energetic argument forthe value of dimensional modeling. We want you to understand the depth of ourcommitment to this approach. After performing hundreds of data warehouse designsand installations over the last 15 years, we think this is the only approach you canuse to achieve the twin goals of understandability and performance. We then revealthe central secret for combining multiple dimensional models together into a coherentwhole. This secret is called conformed dimensions and conformed facts. We call thisapproach the Data Warehouse Bus Architecture. Your computer has a backbone,called the computer bus, that everything connects to, and your data warehouse has abackbone, called the data warehouse bus, that everything connects to. Theremainder of this chapter is a self-contained introduction to the science ofdimensional modeling for data warehouses. This introduction can be viewed as anappendix to the full treatment of this subject in Ralph Kimball’s earlier book, The DataWarehouse Toolkit.6. A Graduate Course on Dimensional Modeling. Here we collect all the hardestdimensional modeling situations we can think of. Most of these examples come fromspecific business situations, such as dealing with a monster customer list.7. Building Dimensional Models. In this chapter we tackle the issue of how to createthe right model within your organization. You start with a matrix of data marts anddimensions, and then you design each fact table in each data mart according to thetechniques described in Chapter 5. The last half of this chapter describes the real-lifemanagement issues in applying this methodology and building all the dimensionalmodels needed in each data mart.Section 3: Architecture8. Introducing Data Warehouse Architecture. In this chapter we introduce all thecomponents of the technical architecture at a medium level of detail. This paints theoverall picture. The remaining five chapters in this section go into the specific areasof detail. We divide the discussion into data architecture, application architecture, andinfrastructure. If you follow the Data Warehouse Bus Architecture we developed inChapter 5, you will be able to develop your data marts one at a time, and you will endiii

up with a flexible, coherent overall data warehouse. But we didn’t say it would beeasy.9. Technical Back Room Architecture. We introduce you to the system componentsin the back room: the source systems, the reporting instance, the data staging area,the base level data warehouse, and the business process data marts. We tell youwhat happened to the operational data store (ODS). We also talk about all theservices you must provide in the back room to get the data ready to load into yourdata mart presentation server.10. Architecture for the Front Room. The front room is your publishing operation. Youmake the data available and provide an array of tools for different user needs. Wegive you a comprehensive view of the many requirements you must support in thefront room.11. Infrastructure and Metadata. Infrastructure is the glue that holds the datawarehouse together. This chapter covers the nuts and bolts. We deal with the detailwe think every data warehouse designer and manager need to know abouthardware, software, communications, and especially metadata.12. A Graduate Course on the Internet and Security. The Internet has a potentiallyhuge impact on the life of the data warehouse manager, but many data warehousemanagers are either not aware of the true impact of the Internet or they are avoidingthe issues. This chapter will expose you to the current state of the art on Internetbased data warehouses and security issues and give you a list of immediate actionsto take to protect your installation. The examples throughout this chapter are slantedtoward the exposures and challenges faced by the data warehouse owner.13. Creating the Architecture Plan and Selecting Products. Now that you are asoftware, hardware, and infrastructure expert, you are ready to commit to a specificarchitecture plan for your organization and to choose specific products. We talk aboutthe selection process and which combination of product categories you need. Bear inmind this book is not a platform for talking about specific vendors, however.Section 4: Implementation14. A Graduate Course on Aggregations. Aggregations are prestored summaries thatyou create to boost performance of your database systems. This chapter divesdeeply into the structure of aggregations, where you put them, how you use them,and how you administer them. Aggregations are the single most cost-effective way toboost performance in a large data warehouse system assuming that the rest of yoursystem is constructed according to the Data Warehouse Bus Architecture.15. Completing the Physical Design. Although we don’t know which DBMS and whichhardware architecture you will choose, there are a number of powerful ideas at thislevel that you should understand. We talk about physical data structures, indexingstrategies, specialty databases for data warehousing, and RAID storage strategies.16. Data Staging. Once you have the major systems in place, the biggest and riskieststep in the process is getting the data out of the legacy systems and loading into thedata mart DBMSs. The data staging area is the intermediate place where you bringthe legacy data in for cleaning and transforming. We have a lot of strong opinionsabout what should and should not happen in the data staging area.17. Building End User Applications. After the data is finally loaded into the DBMS, westill have to arrange for a soft landing on the users’ desktops. The end userapplications are all the query tools and report writers and data mining systems forgetting the data out of the DBMS and doing something useful. This chapter describesthe starter set of end user applications you need to provide as part of the initial datamart implementation.Section 5: Deployment and Growth18. Planning the Deployment. When everything is ready to go, you still have to roll thesystem out and behave in many ways like a commercial software vendor. You needto install the software, train the users, collect bug reports, solicit feedback, andrespond to new requirements. You need to plan carefully so that you can deliveraccording to the expectations you have set.iv

19. Maintaining and Growing the Data Warehouse. Finally, when your entire datamart edifice is up and running, you have to turn around to do it again! As we saidearlier, the data warehouse is more of a process than a project. This chapter is anappropriate end for the book, if only because it leaves you with a valuable lastimpression: You are never done.Supporting Tools Appendix A. This appendix summarizes the entire project plan for the BusinessDimensional Lifecycle in one place and in one format. All of the project tasks and rolesare listed. Appendix B. This appendix is a guided tour of the contents of the CD-ROM. All of theuseful checklists, templates, and forms are listed. We also walk you through how to useour sample design of a Data Warehouse Bus Architecture. CD-ROM. The CD-ROM that accompanies the book contains a large number of actualchecklists, templates, and forms for you to use with your data warehouse development. Italso includes a sample design illustrating the Data Warehouse Bus ArchitectureThe Goals of a Data WarehouseOne of the most important assets of an organization is its information. This asset isalmost always kept by an organization in two forms: the operational systems of recordand the data warehouse. Crudely speaking, the operational systems of record are wherethe data is put in, and the data warehouse is where we get the data out. In The DataWarehouse Toolkit, we described this dichotomy at length. At the time of this writing, it isno longer so necessary to convince the world that there are really two systems or thatthere will always be two systems. It is now widely recognized that the data warehousehas profoundly different needs, clients, structures, and rhythms than the operationalsystems of record.Ultimately, we need to put aside the details of implementation and modeling, andremember what the fundamental goals of the data warehouse are. In our opinion, thedata warehouse: Makes an organization’s information accessible. The contents of the datawarehouse are understandable and navigable, and the access is characterized by fastperformance. These requirements have no boundaries and no fixed limits.Understandable means correctly labeled and obvious. Navigable means recognizingyour destination on the screen and getting there in one click. Fast performance meanszero wait time. Anything else is a compromise and therefore something that we mustimprove. Makes the organization’s information consistent. Information from one part of theorganization can be matched with information from another part of the organization. Iftwo measures of an organization have the same name, then they must mean the samething. Conversely, if two measures don’t mean the same thing, then they are labeleddifferently. Consistent information means high-quality information. It means that all ofthe information is accounted for and is complete. Anything else is a compromise andtherefore something that we must improve. Is an adaptive and resilient source of information. The data warehouse is designedfor continuous change. When new questions are asked of the data warehouse, theexisting data and the technologies are not changed or disrupted. When new data isadded to the data warehouse, the existing data and the technologies are not changedor disrupted. The design of the separate data marts that make up the data warehousemust be distributed and incremental. Anything else is a compromise and thereforesomething that we must improve. Is a secure bastion that protects our information asset. The data warehouse notonly controls access to the data effectively, but gives its owners great visibility into theuses and abuses of that data, even after it has left the data warehouse. Anything elseis a compromise and therefore something that we must improve.v

Is the foundation for decision making. The data warehouse has the right data in it tosupport decision making. There is only one true output from a data warehouse: thedecisions that are made after the data warehouse has presented its evidence. Theoriginal label that predates the data warehouse is still the best description of what we aretrying to build: a decision support system.The Goals of This BookIf we succeed with this book, you—the designers and managers of large datawarehouses—will achieve your goals more quickly. You will build effective datawarehouses that match well against the goals outlined in the preceding section, and youwill make fewer mistakes along the way. Hopefully, you will not reinvent the wheel anddiscover “previously owned” truths.We have tried to be as technical as this large subject allows, without getting waylaid byvendor-specific details. Certainly, one of the interesting aspects of working in the datawarehouse marketplace

the data is put in, and the data warehouse is where we get the data out. In The Data Warehouse Toolkit, we described this dichotomy at length. At the time of this writing, it is no longer so necessary to convince the world that there are really two systems or that there will always be two systems. It is now widely recognized that the data warehouse has profoundly different needs, clients .