The Architecture For The Next Generation Of Data Warehousing


DW 2.0
The Architecture for the Next Generation of Data Warehousing

W. H. Inmon, Forest Rim Technology
Derek Strauss, Gavroshe
Genia Neushloss, Gavroshe

AMSTERDAM BOSTON HEIDELBERG LONDON NEW YORK OXFORD PARIS SAN DIEGO SAN FRANCISCO SINGAPORE SYDNEY TOKYO

Morgan Kaufmann Publishers is an imprint of Elsevier.

Contents

Preface
Acknowledgments
About the Authors

CHAPTER 1 A brief history of data warehousing and first-generation data warehouses
Database management systems
Online applications
Personal computers and 4GL technology
The spider web environment
Evolution from the business perspective
The data warehouse environment
What is a data warehouse?
Integrating data—a painful experience
Volumes of data
A different development approach
Evolution to the DW 2.0 environment
The business impact of the data warehouse
Various components of the data warehouse tional data store
Data mart
Exploration warehouse
The evolution of data warehousing from the business perspective
Other notions about a data warehouse
The active data warehouse
The federated data warehouse approach
The star schema approach
The data mart data warehouse
Building a "real" data

CHAPTER 2 An introduction to DW 2.0
DW 2.0—a new paradigm
DW 2.0—from the business perspective
The life cycle of data
Reasons for the different sectors
Metadata
Access of data
Structured data/unstructured data

Textual analytics
Blather
The issue of terminology
Specific text/general text
Metadata—a major component
Local metadata
A foundation of technology
Changing business requirements
The flow of data within DW 2.0
Volumes of data
Useful applications
DW 2.0 and referential integrity
Reporting in DW 2.0
Summary

CHAPTER 3 DW 2.0 components—about the different sectors
The Interactive Sector
The Integrated Sector
The Near Line Sector
The Archival Sector
Unstructured processing
From the business perspective
Summary

CHAPTER 4 Metadata in DW 2.0
Reusability of data and analysis
Metadata in DW 2.0
Active repository/passive repository
The active repository
Enterprise metadata
Metadata and the system of record
Taxonomy
Internal taxonomies/external taxonomies
Metadata in the Archival Sector
Maintaining metadata
Using metadata—an example
From the end-user perspective
Summary

CHAPTER 5 Fluidity of the DW 2.0 technology infrastructure
The technology infrastructure
Rapid business

The treadmill of change
Getting off the treadmill
Reducing the length of time for IT to respond
Semantically temporal, semantically static data
Semantically temporal data
Semantically stable data
Mixing semantically stable and unstable data
Separating semantically stable and unstable data
Mitigating business change
Creating snapshots of data
A historical record
Dividing data
From the end-user

CHAPTER 6 Methodology and approach for DW 2.0
Spiral methodology—a summary of key features
The seven streams approach—an overview
Enterprise reference model stream
Enterprise knowledge coordination stream
Information factory development stream
Data profiling and mapping stream
Data correction stream
Infrastructure stream
Total information quality management stream
Summary

CHAPTER 7 Statistical processing and DW 2.0
Two types of transactions
Using statistical analysis
The integrity of the comparison
Heuristic analysis
Freezing data
Exploration processing
The frequency of analysis
The exploration facility
The sources for exploration processing
Refreshing exploration data
Project-based data
Data marts and the exploration facility
A backflow of data
Using exploration data

From the perspective of the business analyst
Summary

CHAPTER 8 Data models and DW 2.0
An intellectual road map
The data model and business
The scope of integration
Making the distinction between granular and summarized data
Levels of the data model
Data models and the Interactive Sector
The corporate data model
A transformation of models
Data models and unstructured data
From the perspective of the business

CHAPTER 9 Monitoring the DW 2.0 environment
Monitoring the DW 2.0 environment
The transaction monitor
Monitoring data quality
A data warehouse monitor
The transaction monitor—response time
Peak-period processing
The ETL data quality monitor
The data warehouse monitor
Dormant data
From the perspective of the business

CHAPTER 10 DW 2.0 and security
Protecting access to data
Encryption
Drawbacks
The firewall
Moving data offline
Limiting encryption
A direct dump
The data warehouse monitor
Sensing an attack
Security for near line data
From the perspective of the business user
Summary

CHAPTER 11 Time-variant data
All data in DW 2.0—relative to time
Time relativity in the Interactive Sector
Data relativity elsewhere in DW 2.0
Transactions in the Integrated Sector
Discrete data
Continuous time span data
A sequence of records
Nonoverlapping records
Beginning and ending a sequence of records
Continuity of data
Time-collapsed data
Time variance in the Archival Sector
From the perspective of the end user
Summary

CHAPTER 12 The flow of data in DW 2.0
The flow of data throughout the architecture
Entering the Interactive Sector
The role of ETL
Data flow into the Integrated Sector
Data flow into the Near Line Sector
Data flow into the Archival Sector
The falling probability of data access
Exception-based flow of data
From the perspective of the business user
Summary

CHAPTER 13 ETL processing and DW 2.0
Changing states of data
Where ETL fits
From application data to corporate data
ETL in online mode
ETL in batch mode
Source and target
An ETL mapping
Changing states—an example
More complex transformations
ETL and throughput
ETL and metadata
ETL and an audit

ETL and data quality
Creating ETL
Code creation or parametrically driven ETL
ETL and rejects
Changed data capture
ELT
From the perspective of the business user
Summary

CHAPTER 14 DW 2.0 and the granularity manager
The granularity manager
Raising the level of granularity
Filtering data
The functions of the granularity manager
Home-grown versus third-party granularity managers
Parallelizing the granularity manager
Metadata as a by-product
From the perspective of the business user
Summary

CHAPTER 15 DW 2.0 and performance
Good performance—a cornerstone for DW 2.0
Online response time
Analytical response time
The flow of data
Queues
Heuristic processing
Analytical productivity and response time
Many facets to performance
Indexing
Removing dormant data
End-user education
Monitoring the environment
Capacity planning
Metadata
Batch parallelization
Parallelization for transaction processing
Workload management
Data marts
Exploration facilities
Separation of transactions into classes
Service level

Protecting the Interactive Sector
Partitioning data
Choosing the proper hardware
Separating farmers and explorers
Physically group data together
Check automatically generated code
From the perspective of the business user
Summary

CHAPTER 16 Migration
Houses and cities
Migration in a perfect world
The perfect world almost never happens
Adding components incrementally
Adding the Archival Sector
Creating enterprise metadata
Building the metadata infrastructure
"Swallowing" source systems
ETL as a shock absorber
Migration to the unstructured environment
From the perspective of the business user
Summary

CHAPTER 17 Cost justification and DW 2.0
Is DW 2.0 worth it?
Macro-level justification
A micro-level cost justification
Company B has DW 2.0
Creating new analysis
Executing the steps
So how much does all of this cost?
Consider company B
Factoring the cost of DW 2.0
Reality of information
The real economics of DW 2.0
The time value of information
The value of integration
Historical information
First-generation DW and DW 2.0—the economics
From the perspective of the business

CHAPTER 18 Data quality in DW 2.0
The DW 2.0 data quality tool set
Data profiling tools and the reverse-engineered data model
Data model types
Data profiling inconsistencies challenge top-down modeling
Summary

CHAPTER 19 DW 2.0 and unstructured data
DW 2.0 and unstructured data
Reading text
Where to do textual analytical processing
Integrating text
Simple editing
Stop words
Synonym replacement
Synonym concatenation
Homographic resolution
Creating themes
External glossaries/taxonomies
Stemming
Alternate spellings
Text across languages
Direct searches
Indirect searches
Terminology
Semistructured data/VALUE NAME data
The technology needed to prepare the data
The relational data base
Structured/unstructured linkage
From the perspective of the business user
Summary

CHAPTER 20 DW 2.0 and the system of record
Other systems of record
From the perspective of the business user
Summary

CHAPTER 21 Miscellaneous topics
Data marts
The convenience of a data mart
Transforming data mart

Monitoring DW 2.0
Moving data from one data mart to another
Bad data
A balancing entry
Resetting a value
Making corrections
The speed of movement of data
Data warehouse utilities
Summary

CHAPTER 22 Processing in the DW 2.0

CHAPTER 23 Administering the DW 2.0 environment
The data model
Architectural administration
Defining the moment when an Archival Sector will be needed
Determining whether the Near Line Sector is needed
Metadata administration
Database administration
Stewardship
Systems and technology administration
Management administration of the DW 2.0 environment
Prioritization and prioritization conflicts
Budget
Scheduling and determination of milestones
Allocation of resources
Managing

Index

Information factory development stream
Data profiling and mapping stream
Data correction stream
Infrastructure stream
Total information quality management stream
Summary

CHAPTER 7 Statistical processing and DW 2.0
Two types of transactions
Using statistical analysis
The integrity of the comparison