New Trends In Data Warehousing And Data Analysis

Transcription

New Trends in Data Warehousingand Data Analysis

Annals of Information SystemsVolume 1:Managing in the Information Economy: Current Research IssuesUday Apte, Uday Karmarkar, eds.Volume 2:Decision Support for Global EnterprisesUday Kulkarni, Daniel J. Power, Ramesh Sharda, eds.

New Trends in Data Warehousingand Data AnalysisEditorsStanisław KozielskiRobert Wrembel13

EditorsStanisław KozielskiSilesian University of TechnologyGliwice, PolandStanislaw.Kozielski@polsl.plISSN: 1934-3221ISBN-13: 978-0-387-87430-2DOI: 10.1007/978-0-387-87430-2Robert WrembelPoznań University of TechnologyPoznań, PolandRobert.Wrembel@cs.put.poznan.ple-ISBN-13: 978-0-387-87431-9Library of Congress Control Number: 2008934456 2009 by Springer Science Business Media, LLCAll rights reserved. This work may not be translated or copied in whole or in part without the writtenpermission of the publisher (Springer Science Business Media, LLC, 233 Spring Street, New York,NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use inconnection with any form of information storage and retrieval, electronic adaptation, computersoftware, or by similar or dissimilar methodology now know or hereafter developed is forbidden.The use in this publication of trade names, trademarks, service marks and similar terms, even if theare not identified as such, is not to be taken as an expression of opinion as to whether or not they aresubject to proprietary rights.Printed on acid-free paperspringer.com

PrefaceA modern way of managing enterprises, institutions, and organizations is based onknowledge, that in turn, is gained from data analysis. In practice, business decisionsare taken based on the analysis of past and current data, continuously collectedduring the lifetime of an enterprise. A data analysis technology, widely acceptedby research and industry, is based on data warehouse architecture. In this architecture, data coming from multiple distributed and heterogeneous storage systems areintegrated in a central repository, called a data warehouse (DW). Such integrateddata are analyzed by the so-called On-Line Analytical Processing (OLAP) queriesand data mining applications for the purpose of: analyzing business performancemeasures, discovering and analyzing trends, discovering anomalies and patterns ofbehavior, finding hidden dependencies between data as well as predicting businesstrends and simulating business solutions. The DW and OLAP technologies are applied in multiple areas including among others: sales business, stock market, banking, insurance, energy management, telecommunication, medicine, and science.From a technical point of view, a data warehouse is a large database, whose sizeranges from several hundreds of GB to several dozens of TB or even several PB.The size of a DW, high complexity of OLAP queries and data mining algorithmsas well as the heterogeneous nature of data being integrated in a DW cause serious research and technological challenges. Intensive research has been conducted inmultiple fields including among others: (1) conceptual modeling of DWs and logicaldata models, (2) DW loading (refreshing), (3) assuring efficient execution of OLAPqueries and data mining algorithms, (4) managing materialized views, (5) data analysis techniques, (6) metadata management, (7) managing the evolution of DWs, (7)stream, real-time, and active data warehouses, and (8) warehousing complex data(e.g, XML, multimedia, object, spatial).Typically, a DW applies the so-called multidimensional data model. In thismodel, analyzed data, called facts, are referenced by multiple dimensions that setup the context of an analysis. Such multi-dimensional spaces are called data cubes.In practice, a data cube can be implemented either in relational OLAP (ROLAP)servers or in multidimensional OLAP (MOLAP) servers.v

viPrefaceA DW is filled with data by means of the so-called Extraction-TransformationLoading (ETL) processes. They are responsible among others for: extracting and filtering data from data sources, transforming data into a common data model, cleansing data in order to remove inconsistencies, duplicates, and null values, integratingcleansed data into one common data set, computing summaries, and loading datainto a DW.OLAP queries typically join multiple tables, filter and sort data, compute aggregates and use complex groupings. Since these queries are very complex and theyoften read terabytes of data, their execution may take dozens of minutes, hours, oreven days. Therefore, one of the most important research and technological problems concerns assuring an acceptable performance of OLAP queries. Well developed solutions to this problem are based on materialized views and query rewriting,advanced index structures as well as on parallel processing and data partitioningtechniques.Materialized views applied to storing precomputed results of OLAP queries areone of basic DW objects. Multiple solutions have been proposed for selecting optimal sets of materialized views for a given query workload, for efficient refreshingof materialized views as well as for assuring materialized view consistency duringa DW refreshing.In order to support OLAP processing, a standard SQL has been extended withmultiple clauses as well as with predefined specialized analytical and forecastingfunctions. Moreover, multiple data mining algorithms have been integrated into DWmanagement systems.In order to work properly and efficiently, all of the above mentioned techniquesneed to use metadata. Managing various types of metadata in DWs has also receiveda lot of attention, resulting in widely accepted industry standard CWM, supportedby major DW software vendors.An inherent feature of data sources is their evolution in time that concerns notonly their content but also their structures. The evolution of structures of datasources has an impact on deployed ETL processes as well as on the structure ofa DW. The most advanced approaches to handling the evolution of data sources arebased on temporal extensions and versioning mechanisms.Traditional DWs are refreshed periodically overnight and users execute their analytical applications after a DW refreshing. For some DW applications, however,such a DW usage is inappropriate. For example, monitoring the intensity of traffic,controlling physical parameters of technological processes, monitoring patients’ vital signs by means of various kinds of sensors require an on-the-fly analysis. Monitoring credit card usage in order to detect unauthorized usage requires frequent DWrefreshing and mechanisms of automatic data analysis. Such requirements led to thedevelopment of stream data analysis systems, active and real-time data warehouses.Modern information systems often store huge amounts of data in various Websystems. Typically, data are represented in these systems as XML and HTML documents, images, maps, and data of other complex structures. These data are as important as traditional text and numerical data, and therefore, there is a need to analyzethem in a similar way as in traditional DWs. This requirement motivates researchers

Prefaceviito design and build data warehouses from Web data sources and to provide OLAPfunctionality for XML documents, multimedia data, and spatial data.Despite substantial achievements in the DW and OLAP technologies that havebeen made for the past 30 years, DW and OLAP technologies still are and will bevery active areas of research and technological development.The aim of this special issue of the Annals of Information Systems is to presentcurrent advances in the data warehouse and OLAP technologies as well as to pointto the areas for further research. The issue is composed of 15 chapters that in generaladdress the following research and technological areas: advanced technologies forbuilding XML, spatial, temporal, and real-time data warehouses, novel approachesto DW modeling, data storage and data access structures, as well as advanced dataanalysis techniques.Chapter 1 overviews open research problems concerning building data warehouses for integrating and analyzing various complex types of data, dealing withtemporal aspects of data, handling imprecise data, and ensuring privacy in DWs.Chapter 2 discusses challenges in designing ETL processes for real-time (or nearreal-time) data warehouses and proposes an architecture of a real-time DW system.Chapter 3 discusses data warehouse modeling techniques, based on multidimensional modeling. In particular, the chapter covers conceptual, logical, and physicalmodeling.Chapter 4 proposes an approach to personalizing a multidimensional database.The authors present a model and a language that allow to define user preferences onschema elements. User preferences are expressed by means of weights associatedwith schema elements and they express user interest in data.Chapter 5 covers designing spatial (geographical) data warehouses. It proposes ametamodel for the support of the design of spatial dimensional schemas.Chapter 6 presents a technique for approximate answering range-sum queries ondata cubes. To this end, the authors propose tree-based data structures for separatelystoring sampled data and outliers data. The proposed technique assures a good quality of approximate answers.Chapter 7 addresses a problem of summarizing multidimensional search spaces,called data cubes. The authors propose the concept of the so-called closed cube,which is a cover for a data cube. The authors show that the closed cube is smallerthan its competitor, i.e. a quotient cube, and it can be used for deriving a quotientcube.Chapter 8 analyzes multiple index structures for multiversion data warehouses.In particular, the paper describes how to extend index structures designed for datawith linear evolution in order to handle data with branched evolution and it providesan analytical model for comparing various index structures for multiversion DWs.Chapter 9 discusses the application of WAH compressed bitmap indexes to indexing text data for full-text search. The chapter also presents performance characteristics of the proposed indexing technique in three systems, namely MySQL,FastBit, and MonetDB.Chapter 10 proposes the optimization of OLAP queries by means of applyinghorizontal partitioning of tables and bitmap join indexes. The partitioning schema

viiiPrefaceand the set of bitmap indexes are selected by means of genetic and greedy algorithms. The proposed optimization techniques are validated experimentally.Chapter 11 discusses the application of the x-BR-tree index to spatial data. Thechapter provides an analytical cost model of spatial queries with the support of thex-BR-tree index, followed by its experimental evaluation.Chapter 12 proposes a formal model for representing spatio-temporal and nonspatial data about moving objects. Based on the model, the authors propose a querylanguage allowing to query such data. The main idea is based on replacing objecttrajectories by sequences of object stops and moves.Chapter 13 addresses the problem of data mining in a multidimensional spacein a data warehouse. The authors propose a compact representation of sequentialpatterns, called closed multidimensional sequential patterns, which allows to reducethe search space. The proposed representation and mining algorithms are followedby experimental evaluation.Chapter 14 presents issues on modeling and querying temporal semistructureddata warehouses. Such a DW is modeled as a graph with labeled nodes and edges.The temporal aspect is added to the graph by means of labels denoting validitytimes. The model is supported with a query language based on path expressions.Chapter 15 contributes a multidimensional data model of a data warehouse,called "galaxy", that supports the analysis of XML documents. The authors alsopropose a technique and a tool for integrating XML documents and loading theminto a DW.AcknowledgementsThe editors would like to acknowledge the help of all involved in the reviewprocess of this special issue of the Annals of Information Systems. The reviewersprovided comprehensive, critical, and constructive comments. Without their supportthe project could not have been satisfactorily completed.The alfabetically ordered list of reviewers includes: Carlo Combi, University of Verona, ItalyKaren Davis, University of Cincinnati, USAPedro Furtado, University of Coimbra, PortugalMarcin Gorawski, Silesian University of Technology, PolandCarlos Hurtado, Universidad de Chile, ChileKrzysztof Jankiewicz, Poznań Unviersity of Technology, PolandRokia Missaoui, Universite du Quebec en Outaouais, CanadaTadeusz Morzy, Poznań Unviersity of Technology, PolandMikolaj Morzy, Poznań Unviersity of Technology, PolandStefano Rizzi, University of Bologna, ItalyAlkis Simitsis, IBM Almaden Research Center, USA

Prefaceix Jerzy Stefanowski, Poznań Unviersity of Technology, Poland Alejandro Vaisman, University of Buenos Aires, Argentina Kesheng Wu, University of California, USAThe editors would also like to thank Professor Ramesh Sharda for his invitationto guest edit this special issue.Gliwice and Poznań, PolandJune 2008Stanisław KozielskiRobert Wrembel

Contents12Warehousing The World: A Vision for Data Warehouse Research . . .Torben Bach Pedersen1.1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1.2Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1.2.1Data Analysis Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . .1.2.2Point-To-Point Integration . . . . . . . . . . . . . . . . . . . . . . . . . .1.2.3Integration Using Data Warehouses . . . . . . . . . . . . . . . . . .1.2.4Business Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1.3Challenges From New Types Of Data . . . . . . . . . . . . . . . . . . . . . . . .1.4Current Approaches To Integrating Different Types Of Data . . . . . .1.4.1Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1.4.2An Example: Multidimensional Text Integration . . . . . .1.4.3Taking A Step Back . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1.5Warehousing The World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1.5.1The World Warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1.5.2World Warehouse Challenges . . . . . . . . . . . . . . . . . . . . . . .1.6Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Near Real Time ETL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Panos Vassiliadis and Alkis Simitsis2.1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.2Traditional ETL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.2.1Operational issues and challenges . . . . . . . . . . . . . . . . . . . .2.2.2Variations of the traditional ETL architecture . . . . . . . . . .2.3The case for Near Real Time ETL . . . . . . . . . . . . . . . . . . . . . . . . . . .2.3.1Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.3.2General architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.3.3Industrial approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.3.4The infrastructure of near real time ETL . . . . . . . . . . . . . .2.3.5Research and engineering issues for near real time ETL .11334578910111112121416161919212224252526283133xi

xii345Contents2.3.6Summing up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.4Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2.5Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43434848Data Warehousing Meets MDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Jose-Norberto Mazón and Juan Trujillo3.1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3.2Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3.3Multidimensional Modeling of Data Warehouses with MDA . . . . .3.3.1A Multidimensional Computation Independent Model . . .3.3.2A Multidimensional Platform Independent Model . . . . . .3.3.3Reconciling a Multidimensional Model with Data Sources3.3.4Multidimensional Platform Specific Models . . . . . . . . . . .3.3.5From PIM to PSM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3.4Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3.5Conclusions and Ongoing Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . .References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51Personalization and OLAP Databases . . . . . . . . . . . . . . . . . . . . . . . . . .Franck Ravat and Olivier Teste4.1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.1.1Paper Issue and Related Works . . . . . . . . . . . . . . . . . . . . . .4.1.2Contributions and Paper Outline . . . . . . . . . . . . . . . . . . . . .4.2Personalized Multidimensional Modelling . . . . . . . . . . . . . . . . . . . .4.2.1Concepts and formalisms . . . . . . . . . . . . . . . . . . . . . . . . . . .4.2.2Personalization Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.3Personalized OLAP Manipulations . . . . . . . . . . . . . . . . . . . . . . . . . . .4.3.1Multidimensional Table . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.3.2Extended OLAP Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . .4.3.3Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.4A Personalized Multidimensional Database System . . . . . . . . . . . . .4.4.1Database System Architecture . . . . . . . . . . . . . . . . . . . . . . .4.4.2R-OLAP storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.4.3A Rule Definition Language . . . . . . . . . . . . . . . . . . . . . . . .4.4.4Personalization Process . . . . . . . . . . . . . . . . . . . . . . . . . . . .4.5Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687889091A Metamodel for the Specification of Geographical DataWarehouses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93Valéria Cesário Times, Robson do Nascimento Fidalgo, Rafael Leão daFonseca, Joel da Silva, and Anjolina Grisi de Oliveira5.1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 935.2Formal Definitions for a Geographical Data Warehouse Metamodel 955.3A Metamodel for GDW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

Contentsxiii5.4A CASE Tool for Modelling Geographical Data Warehouses . . . . . 1035.5Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1075.6Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1105.7Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1136Pushing Theoretically-Founded Probabilistic Guarantees inHighly-Efficient OLAP Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115Alfredo Cuzzocrea and Wei Wang6.1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1166.2Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1186.3TP-Tree Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1206.4Efficiently Detecting, Indexing, and Querying Outliers ofMultidimensional Data Cubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1226.5TP-Tree: A Tree-Like, “Self-Adjusting” Synopsis Data Structurefor Approximate Query Answering in OLAP . . . . . . . . . . . . . . . . . . 1256.5.1Tunable Partitions of Data Cubes . . . . . . . . . . . . . . . . . . . . 1256.5.2TP-Tree Data Organization . . . . . . . . . . . . . . . . . . . . . . . . . 1266.5.3Updating the TP-Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1276.5.4Deriving the Theoretical Bound . . . . . . . . . . . . . . . . . . . . . . 1296.6Querying the TP-Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1306.7Experimental Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1326.7.1Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1336.7.2Synthetic Query-Workloads against Synthetic Data Cubes 1376.7.3Synthetic Query-Workloads against Real Data Cubes . . . . 1396.7.4Benchmark Queries against Real Data Cubes . . . . . . . . . . 1406.7.5Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1416.7.6Construction Cost with Synthetic Query-Workloadsagainst Synthetic Data Cubes . . . . . . . . . . . . . . . . . . . . . . . . 1416.8Conclusions and Future Research Efforts . . . . . . . . . . . . . . . . . . . . . 142References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1427Closed Cube Lattices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145Alain Casali, Sebastien Nedjar, Rosine Cicchetti, and Lotfi Lakhal7.1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1457.2Cube Lattice Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1487.3Closed Cube Lattices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1517.4Lattice Isomorphism and Algorithmic Aspects . . . . . . . . . . . . . . . . . 1537.5Relationships between Closed Cubes and Quotient Cubes . . . . . . . . 1547.6Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1567.7Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

xivContents8Design and Analysis of Index Structures in MultiVersion DataWarehouses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165Khaled Jouini and Geneviève Jomier8.1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1658.2Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1678.2.1DW Versions and Entity Versions . . . . . . . . . . . . . . . . . . . . 1678.2.2DW Version Range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1698.3Indexing Data in MultiVersion Data Warehouses . . . . . . . . . . . . . . . 1698.3.1Primary Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1698.3.2Secondary Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1738.4Analysis and Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1748.4.1Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1748.4.2Steady State Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1748.4.3The B V-tree as primary index . . . . . . . . . . . . . . . . . . . . . . 1768.4.4The OB tree as primary index . . . . . . . . . . . . . . . . . . . . . . . 1778.4.5The BT-Tree as primary index . . . . . . . . . . . . . . . . . . . . . . . 1788.4.6Secondary indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1798.5Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1808.5.1Storage Cost of Primary Indexes . . . . . . . . . . . . . . . . . . . . . 1818.5.2Query Cost of Primary Indexes . . . . . . . . . . . . . . . . . . . . . . 1828.5.3Query Cost of Secondary Indexes . . . . . . . . . . . . . . . . . . . . 1828.6Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1838.7Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1849Using Bitmap Index for Joint Queries on Structured and Text Data . 187Kurt Stockinger, John Cieslewicz, Kesheng Wu, Doron Rotem, andArie Shoshani9.1Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1889.2Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1899.2.1Indexing techniques for structured data . . . . . . . . . . . . . . . 1899.2.2Database Systems for Full-Text Searching . . . . . . . . . . . . . 1919.2.3Compressing Inverted Files . . . . . . . . . . . . . . . . . . . . . . . . . 1929.3Case Study: The Enron Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1939.4Extending Bitmap Indexes to Support Full Text Search . . . . . . . . . . 1959.5Integrating FastBit into MonetDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1969.5.1Why MonetDB? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1969.5.2Integrating MonetDB and FastBit . . . . . . . . . . . . . . . . . . . . 1989.6Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1989.6.1Data Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1999.6.2Size of Bitmap Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2009.6.3Query performance on structured data . . . . . . . . . . . . . . . . 2019.6.4Query Performance for Text Searching . . . . . . . . . . . . . . . . 2039.6.5Query Performance for both Numerical and Text Data . . . 2049.7Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205

ContentsxvReferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20710HP&BJI: A Combined Selection of Data Partitioning and JoinIndexes for Improving OLAP Performance . . . . . . . . . . . . . . . . . . . . . . 211Kamel Boukhalfa, Ladjel Bellatreche, and Zaia Alimazighi10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21210.2 Data Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21410.2.1 Methodology to Horizontal Partition Relational DataWarehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21510.2.2 Horizontal Partitioning Selection Process . . . . . . . . . . . . . . 21610.2.3 Coding Fragmentation Schema . . . . . . . . . . . . . . . . . . . . . . 21710.2.4 Limitation of the Proposed Coding . . . . . . . . . . . . . . . . . . . 21810.2.5 Effect of Horizontal Partitioning on Queries . . . . . . . . . . . 21910.3 Bitmap Join Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21910.3.1 Complexity Study of Selecting BJIs . . . . . . . . . . . . . . . . . . 22010.3.2 Formulation of BJI Selection Problem . . . . . . . . . . . . . . . . 22110.4 Identification of Similarity between Horizontal Partitioning andBJIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22110.5 Description of HP&BJI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22310.5.1 Formalization of the Combined Selection . . . . . . . . . . . . . 22410.5.2 Genetic Algorithm for Selecting Horizontal Schema . . . . 22510.5.3 Greedy Algorithm for Generating BJIs . . . . . . . . . . . . . . . . 22510.6 Experimental Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22610.6.1 Evaluation of four Optimization Approaches . . . . . . . . . . . 22710.6.2 Validation on ORACLE10g . . . . . . . . . . . . . . . . . . . . . . . . . 22910.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23211Cost Model for X-BR-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235Marcin Gorawski and Marcin Bugdol11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23511.2 X-BR-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23611.2.1 X-BR-tree structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23611.2.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23811.3 Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24211.3.1 The Number of Leaves of X-BR-tree . . . . . . . . . . . . . . . . . 24211.3.2 Estimation of Cost Model . . . . . . . . . . . . . . . .

New Trends in Data Warehousing and Data Analysis . Annals of Information Systems Volume 1: . warehousing complex data (e.g, XML, multimedia, object, spatial). Typically, a DW applies the so-called multidimensional data model. In this model, analyzed data, called facts, are referenced by multiple dimensions that set up the context of an .