An Overview Of Business Intelligence Technology

Transcription

review articlesdoi:10.1145/1978542.1978562BI technologies are essential to runningtoday’s businesses and this technologyis going through sea changes.By Surajit Chaudhuri, Umeshwar Dayal,and Vivek NarasayyaAn Overviewof BusinessIntelligenceTechnologyis a collectionof decision support technologies for the enterpriseaimed at enabling knowledge workers such asexecutives, managers, and analysts to make betterand faster decisions. The past two decades have seenexplosive growth, both in the number of productsand services offered and in the adoption of thesetechnologies by industry. This growth has beenfueled by the declining cost of acquiring and storingvery large amounts of data arising from sourcessuch as customer transactions in banking, retailas well as in e-businesses, RFID tags for inventorytracking, email, query logs for Web sites, blogs, andproduct reviews. Enterprises today collect data at afiner granularity, which is therefore of much largervolume. Businesses are leveraging their data assetB usiness intel l i g e n c e ( BI ) soft wa r e88communications of th e ac m au g ust 2 0 1 1 vo l . 5 4 n o. 8aggressively by deploying and experimenting with more sophisticated dataanalysis techniques to drive businessdecisions and deliver new functionality such as personalized offers and services to customers. Today, it is difficultto find a successful enterprise thathas not leveraged BI technology forits business. For example, BI technology is used in manufacturing for order shipment and customer support,in retail for user profiling to targetgrocery coupons during checkout, infinancial services for claims analysisand fraud detection, in transportation

for fleet management, in telecommunications for identifying reasons forcustomer churn, in utilities for powerusage analysis, and health care foroutcomes analysis.A typical architecture for supporting BI within an enterprise is shown inFigure 1 (the shaded boxes are technology that we focus on in this article). Thedata over which BI tasks are performedoften comes from different sources—typically from multiple operational databases across departments within theorganization, as well as external vendors. Different sources contain dataof varying quality, use inconsistentrepresentations, codes, and formats,which have to be reconciled. Thus theproblems of integrating, cleansing,and standardizing data in preparationfor BI tasks can be rather challenging.Efficient data loading is imperative forBI. Moreover, BI tasks usually need tobe performed incrementally as newdata arrives, for example, last month’ssales data. This makes efficient andscalable data loading and refresh capabilities imperative for enterprise BI.These back-end technologies for preparing the data for BI are collectivelykey insights The cost of data acquisition and datastorage has declined significantly. Thishas increased the appetite of businessesto acquire very large volumes in order toextract as much competitive advantagefrom it as possible. New massively parallel data architectures and analytic tools go beyondtraditional parallel SQL data warehousesand OLAP engines. The need to shorten the time lagbetween data acquisition and decisionmaking is spurring innovations inbusiness intelligence technologies.au g u st 2 0 1 1 vo l . 5 4 n o. 8 c o m m u n ic at i o n s o f t he acm89

review articlesFigure 1. Typical business intelligence architecture.DatasourcesData movement,streaming enginesData nsSearchExternal DataSourcesOperationalDatabasesExtract TransformLoad (ETL)communi cations of th e ac mOLAPServerEnterprisesearch engineSpreadsheetComplex EventProcessing Enginereferred to as Extract-Transform-Load(ETL) tools. Increasingly there is a needto support BI tasks in near real time,that is, make business decisions basedon the operational data itself. Specialized engines referred to as ComplexEvent Processing (CEP) engines haveemerged to support such scenarios.The data over which BI tasks areperformed is typically loaded into arepository called the data warehousethat is managed by one or more datawarehouse servers. A popular choice ofengines for storing and querying warehouse data is relational database management systems (RDBMS). Over thepast two decades, several data structures, optimizations, and query processing techniques have been developed primarily for executing complexSQL queries over large volumes of data—a key requirement for BI. An example of such an ad hoc SQL query is: findcustomers who have placed an orderduring the past quarter whose amountexceeds the average order amount by atleast 50%. Large data warehouses typically deploy parallel RDBMS engines sothat SQL queries can be executed overlarge volumes of data with low latency.As more data is born digital, there isincreasing desire to architect low-costdata platforms that can support muchlarger data volume than that traditionally handled by RDBMSs. This is oftendescribed as the “Big Data” challenge.Driven by this goal, engines based onthe MapReduce9 paradigm—originallybuilt for analyzing Web documentsand Web search query logs—are nowbeing targeted for enterprise analyt90RelationalDBMSMapReduceengineDashboardData mining,text analyticenginesics. Such engines are currently beingextended to support complex SQL-likequeries essential for traditional enterprise data warehousing scenarios.Data warehouse servers are complemented by a set of mid-tier servers thatprovide specialized functionality fordifferent BI scenarios. Online analyticprocessing (OLAP) servers efficientlyexpose the multidimensional view ofdata to applications or users and enable the common BI operations suchas filtering, aggregation, drill-downand pivoting. In addition to traditional OLAP servers, newer “in-memoryBI” engines are appearing that exploittoday’s large main memory sizes todramatically improve performance ofmultidimensional queries. Reportingservers enable definition, efficient execution and rendering of reports—forexample, report total sales by regionfor this year and compare with salesfrom last year. The increasing availability and importance of text data such asproduct reviews, email, and call centertranscripts for BI brings new challenges. Enterprise search engines supportthe keyword search paradigm over textand structured data in the warehouse(for example, find email messages,documents, history of purchases andsupport calls related to a particularcustomer), and have become a valuabletool for BI over the past decade. Datamining engines enable in-depth analysis of data that goes well beyond whatis offered by OLAP or reporting servers,and provides the ability to build predictive models to help answer questionssuch as: which existing customers are au g ust 2 0 1 1 vo l . 5 4 n o. 8ReportingServerAd hocquerylikely to respond to my upcoming catalog mailing campaign? Text analytic engines can analyze large amounts of textdata (for example, survey responses orcomments from customers) and extract valuable information that wouldotherwise require significant manualeffort, for example, which products arementioned in the survey responses andthe topics that are frequently discussedin connection with those products.There are several popular frontend applications through which usersperform BI tasks: spreadsheets, enterprise portals for searching, performance management applications thatenable decision makers to track keyperformance indicators of the business using visual dashboards, toolsthat allow users to pose ad hoc queries, viewers for data mining models,and so on. Rapid, ad hoc visualizationof data can enable dynamic exploration of patterns, outliers and help uncover relevant facts for BI.In addition, there are other BI technologies (not shown in Figure 1) suchas Web analytics, which enables understanding of how visitors to a company’s Web site interact with the pages;for example which landing pages arelikely to encourage the visitor to makea purchase. Likewise, vertical packaged applications such as customerrelationship management (CRM) arewidely used. These applications oftensupport built-in analytics, for example, a CRM application might providefunctionality to segment customersinto those most likely and least likelyto repurchase a particular product.

review articlesAnother nascent but important area ismobile BI that presents opportunitiesfor enabling novel and rich BI applications for knowledge workers on mobile devices.In this short article, we are notable to provide comprehensive coverage of all technologies used in BI (seeChaudhuri et al.5 for additional detailson some of these technologies). Wetherefore chose to focus on technologywhere research can play, or has historically played, an important role. Insome instances, these technologies aremature but challenging research problems still remain—for example, datastorage, OLAP servers, RDBMSs, andETL tools. In other instances, the technology is relatively new with severalopen research challenges, for example,MapReduce engines, near real-time BI,enterprise search, data mining and textanalytics, cloud data services.Data StorageAccess structures. Decision supportqueries require operations such asfiltering, join, and aggregation. To efficiently support these operations,special data structures (not typicallyrequired for OLTP queries) have beendeveloped in RDBMSs, described here.Access structures used in specializedOLAP engines that do not use RDBMSsare discussed later.Index structures. An index enablesassociative access based on values of aparticular column. When a query hasone or more filter conditions, the selectivities of these conditions can beexploited through index scans (for example, an index on the StoreId columncan help retrieve all sales for StoreId 23) and index intersection (when multiple conditions exist). These operations can significantly reduce, and insome cases eliminate, the need to access the base tables, for example, whenthe index itself contains all columnsrequired to answer the query. Bitmapindexes support efficient index operations such as union and intersection. Abitmap index on a column uses one bitper record for each value in the domainof that column. To process a query ofthe form column1 val1 AND column2 val2 using bitmap indexes, we identifythe qualifying records by taking the bitwise AND of the respective bit vectors.While such representations are veryToday, it is difficultto find a successfulenterprise thathas not leveragedBI technology fortheir business.effective for low cardinality domains(for example, gender), they can also beused for higher cardinality domainsusing bitmap compression.Materialized views. Reporting queries often require summary data, forexample, aggregate sales of the mostrecent quarter and the current fiscalyear. Hence, precomputing and materializing summary data (also referredto as materialized views) can help dramatically accelerate many decisionsupport queries. The greatest strengthof a materialized view is its ability tospecifically target certain queries by effectively caching their results. Howeverthis very strength also can limit its applicability, that is, for a slightly different query it may not be possible to usethe materialized view to answer thatquery. This is in contrast to an index,which is a much more general structure, but whose impact on query performance may not be as dramatic asa materialized view. Typically, a goodphysical design contains a judiciousmix of indexes and materialized views.Partitioning. Data partitioning canbe used to improve both performance(discussed later) and manageability.Partitioning allows tables and indexesto be divided into smaller, more manageable units. Database maintenanceoperations such as loading and backupcan be performed on partitions ratherthan an entire table or index. The common types of partitioning supportedtoday are hash and range. Hybridschemes that first partition by rangefollowed by hash partitioning withineach range partition are also common.Column-oriented storage. Traditional relational commercial database engines store data in a row-oriented manner, that is, the values of all columnsfor a given row in a table are storedcontiguously. The Sybase IQ product30pioneered the use of column-orientedstorage, where all values of a particularcolumn are stored contiguously. Thisapproach optimizes for “read-mostly”workloads of ad hoc queries. The column-oriented representation has twoadvantages. First, significantly greaterdata compression is possible than ina row-oriented store since data valueswithin a column are typically muchmore repetitive than across columns.Second, only the columns accessed inthe query need to be scanned. In con-au g u st 2 0 1 1 vo l . 5 4 n o. 8 c o m m u n ic at i o n s o f t he acm91

review articlesFigure 2. Multidimensional data.thon MarFebJan ProductMDimensional ryQuarterCityProductWeekToothpaste NewYorkMonthDateL.A. Chicago Citytrast, in a row-oriented store, it is noteasy to skip columns that are not accessed in the query. Together, this canresult in reduced time for scanninglarge tables.Finally, we note that in the past decade, major commercial database systems have added automated physicaldesign tools that can assist database administrators (DBAs) in choosing appropriate access structures (see Chaudhuriand Narasayya7 for an overview) basedon workload information, such as queries and updates executed on the system, and constraints, for example, totalstorage allotted to access structures.Data Compression can have significant benefits for large data warehouses.Compression can reduce the amountof data that needs to be scanned, andhence the I/O cost of the query. Second,since compression reduces the amountof storage required for a database, it canalso lower storage and backup costs.A third benefit is that compression effectively increases the amount of datathat can be cached in memory sinc

BUsiness intelliGence (Bi) sOFtwareis a collection of decision support technologies for the enterprise aimed at enabling knowledge workers such as executives, managers, and analysts to make better and faster decisions.