Sybase IQ Bloor Report 2011

Transcription

InDetailSybase IQ 15.3An InDetail Paper by Bloor ResearchAuthor : Philip HowardPublish date : June 2011

it is our view that SybaseIQ merits careful review byorganisations investigatingdata warehousing, highspeed analytics and businessintelligence optionsPhilip Howard

Sybase IQ 15.3Executive summaryFast factsSybase IQ is a column-based relational database that has been designed specifically foranalytics and business intelligence applications. It can offer a number of very significantadvantages within a data warehousing environment, including performance, scalability andcost of ownership benefits, when compared toconventional approaches.Although the product has broad applicability, the company targets a number of primarymarkets. These include: Data warehouses for aggregators of data,typically those who offer multi-client dataand analytics services. Advanced analytics—where there is a significant requirement to support complexand unpredictable queries, either as a datawarehouse in its own right or as an analyticsaccelerator, where Sybase IQ is complementary to an existing warehouse. As a report accelerator, offloading high performance reports from operational databases or centralised data warehouses thatperform too slowly to meet business needs. Applications where there are significantamounts of unstructured elements (for example, documents and images for insuranceclaims) and where very large quantities ofdata need to kept on-line—that may be subject to querying—for a considerable periodof time.More generally, and bearing in mind that Sybase is an SAP company, Sybase IQ is targetedat environments where users have heterogeneous or non-SAP application environments.Where SAP is the dominant provider of applications these accounts will be targeted by SAPBW and SAP’s in-memory technology productfor data management, HANA.The benefits associated with Sybase IQ arepredicated upon its column-based approach,its scalable grid architecture, and the performance enhancements (often measured inorders of magnitude) that it can offer, whencompared to row-based database solutions,while requiring fewer hardware resources.This is especially true where queries are complex or require large table scans and, in thelatter case, this has the knock-on advantageA Bloor InDetail Paper1that you do not have to pre-aggregate data,which represents both a performance and amanagement saving when compared to traditional approaches to data warehousing. Thereduced size of Sybase IQ data warehouses(along with other features of the product) alsomeans that Sybase IQ has the potential to offersignificant performance advantages when scaling for large numbers of users.In other words, Sybase aims to provide betterperformance with a lower total cost of ownership. Moreover, apart from the fact that data isstored by column, in all other respects SybaseIQ acts exactly like a conventional relationaldatabase. For instance, you use standard SQL,hardware and operating systems: databaseschemas are (or may be) the same, as areapplications; and training requirements aresimilar, in that you can add a column as easilyas a row, and so on.Parallelism in Sybase IQParallelism is particularly important in supporting high performance query processingfor large and complex environments. In theoriginal 15.0 release of Sybase IQ, the company introduced multiple write nodes into itsMultiplex architecture (it had previously onlysupported multiple read nodes) along withquery parallelisation and pipeline parallelisation (which will speed up both query and loadperformance). Now, in this release, the company has extended its parallel capabilities stillfurther with the introduction of what Sybasecalls its PlexQ architecture, which extendsSybase IQ into supporting massively parallelprocessing (MPP). Previously, you could parallelise a query within a node and you could havequeries running in parallel on different nodesbut now you can also have a single query distributed across multiple nodes. This can havesignificant performance advantages for appropriate queries.Key findingsIn the opinion of Bloor Research, the followingrepresent the key facts of which prospectiveusers should be aware: In addition to its column-based storage,Sybase IQ delivers a number of specialisedindexes in order to further accelerate ad hocquery performance. These include indexesfor low cardinality data (which further reduces storage requirements, and improves 2011 Bloor Research

Sybase IQ 15.3Executive summaryquery performance, through the use of token isation), grouped data, range data, joinedcolumns, real-time comparisons for Webapplications, date, and time analysis. In addition, there are textual analysis indexes(providing analytics on unstructured data thatmay be combined with structured analysis),with capabilities that have been extended inthis release through the introduction of indatabase text search and analytics. Both multi-threading and 24/7 high availability features (including partnerships withrelevant storage vendors for high availability and disaster recovery) are availablewith Sybase IQ. In particular, separate readand write nodes allow for procedures to beexecuted in parallel, without affecting oneanother. Read nodes are particularly usefulfor data aggregators offering multi-clientanalytics services because a node can beassigned to an individual account for laterchargeback. The PlexQ architecture (see above) is augmented by the Sybase IQ optimiser, whichhas been enhanced in this release to recognise when a query will benefit from beingdistributed and to which nodes the queryshould be distributed. Sybase IQ offers significant performanceadvantages when compared to conventionalapproaches. Apart from the features alreadymentioned, it also supports Rcube flat schemas that can provide major benefits whencompared to conventional star schemas. Inparticular, Rcubes can significantly speed upimplementation, as well as improve run-timeperformance and provide increased flexibility. In addition, Sybase IQ allows on-the-flychanges to schema attributes (columns);that is, you can add/delete columns in a tablewhile the Sybase IQ server is up and running. Sybase IQ provides standard ODBC/JDBC/OLE-DB connections to its query engine,thereby enabling access from any standards-based front-end BI tool. Sybase IQ iscertified to work with most industry leadingtools such as SAP BusinessObjects, Cognos,MicroStrategy, QlikView, iDashboards, SAS,SPSS, KXEN and others. As part of SAP, Sybase IQ is being optimised for use withSAP BusinessObjects. Sybase IQ also offers optimised ETL (and ELT)capabilities as a separate add-on product enabling developers to quickly build and deploytheir data sets for analysis on Sybase IQ. Alternatively, Sybase IQ is also certified to workwith leading third party ETL tools and, again,there are specific optimisations built for ETLfunctionality inside SAP BusinessObjectsData Services. Sybase IQ also supports loading data directly from a client, bypassing theDBA and removing the need for him or her tointervene in the loading process. This is important for environments where, for securityor confidentiality reasons, the DBA shouldnot be able to see the data. For situations where operational data thatyou want to query cannot (for compliance orother reasons) be loaded into the data warehousing environment, Sybase IQ supportsquery federation with data held in SybaseASE, Oracle, MySQL, SQL Anywhere, andSQL Server environments. Sybase provides column-based encryp- Sybase IQ can also be loaded on a continuoustion capabilities as well as database-levelencryption. This is particularly importantfor data aggregators with multi-client services where you want to be able to encryptdifferent customer’s data using differentalgorithms. Encryption is supported for bothdata-at-rest and data-in-flight.near-real time basis using an infrastructurecomprised of Sybase Replication Server(which has real-time loading capabilities),and a set of scripts generated by SybasePowerDesigner. Sybase IQ can also be loaded on a real-time basis with events data viaSybase ESP (Event Streaming Platform). Simultaneous loading and querying is providedvia Sybase IQ’s versioning capability—a newversion is created for the load process whilethe queries run on the older version until thenew load is committed. In-database analytics provides much betterperformance than traditional approachesto analytics. It is supported through the useof user-defined functions that are treated 2011 Bloor Researchas part of Sybase IQ’s relevant SQL functions. This means that the analytic processes can take full advantage of SybaseIQ’s optimiser and parallel capabilities. Thecompany has a partnership (and there willbe others) with Fuzzy Logix to exploit thesecapabilities in Sybase IQ and with VisualNumerics for Sybase RAP (Sybase RAPembeds Sybase IQ).2A Bloor InDetail Paper

Sybase IQ 15.3Executive summary Information lifecycle management, whichsupports the archival of data from frontline to near-line to historic storage, wasintroduced with Sybase 1Q 15. The ability toformally build data retention rules is supported by means of PowerDesigner, a leadingmodelling and metadata management tool,and WorkSpace Data Analytics, which is anEclipse-based development environment thatsupports database, reporting and analyticdevelopment in conjunction with Sybase IQ.The bottom lineUnlike standard row-based databases thatwere originally designed for online transactionprocessing, Sybase IQ has been engineeredspecifically for query processing and ad hocanalysis. As a result, Sybase IQ can offer significant performance and total cost of ownership advantages over traditional products forquery-intensive computing requirements. Interms of performance, these benefits havebeen significantly extended in this release withthe introduction of the PlexQ architecture.These advantages will be most obvious in environments where the query loads are unpredictable, composed largely of ad hoc enquiries.In this scenario, traditional databases cannotbe pre-tuned for unexpected queries, but thecolumn-based approach used by Sybase IQprovides highly effective self-tuning capabilities. In addition, complex queries that involvemultiple selection criteria across a variety oftables, and those involving large table scans,can be deployed much more efficiently withina column-based environment. Finally, SybaseIQ scales well for large data stores containingfinely detailed transactions and sub-transactions, such as clickstream data. Sybase IQdoes not require data to be pre-aggregatedfor analysis, allowing users to efficiently andquickly analyse atomic level data.Sybase IQ’s underlying columnar architectureis more efficient: servers have to do much lesswork to answer any particular query whendata is organised by column. Of course, Sybasehas augmented this basic design advantagein other ways but this is its key differentiator, along with its reduced cost of ownership.This is achieved in two ways: first throughlower absolute price, which is both a functionof this improved performance as well as thecompression techniques that Sybase IQ applies against individual columns. This reducesdisk requirements and, consequently, theA Bloor InDetail Paper3necessary investment in hardware. Secondly,this combines with the product’s reduced administration and tuning requirements to produce significantly less management overheadwhen compared with that of the traditionalenterprise data warehouse vendors.The data warehouse landscape, however, isnow no longer the sole domain of the traditionalsuppliers. Today, there is significant interest indata warehouse appliances. In one sense, themarket entry of these specialists has helpedSybase IQ because it has called into questionthe dominance exerted by the major providers of row-based databases as organisationsre-think their available options. It has alsovalidated the column-based approach. Therise of purpose-built appliances has raised theinterest in Sybase IQ for the report ‘accelerator’ market in particular. However, interest indata warehouse appliances also introduces anew class of competitors that claim similarperformance improvements with an excellentease of use profile.Where Sybase has an advantage over these appliance vendors is that Sybase IQ offers moreflexible tuning abilities. Most appliance vendors eschew the use of indexes completely oronly use them in very limited circumstances.Sybase IQ provides a range of indexing options.While these increase the amount of administration required, they make the product farmore adaptable to variations in data cardinalityand datatypes. Also, a built-in index advisorremoves some of the mystery around selectionand creation of appropriate indexes. Further,we know of no data warehouse appliance thatis currently able to support text analytics (asSybase IQ does) and the ability to managemixed query workloads is also typically limitedwhen compared to what Sybase can offer. Itis also worth noting that Sybase IQ can easilyand linearly scale up to support a large dataset and user workload. Additional disk spacecan be added to the shared disk pool to supportgrowing data sets and read and write nodesmay be independently added to the Sybase IQgrid in small increments to support an expanding user base.To conclude, Sybase IQ is well positioned tocompete with both the traditional and appliancevendors. While it has different advantages in different environments it is our view that SybaseIQ merits careful review by organisationsinvestigating data warehousing, high-speedanalytics and business intelligence options. 2011 Bloor Research

Sybase IQ 15.3The productThe current version number of Sybase IQ isversion 15.3. Supporting products, includingSybase IQ InfoPrimer (formerly Sybase ETL),Sybase PowerDesigner, Sybase ReplicationServer and the WorkSpace Database AnalyticsIDE, all leverage the features available withinSybase IQ. In the case of Sybase IQ InfoPrimerthis is specifically optimised to be used inconjunction with Sybase IQ and only supportsSybase IQ as a target.The product runs under Windows, Linux (RedHat and SuSE) and the leading UNIX operatingsystems from HP, IBM and Sun Microsystems(Oracle). Language support for analytic developers includes Perl, Python, PHP, ADO.Net,OLE-DB, and Ruby on Rails amongst others.Logical architectureSybase IQ has been designed specifically fordata warehousing. That is to say, it is not optimised for transaction processing and, therefore, it does not include the sort of facilities youwould need for transaction processing, as opposed to data warehousing. This is importantbecause the leading merchant databases offerboth sets of capabilities so that Sybase IQ hasa smaller footprint and is less complex thanthese offerings.Sybase IQ also differs from merchant databasesin that it is a column-based relational databaserather than a row-based relational database.The latter is required for transaction processingwhere individual records (rows) are constantlybeing inserted into the database and updated.Conversely, column-based databases havesignificant advantages when it comes to queryprocessing because each column is, effectively,an index but without any of the overheads associated with defining and storing those indexes.Moreover, every column is indexed in thissense, something that would never normallybe possible when using a row-based approach.That said, Sybase also supports a number ofindex types that you can optionally implement:these are discussed later.Another major advantage of a column-basedapproach is simply the amount of data thatneeds to be read for each query. Whenever youaccess data for a query from a conventionaldatabase, you read each row in its entirety,regardless of the actual fields that you are interested in for that specific query. In practice,this might mean reading a 3000 byte recordto retrieve just 20 characters of data but by 2011 Bloor Research4reading data on a columnar basis, you onlyhave to read what is specifically needed for thequery at hand. Of course the difference in performance when you are reading a single recordwill be negligible, but many queries requirefull table scans. Multiply that single read bya few million rows per table and the performance difference is very significant.A further consequence of using a columnbased approach is that you typically do not employ conventional horizontal partitioning, whichis predicated upon a row-based approach.Instead, Sybase IQ implements vertical partitioning: partitioning by column rather than byrow. One of the advantages of this approach isthat partitions can never become unbalanced,since there will always be the same number offields in each column of a table. This significantly reduces the maintenance requirementof managing partitions and should eliminatethe database re-organisation that may becomenecessary when conventional partitions becomeunbalanced and start to impair performance.In addition, columns are easy to compressbecause you can have different algorithms fordifferent datatypes. As a result, Sybase has hadan historic advantage over the merchant databases in this area. This has been whittled awaynow, though Sybase IQ still has advantages. Inparticular, Sybase claims that a Sybase IQ datawarehouse will never exceed the size of theraw data: this is by no means the case with itsmerchant rivals.Physical architectureThe architecture of Sybase IQ is illustrated inFigure 1. It is a shared everything massivelyparallel architecture, with every node connected to every other node in a full mesh providedby the interconnect. This reduces I/O issues andimproves resilience. The only exception to theshared everything approach is that each nodecan have its own, local, temporary storage. Thebig advantage of offering shared everything,and shared disks in particular, is that you donot have to distribute your data across the various disks in use, thereby removing what can bea significant administrative headache.Each node in the Sybase IQ environment isdesignated as either a read/write node or aread only node. In the case of the former, eachnode can be flexibly designated as a read or awrite node, as required. Thus, if you are running a large overnight batch update you mightA Bloor InDetail Paper

Sybase IQ 15.3The productFigure 1: Sybase IQ 15.3 PlexQ architecturewant all of your read/write nodes to operate aswrite nodes, but have them running as readnodes during the day. In addition, you can addnew nodes as needed, dynamically, so that youcan scale up incrementally.Nodes (servers) can be linked into a logicalserver, as shown. In addition, one logical servercan “loan” nodes to other logical servers on ascheduled basis, for example to support over night batch loading.This approach to logical servers supportsmixed query workloads because you can assign particular queries to a logical server andthat query can then only use the resourcesavailable to that logical server. How many logical servers, and the number of nodes withineach group, is entirely up to you. A graphicaladministration tool (see later) is provided tosupport the creation of these logical groupings, add or remove nodes, designate read onlyor read/write nodes and so on.Each logical server can have its own logins sothat specific users, departments or queriescan always be directed to a designated logicalserver.When a query is received the receiving nodeis designated as the “leader” node and, if theoptimiser (see below) decides that this querycan benefit from distribution, then the othernodes to be involved in processing this queryare designated as “worker” nodes. Any nodeA Bloor InDetail Paper5can be a leader or worker but only one nodecan be a leader for any particular query.There are a number of functions specifically tosupport high speed loading. The first is that theproduct supports pipeline parallelism so that,where indexes have separate data structures(which apply to Word and High Group indexes—see later), these can be updated at the sametime as data is being loaded.It was historically the case that, in many environments, the data was loaded to the serverand then the database administrator pushedthe data to the warehouse. However, thismeans that the DBA can see the data, whichis not acceptable in many environments (forexample, if the warehouse is outsourced), andmany clients want to load the data directly. Inorder to support this, Sybase supports a “loadfrom client” option that supports the loading ofboth data and LOBs (large Objects) via DBLib.Another major feature is support for information lifecycle management (ILM). There arespecific features to support different forms ofstorage for archival purposes: adding nearline and historical storage capabilities to activedata. These can be designated as read-onlystores for compliance purposes, if required,and, similarly, you can apply different securitypolicies to each store. There are specific facilities provided to support time-based retentionperiods with data first being marked as readonly and then dropped. 2011 Bloor Research

Sybase IQ 15.3The productAnother point to note about this architectureis the advantage that it offers to data aggregators and resellers (which represent a targetmarket for Sybase), because it means thateach subscriber can have its own read andread/write nodes, separate from anyone else,which obviously has beneficial security as wellas chargeback implications; and it also allows you to define different service levels fordifferent users. Further, Sybase IQ allows youto encrypt data on a column-by-column basis,which further reinforces this message. In fact,the product supports three levels of encryption: RSA and strong encryption (ECC) for datain flight plus RSA and RSA with FIPS 140-2strong encryption for data at rest.Should any node fail, you can switch users orresponsibilities to another node. Hot standby,failover and load-balancing are possible acrossnodes. These functions are not automated butare under the DBA’s control, which allows theDBA to dynamically allocate resources basedupon business needs. In addition, there is anOpenSwitch load balancing application available, if required, that operates at the application server level. Sybase IQ InfoPrimer alsohas load-balancing capabilities for Sybase IQdata loading tasks. It is further worth notingthe company’s partnerships with a number ofstorage hardware vendors to further ensurehigh availability and disaster recovery. Thereis support for range partitioning and you canpartition, re-partition, join, rename, split anddrop table partitions as required.There is also a NonStopIQ HA-DR methodology, which typically employs a local SAN and aremote one, with either synchronous or asynchronous communications between them. Thebig advantage of this is not just that it providesdisaster recovery but also that it eliminatesthe need to take the system down—even forplanned outages. Note that, as more and morecompanies adopt operational BI and embedquery capability into operational applications,then the warehouse increasingly becomesas mission-critical as those applications, forwhich you need a solution such as this.There is also support for query federation.This is intended for environments where operational data cannot be moved or copied fromtheir source systems for compliance or otherreasons but which you may want to include inqueries or reports. In this sort of environment,the amount of data to be sourced from theoperational system is typically small so near 2011 Bloor Research6real-time support can be achieved. The queryfederation technology supports Sybase ASE,Oracle and Microsoft SQL Server.One element of the physical architecture thatis not illustrated above is that the product nowhas a built-in web server to enable participation in web services.Query performanceThe Sybase IQ optimiser has been significantlyenhanced in this release to take advantage ofthe newly introduced MPP-based capabilitiesof the product. Previously, the optimiser onlyhad to consider the degree of parallelism thatwould be useful within a node. However, witha massively parallel architecture you havethe ability to distribute queries across nodes(within a logical server if you have one). Sothe first thing that the optimiser does is todetermine whether the query will benefit fromthis sort of parallelism. Not all queries do. Forexample, if the query is I/O bound then extraprocessing capacity may make an insignificantimpact on query performance. Similarly, theoverhead involved in distributing a query maybe deleterious for, say, a short running query.As another example, if a query cannot makeuse of all the resources available on singleserver then, again, it will probably not makesense to distribute the query. The optimisermay decide that no part of a particular queryshould be distributed, that the whole query canusefully be distributed or that a part or parts(fragments) can be distributed.Within each node that is executing a query,threads are allocated dynamically with threadsadded or removed as the query executes.Threads are scaled up or down according toworkload and resource availability. As notedpreviously, physical servers can be dynamically allocated to a logical grouping.Other notable features include the ability todetermine how much parallelism to apply towhich tasks and sub-query correlations, optimised use of temp space, operations that canrun directly off compressed data without requiring decompression, and concurrent work loadmanagement.IndexesAlthough every column is, in effect, its ownindex, there are substantial advantages to using specific indexes in a number of situations.A Bloor InDetail Paper

Sybase IQ 15.3The productThis is one area where Sybase has a majoradvantage over appliance vendors. Indeed, thesecret of Sybase IQ is its indexing capabilities.As Sybase customers discover new needs foranalysis, Sybase can simply create new indextypes to meet those needs. The beauty of thisapproach is that new indexes can be added tothe data warehouse with little, if any, impacton the data warehouse architecture or theanalytical applications using the warehouse.Sybase IQ offers a number of different indexingtechniques: Low Fast indexes: these are low cardinalityindexes (typically used for fields that haveless than 1,500 unique values) that use aprocess known as tokenisation. Using thisprocess, non-integer data is converted intoa token (an integer; an existing integer becomes its own token) and then the tokensare stored rather than the data. This isparticularly useful for reducing the quantity of redundant data and saving on diskspace. Once the tokens are established (anautomated process), a bitmapped index iscreated to reference these tokens. Bit-Wise indexes: for high cardinality fields,where the number of possible values exceeds1,500 (for example, monetary values) SybaseIQ uses a patented technology known asBit-Wise indexing. This is particularly usefulwhere you want to combine calculations withrange searches, for example to find the totalrevenue and number of units sold where theprice was less than 50. High Group indexes: these are, in fact, Btrees. However, the principle here is thatthe user only defines these indexes whenseveral columns are likely to be used ina group, in particular to combine low andhigh-cardinality searches. An example heremight be an inquiry about product itemsales and value (high cardinality) by store(low cardinality). High Group indices are nowmulti-threaded. Fast Projection indexes: the default “index”is simply the column store itself. If a useralways plans to retrieve an entire column ofdata, then the fact that storage is columnarmeans the column can be projected into areport or inquiry without having to explicitlydefine any index at all. This is useful, forexample, in WHERE clauses. The capabilities of this index type have been extended inthe latest release through the support forA Bloor InDetail Paper73-byte sizes that enable addressability for16 million unique values, which will extendperformance and compression for largedata sets. Text indexes: these support full text search.The text index stores complete positionalinformation for every instance of every termin the indexed columns. Some of the functions that are possible with the text index arediscussed in the next section. Compare indexes: this indexing techniqueallows data column comparisons that are effectively equivalent to an “if then else”statement. For example, “if expenses aregreater than revenue, then ”. This typeof index is particularly useful for real-timecomparisons in web applications. Join indexes: as the name implies, theseare designed to obviate the need for table joins. Like a number of the supportedindexes, these will be most useful whenquery requirements can be predicted inadvance. Significant performance enhancements have been made to join indexes inthis release with query parallelism allowingyou to scan columns in parallel and thenjoin in parallel. Time Analytic indexes: these offer the optionto create indexes based on a date, time, ordate and time. It should be noted that timebased queries tend to be particularly difficultfor conventional relational databases tohandle.A number of extended facilities are supportedto allow the use of these indexes in a variety ofcircumstances. These include index compression to reduce disk (or memory: bitmaps maybe cached) requirements, the ability to use different types of index in combination, pipelineparallelism for GROUP BY and ORDER BY aswell as hash and merge joins, and the facility tofilter bit arrays using Boolean operators suchas AND and OR. These features mean that theindexing in Sybase IQ overcomes a numberof the traditional drawbacks of bitmapping,namely, that it is not suitable for joining tablesor aggregating data. While on this topic it isalso worth noting that while Sybase is fastenou

MicroStrategy, QlikView, iDashboards, SAS, SPSS, KXEN and others. As part of SAP, Sybase IQ is being optimised for use with SAP BusinessObjects. Sybase IQ also offers optimised ETL (and ELT) capabilities as a separate add-on product en-abling developers to quickly build and deploy their data sets for analysis on Sybase IQ. Al-