IBM PureData System For Analytics Architecture


Front cover

IBM PureData System for Analytics Architecture
A Platform for High Performance Data Warehousing and Analytics

Redguides for Business Leaders

Phil Francisco

- Take advantage of the power and simplicity of a purpose-built appliance for high speed metrics
- Improve the quality and timeliness of business intelligence
- Query data quickly, efficiently, and economically

Executive overview

Success in any enterprise depends on having the best available information in time to make sound decisions. Anything less can waste opportunities, cost time and resources, and even put the organization at risk. But finding crucial information to guide the best possible actions can mean analyzing billions of data points and petabytes of data, whether to predict an outcome, identify a trend, or chart the best course through a sea of ambiguity. Companies with this type of intelligence on demand can react faster and make better decisions than their competitors.

Continuing innovations in analytics provide companies with an intelligence windfall that benefits all areas of the business. When you need critical information urgently, the platform that delivers this information should be the last thing on your mind. The platform needs to be as simple, reliable, and immediate as a light switch, able to handle almost incomprehensible workloads without complexity getting in the way. It must be built for longevity, with a technology foundation that can sustain performance as more users run increasingly complex workloads and as data volumes continue to grow. Furthermore, to maximize returns to the business, it needs the lowest total cost of ownership.

The IBM PureData System for Analytics, powered by Netezza technology, transforms the data warehouse and analytics landscape with a platform that is built to deliver price-performance with appliance simplicity. It can carry out monumental processing challenges quickly, without barriers or compromises. For users and their organizations, it provides the best intelligence for all who need it, even as demands for information escalate.

As a purpose-built appliance for high-speed analytics, the PureData System for Analytics strength comes from having the right components assembled and working together to maximize performance. Massively parallel processing (MPP) streams combine multi-core CPUs with Netezza’s unique Field Programmable Gate Array (FPGA) Accelerated Streaming Technology (FAST) engines to deliver performance that in many cases exceeds expectations. In addition, the system delivers results with no required indexing or tuning. Appliance simplicity extends to application development, enabling organizations to innovate rapidly and to bring high performance analytics to a wider range of users and processes.

This IBM Redguide publication introduces the Asymmetric Massively Parallel Processing (AMPP) architecture and describes how the system orchestrates queries and analytics to achieve speed. It shows how software and hardware come together to extract the maximum use from every critical component and how a system optimized for thousands of users querying huge data volumes really works. The AMPP architecture is a unique data warehouse and analytics platform whose price-performance is ready for today’s needs and tomorrow’s challenges.

Copyright IBM Corp. 2014. All rights reserved.

Architectural principles

Netezza technology integrates database, processing, and storage in a compact system that is optimized for analytical processing and designed for flexible growth. The system architecture is based on the following core tenets of Netezza technology:

- Processing close to the data source
- Balanced, massively parallel architecture
- Platform for advanced analytics
- Appliance simplicity
- Accelerated innovation and performance improvements
- Flexible configurations and extreme scalability

Processing close to the data source

The Netezza architecture is based on a fundamental computer science principle: when operating on large data sets, do not move data unless absolutely necessary. The system takes advantage of this principle by using commodity components called Field Programmable Gate Arrays (FPGAs) to filter out extraneous data as early in the data stream as possible and as fast as data streams off the disk. This process of data elimination close to the data source removes I/O bottlenecks and frees downstream components (such as CPU, memory, and network) from processing superfluous data, thus having a significant multiplier effect on system performance.

Balanced, massively parallel architecture

The Netezza architecture combines the best elements of Symmetric Multiprocessing (SMP) and Massively Parallel Processing (MPP) to create an appliance purpose-built for analyzing petabytes of data quickly. Every component of the architecture, including the processor, FPGA, memory, and network, is carefully selected and optimized to service data as fast as the physics of the disk allows, while minimizing cost and power consumption. The software orchestrates these components to operate concurrently on the data stream in a pipeline fashion, thus maximizing utilization and extracting the utmost throughput from each MPP node. In addition to raw performance, this balanced architecture delivers linear scalability to more than a thousand processing streams executing in parallel, while offering a very economical total cost of ownership.

Platform for advanced analytics

The principles of MPP and processing data close to the source are equally applicable to advanced analytics on large data sets. The PureData System for Analytics appliances simply process on a massively parallel scale complex algorithms expressed in languages other than SQL, with none of the intricacies typical of parallel and grid programming. Running analytics of any complexity on stream against huge data volumes eliminates the delays and costs incurred moving data to separate hardware. It accelerates performance by orders of magnitude, making the PureData System for Analytics the ideal platform to converge data warehousing with advanced analytics.

Appliance simplicity

By automating and streamlining day-to-day operations, the Netezza architecture shields users from the underlying complexity of the platform. Simplicity rules whenever there is a design tradeoff with any other aspect of the appliance. Unlike other solutions, it just runs, handling demanding queries and mixed workloads with blistering speed, without the tuning required by other systems. Even normally time-consuming tasks such as installation, upgrades, and ensuring high availability and business continuity are vastly simplified, saving precious time and resources.

Accelerated innovation and performance improvements

One of the key goals of the Netezza architecture is to deliver price-performance improvements and innovative functionality faster than competing technologies over the long run. While the use of open, blade-based components allows the Netezza architecture to incorporate technology enhancements very quickly, the turbocharger effect of the FPGA, a balanced hardware configuration, and tightly coupled intelligent software combine to deliver overall performance gains far greater than those of individual elements. In fact, the Netezza platform has delivered more than four times performance improvement every two years (double that of Moore’s Law) since its introduction.

Moore’s law: Gordon Moore, Intel co-founder, observed in 1965 that the number of transistors on a chip was doubling about every year, and in 1975 he revised the expected rate to a doubling about every two years. Software applications generally rely on these processor improvements to accelerate performance over time. (a)

a. “Cramming more components onto integrated circuits,” Gordon Moore, Electronics, Volume 38, Number 8, 19 April 1965.

Flexible configurations and extreme scalability

PureData System for Analytics scales modularly from a few hundred gigabytes to petabytes of queryable user data. The system architecture serves the needs of different segments of the data warehouse and analytics market. The use of open blade-based components allows the disk-processor-memory ratio to be easily modified in configurations that cater to performance- or storage-centric requirements. The same architecture also supports memory-based systems that provide extremely fast, real-time analytics for mission-critical applications.

System building blocks

A major part of the PureData System for Analytics performance advantage comes from its unique AMPP architecture (shown in Figure 1), which combines an SMP front end with a shared nothing MPP back end for query processing. Each component of the architecture is carefully chosen and integrated to yield a balanced overall system. Every processing element operates on multiple data streams, filtering out extraneous data as early as possible. More than a thousand of these customized MPP streams work together to divide and conquer the workload.

Figure 1 AMPP architecture

Let’s examine the key building blocks of the appliance:

- PureData System for Analytics hosts
  The SMP hosts are high-performance Linux servers set up in an active-passive configuration for high availability. The active host presents a standardized interface to external tools and applications. It compiles SQL queries into executable code segments called snippets, creates optimized query plans, and distributes the snippets to the MPP nodes for execution.

- Snippet Blades (S-Blades)
  S-Blades are intelligent processing nodes that make up the turbocharged MPP engine of the appliance. Each S-Blade is an independent server containing powerful multi-core CPUs, multi-engine FPGAs, and gigabytes of RAM, all balanced and working concurrently to deliver peak performance. The CPU cores are designed with ample headroom to run complex algorithms against large data volumes for advanced analytics applications.

- Disk enclosures
  The disk enclosures’ high-density, high-performance disks are RAID protected. Each disk contains a slice of every database table's data. A high-speed network connects disk enclosures to S-Blades, allowing all the disks in the system to simultaneously stream data to the S-Blades at the maximum rate possible.

- Network fabric
  A high-speed network fabric connects all system components. The PureData System for Analytics runs a customized IP-based protocol that fully utilizes the total cross-sectional bandwidth of the fabric and eliminates congestion even under sustained, bursty network traffic. The network is optimized to scale to more than a thousand nodes, while allowing each node to initiate large data transfers to every other node simultaneously.

Note: All system components are redundant. While the hosts are active-passive, all other components in the appliance are hot swappable. User data is fully mirrored, enabling better than 99.99% availability.

Where extreme performance happens: Inside an S-Blade

Commodity components and Netezza technology software combine to extract the utmost throughput from each MPP node. A dedicated high-speed interconnect from the storage array delivers data to memory as quickly as each disk can stream. Compressed data is cached in memory using a smart algorithm, which ensures that the most commonly accessed data is served right out of memory instead of requiring a disk access. FAST Engines (shown in Figure 2) running in parallel inside the FPGAs uncompress and filter out 95–98% of table data at physics speed, keeping only data needed to answer the query. The remaining data in the stream is processed concurrently by CPU cores, also running in parallel. The process is repeated on more than a thousand of these parallel Snippet Processors.

Figure 2 Inside S-Blade
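The data-reduction figures quoted for the S-Blade can be combined into a rough back-of-the-envelope estimate. The sketch below is only an illustration: it assumes the 4 to 8 times compression ratio and the 95–98% filtering rate quoted in this guide apply to the same stream and are independent of each other, which the guide does not state. The function name is hypothetical.

```python
def cpu_bytes_per_disk_byte(compression_ratio: float, filtered_out: float) -> float:
    """Estimate how many uncompressed bytes reach the CPU for each byte read from disk.

    compression_ratio: uncompressed bytes produced per compressed byte on disk.
    filtered_out: fraction of table data discarded by the FAST engines.
    Both inputs are assumptions taken from the ranges quoted in the text.
    """
    return compression_ratio * (1.0 - filtered_out)


# Best case for the CPU: low expansion and aggressive filtering.
low = cpu_bytes_per_disk_byte(4, 0.98)
# Worst case for the CPU: high expansion and less filtering.
high = cpu_bytes_per_disk_byte(8, 0.95)

print(f"CPU sees roughly {low:.2f} to {high:.2f} bytes per byte read from disk")
```

Under these assumptions, each byte read from disk stands for 4 to 8 bytes of table data, yet well under one byte of it reaches the CPU cores.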

Turbocharging the S-Blades: The power of Netezza technologies FAST engines

The FPGA is a critical enabler of the price-performance advantages of the PureData System for Analytics platform. Each FPGA contains embedded engines that perform filtering and transformation functions on the data stream. These FAST engines (shown in Figure 3) are dynamically reconfigurable, allowing them to be modified or extended through software. They are customized for every snippet through parameters provided during query execution and act on the data stream delivered by a Direct Memory Access (DMA) module at extremely high speed.

Figure 3 Netezza technologies FAST engines

FAST engines include:

- The Compress engine, a Netezza innovation boosting system performance by a factor of 4 to 8 times. The engine uncompresses data at wire speed, instantly transforming each block on disk into 4 to 8 blocks in memory. The result is a significant speedup of the slowest component in any data warehouse, the disk.
- The Project and Restrict engines, which further increase performance by filtering out columns and rows respectively, based on the parameters in the SELECT and WHERE clauses in a SQL query.
- The Visibility engine, which plays a critical role in maintaining Atomicity, Consistency, Isolation, and Durability (ACID) compliance at streaming speeds in the system. It filters out rows that should not be seen by a query; for example, rows belonging to a transaction that is not yet committed.

The FAST engines provide an extensible framework for innovative future functions to be added through updates to the system. These new functions promise further improvement in system performance, security, and reliability.
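The roles of the Visibility, Restrict, and Project engines can be illustrated in software. The following is a minimal sketch, not the FPGA implementation: it models each engine as a Python generator over a row stream, and the names `Row`, `txn_id`, `visibility`, `restrict`, `project`, and `run_snippet` are all hypothetical. The order in which the engines are chained is also an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Row:
    values: dict   # column name -> value
    txn_id: int    # hypothetical: id of the transaction that wrote the row


def visibility(rows: Iterable[Row], committed: set) -> Iterator[Row]:
    """Drop rows written by transactions that are not yet committed."""
    for row in rows:
        if row.txn_id in committed:
            yield row


def restrict(rows: Iterable[Row], predicate: Callable[[dict], bool]) -> Iterator[Row]:
    """Keep only rows that satisfy the WHERE-style predicate."""
    for row in rows:
        if predicate(row.values):
            yield row


def project(rows: Iterable[Row], columns: list) -> Iterator[dict]:
    """Keep only the columns named in the SELECT list."""
    for row in rows:
        yield {name: row.values[name] for name in columns}


def run_snippet(rows, committed, predicate, columns) -> list:
    """Chain the three filters so each row flows through them in one pass."""
    return list(project(restrict(visibility(rows, committed), predicate), columns))


table = [
    Row({"name": "bob", "region": "east", "amount": 10}, txn_id=1),
    Row({"name": "jim", "region": "west", "amount": 20}, txn_id=1),
    Row({"name": "bob", "region": "east", "amount": 30}, txn_id=2),  # uncommitted
]
result = run_snippet(table, committed={1},
                     predicate=lambda v: v["name"] == "bob",
                     columns=["name", "amount"])
print(result)
```

Because the stages are generators, no intermediate copy of the table is built; rows that fail an early stage never reach the later ones, which mirrors the idea of discarding data as early in the stream as possible.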

Orchestrating queries on PureData System for Analytics

The PureData System for Analytics hardware components and intelligent system software are closely intertwined. The software (shown in Figure 4) is designed to fully exploit the hardware capabilities of the appliance and incorporates numerous innovations to offer exponential performance gains, whether for simple inquiries, complex ad-hoc queries, or deep analytics. In this section, we examine the intelligence built into the system every step of the way.

Figure 4 Software architecture

PureData System for Analytics software components include:

- A sophisticated parallel optimizer that transforms queries to run more efficiently and ensures that each component in every processing node is fully utilized
- An intelligent scheduler that keeps the system running at its peak throughput, regardless of workload
- Turbocharged Snippet Processors that efficiently execute multiple queries and complex analytics functions concurrently
- A smart network that makes moving large amounts of data through the Netezza system a breeze

Let’s see how these elements work together, starting when a user submits a query. Technology-savvy readers will see that PureData System for Analytics processes queries very differently than other data warehouse systems.

Make an optimized query plan

The host compiles the query and creates a query execution plan optimized for the AMPP architecture. The intelligence of the optimizer is one of the system’s greatest strengths. The optimizer makes use of all the MPP nodes in the system to gather detailed, up-to-date statistics on every database table referenced in a query. A majority of these metrics are captured during query execution with very low overhead, yielding just-in-time statistics that are individualized per query. The appliance nature of the system, with integrated components able to communicate with each other, allows the cost-based optimizer to more accurately measure disk, processing, and network costs associated with an operation. By relying on accurate data rather than heuristics alone, the optimizer is able to generate query plans that utilize all components with extreme efficiency.

Intelligence in the optimizer (calculating join order): One example of optimizer intelligence is the ability to determine the best join order in a complex join. For example, when joining multiple small tables to a large fact table, the optimizer can choose to broadcast the small tables in their entirety to each of the S-Blades, while keeping the large table distributed across all Snippet Processors. This approach minimizes data movement while taking advantage of the AMPP architecture to parallelize the join.

By using these statistics to transform queries before processing begins, the optimizer minimizes disk I/O and data movement, the two factors slowing performance in a data warehouse system. Transforming operations performed by the optimizer include:

- Determining correct join order
- Rewriting expressions
- Removing redundancy from SQL operations

Convert it to snippets

The compiler converts the query plan into executable code segments, called snippets, which are query segments executed by Snippet Processors in parallel across all the data streams in the appliance. Each snippet has two elements:

- Compiled code executed by individual CPU cores
- A set of FPGA parameters to customize the FAST engines’ filtering for that particular snippet

This snippet-by-snippet customization allows the PureData System for Analytics to provide, in effect, a hardware configuration optimized on the fly for individual queries.

Intelligence in the compiler (the object cache): The host uses a feature called the object cache to further accelerate query performance. This is a large cache of previously compiled snippet code that supports parameter variations. For example, a snippet with the clause where name = ‘bob’ might use the same compiled code as a snippet with the clause where name = ‘jim’, but with settings that reflect the different name. This approach eliminates the compilation step for over 99% of snippets.

Schedule them to run at just the right moment

The Netezza scheduler (shown in Figure 5) balances execution across complex workloads to meet the objectives of different users, while maintaining maximum utilization and throughput. It considers a number of factors, including query priority, size, and resource availability, in determining when to execute snippets on the S-Blades. The scheduler uses the appliance architecture to gather up-to-date and accurate metrics about resource availability from each component of the system. Using sophisticated algorithms, the scheduler maximizes system throughput by using close to 100% of the disk bandwidth and ensuring that memory and network resources are not overloaded, a common cause of thrashing for other, less efficient systems. This is an important characteristic of the PureData System for Analytics, ensuring the system keeps performing at peak throughput even under very heavy loads.
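The scheduling idea described above, admitting work only while no resource would be overloaded, can be sketched as a simple admission loop. This is an illustrative sketch under stated assumptions, not the Netezza scheduler's actual algorithm: the `Snippet` fields, the three resource names, the normalised capacities of 1.0, and the "higher priority first" rule are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    name: str
    priority: int    # assumed convention: larger number runs first
    disk: float      # requested share of disk bandwidth (0..1)
    memory: float    # requested share of memory (0..1)
    network: float   # requested share of network bandwidth (0..1)


def schedule(pending, capacity=None):
    """Admit snippets in priority order while every resource still has room.

    Returns (admitted, deferred) lists of snippet names. A snippet that does
    not fit is deferred instead of being allowed to overload a resource.
    """
    free = dict(capacity or {"disk": 1.0, "memory": 1.0, "network": 1.0})
    admitted, deferred = [], []
    for snippet in sorted(pending, key=lambda s: -s.priority):
        need = {"disk": snippet.disk, "memory": snippet.memory, "network": snippet.network}
        # Small tolerance so that floating-point rounding does not reject an exact fit.
        if all(need[r] <= free[r] + 1e-9 for r in free):
            for r in free:
                free[r] -= need[r]
            admitted.append(snippet.name)
        else:
            deferred.append(snippet.name)
    return admitted, deferred


pending = [
    Snippet("q1", priority=2, disk=0.6, memory=0.3, network=0.2),
    Snippet("q2", priority=1, disk=0.5, memory=0.3, network=0.2),
    Snippet("q3", priority=1, disk=0.4, memory=0.3, network=0.2),
]
admitted, deferred = schedule(pending)
print("run now:", admitted, "wait:", deferred)
```

In this example the disk is the scarce resource: the second snippet would push disk usage past capacity, so it waits, while the third one fits and fills the remaining bandwidth.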

When the scheduler gives the green light, the snippet is broadcast to all Snippet Processors through the intelligent network fabric.

Figure 5 Intelligence in the Scheduler: no resource overloading

Execute them in parallel

Each Snippet Processor on every S-Blade now has the instructions it needs to execute its portion of the snippet. In addition to the host scheduler, the Snippet Processors have their own smart preemptive scheduler that allows snippets from multiple queries to execute simultaneously. The scheduler takes into account the priority of the query and the resources set aside for the user or group that issued it to decide when and for how long to schedule a particular snippet for execution. When that instant arrives, it’s show time:

1. The processor core on each Snippet Processor configures the FAST engines with parameters contained in the query snippet and sets up a data stream.
2. The Snippet Processor reads table data from the disk array into memory, utilizing a Netezza technology innovation called ZoneMap acceleration to reduce disk scans. The Snippet Processor also interrogates the cache before accessing the disk for a data block, avoiding a scan if the data is already in memory.
3. The FPGA then acts on the data stream. It first accelerates the data stream by a factor of up to 4 to 8 times by uncompressing the data stream at wire speed.
4. The FAST engines then filter out any data not relevant to the query. The remaining data streams back to memory for concurrent processing by the CPU core. This data is typically a tiny fraction (2–5%) of the original stream, greatly reducing the execution time required by the processor core.
5. The processor core picks up the data stream and performs core database operations such as sorts, joins, and aggregations. It also applies complex algorithms embedded in the Snippet Processor for advanced analytics processing.
6. Results from each Snippet Processor are assembled in memory to produce a sub-result for the entire snippet. This process is repeated simultaneously across more than a thousand Snippet Processors, with hundreds or thousands of query snippets executing in parallel.
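The ZoneMap acceleration used in step 2 can be sketched as follows. This is a minimal illustration, assuming that the system records the minimum and maximum value of a column for each storage block and skips any block whose range cannot overlap the query range; the `Extent` type, the block size, and the function names are hypothetical and do not describe the product's internal layout.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    min_week: int   # smallest week value stored in this block
    max_week: int   # largest week value stored in this block
    rows: list      # (week, value) tuples


def build_extents(rows, rows_per_extent):
    """Split rows into fixed-size blocks and record each block's min/max week."""
    extents = []
    for start in range(0, len(rows), rows_per_extent):
        chunk = rows[start:start + rows_per_extent]
        weeks = [week for week, _ in chunk]
        extents.append(Extent(min(weeks), max(weeks), chunk))
    return extents


def scan(extents, lo, hi):
    """Return matching rows and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for extent in extents:
        if extent.max_week < lo or extent.min_week > hi:
            continue  # the block's range cannot contain a match, so skip it
        blocks_read += 1
        matches.extend(row for row in extent.rows if lo <= row[0] <= hi)
    return matches, blocks_read


# 100 weeks of records, loaded in week order, 10 records per week.
rows = [(week, n) for week in range(1, 101) for n in range(10)]
extents = build_extents(rows, rows_per_extent=10)
matches, blocks_read = scan(extents, lo=42, hi=42)
print(f"{len(matches)} rows found, {blocks_read} of {len(extents)} blocks read")
```

The benefit depends on the rows arriving in a natural order, as weekly loads do here. If the same rows were stored in random order, most blocks would span a wide range of weeks and few could be skipped.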

ZoneMap acceleration (the PureData System for Analytics anti-index): ZoneMap acceleration exploits the natural ordering of rows in a data warehouse to accelerate performance by orders of magnitude. The technique avoids scanning rows with column values outside the start and end range of a query. For example, if a table contains two years of weekly records (~100 weeks) and a query is looking for data for only one week, ZoneMap acceleration can improve performance up to 100 times. Unlike indexes, ZoneMaps are automatically created and updated for each database table, without incurring any administrative overhead.

Return the results

All Snippet Processors now have snippet results that must be assembled. The Snippet Processors use the intelligent network fabric to communicate flexibly with the host and with each other to perform intermediate calculations and aggregations.

Intelligence in the network (predictable performance and scalability): The PureData System for Analytics custom network protocol is designed specifically for the data volumes and traffic patterns associated with high volume data warehousing. This protocol ensures maximum utilization of the network bandwidth without overloading it, allowing predictable performance close to the line rate.

Traffic flows smoothly in three distinct directions:

- From the host to the Snippet Processors (1 to 1000+) in broadcast mode
- From Snippet Processors to the host (1000+ to 1), with aggregation in the S-Blades and at the system rack level
- Between Snippet Processors (1000+ to 1000+), with data flowing freely on a massive scale for intermediate processing

The host assembles the intermediate results received from the Snippet Processors, compiles the final result set and returns it to the user's application. Meanwhile, other queries are streaming through the system at various stages of completion.

Summary

The best solutions are not necessarily the biggest or most expensive; they are the ones that have the smartest design. The PureData System for Analytics exploits the inherent advantage that streaming processing provides over the traditional computing architectures used by other analytic and data warehousing systems. The result is a compact appliance with performance dwarfing that of much larger systems, with blinding speed for running complex algorithms against huge data volumes and the mixed workloads created by thousands of concurrent users.

Processing performance is complemented by other capabilities that make the IBM solution a unique platform to help businesses succeed, including:

- Simplicity of use
  The PureData System for Analytics is self-managed, as an appliance should be, and is always running at its peak throughput. The system software ensures that without human intervention.

- Better decisions across the enterprise
  Embedded functions bring a new generation of analytics into the database with minimum development effort. There is no need for separate server hardware or time lost in massive data transfers, just lightning-fast results and the ability to bring crucial business intelligence to everyone who could benefit, in all sectors of an organization.

- Agility for the future
  The system is built not just for today's challenges, but for years to come, scaling linearly to petabytes of user data and with performance acceleration far beyond the conventional speed-up governed by Moore’s Law.

PureData System for Analytics allows you and your company to make decisions with maximum clarity while taking performance for granted. But do not just take our word for it. The best way to appreciate PureData System for Analytics is to see it in action. We think you will agree there is simply nothing else like it for making the most of your data.

Other resources for more information

For additional information, refer to the IBM PureSystems d analytics.html

The author who wrote this guide

This guide was produced by a specialist working with the International Technical Support Organization (ITSO).

Phil Francisco is Vice President, Data Management Products and Strategy, IBM. He has over 25 years of valuable experience in technology development and global technology marketing. As Vice President of Data Management Products and Strategy, he currently directs the product portfolio and strategy for all database software and PureData system products for the IBM Information Management division.

Previously Mr. Francisco was Vice President of Product Marketing and Product Management for the IBM PureData System for Analytics and Netezza products, a role he held both prior and subsequent to the acquisition of Netezza by IBM. Prior to Netezza, he held Vice President of Marketing and Product Management roles at PhotonEx and Lucent Technologies’ Optical Networking Group. In addition, he has more than 10 years of experience in software, hardware, and systems engineering at AT&T/Lucent Bell Laboratories. Mr. Francisco holds a patent in advanced optical network architectures. He earned his Master’s Degree in Electrical Engineering from Stanford University and completed the Advanced Management Program at the Fuqua School of Business at Duke University. He received B.S.E. degrees in Electrical Engineering and Computer Science from the Moore School of Electrical Engineering at the University of Pennsylvania.

Thanks to the following people for their contributions to this project:

Stephanie Caputo
IBM Software Group, Information Management

Jim Tuckwell
IBM Software Group, Information Management

LindaMay Patterson
International Technical Support Organization, Rochester Center

Now you can become a published author, too!

Here’s an opportunity to spotlight your skills, grow your career, and become a published author, all at the same time! Join an ITSO residency project and help write a book in your area of expertise, while honing your experience using leading edge technologies. Your efforts will help to increase product acceptance and customer satisfaction, as you expand your network of technical contacts and relationships. Residencies run from two to six weeks in length, and you can participate either in person or as a remote resident working from your home base.

Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html

Stay connected to IBM Redbooks publications

- Find us on Facebook:
  http://www.facebook.com/IBMRedbooks
- Follow us on Twitter:
  http://twitter.com/ibmredbooks
- Look for us on LinkedIn:
  http://www.linkedin.com/groups?home=&gid=2130806
- Explore new IBM Redbooks publications, residencies, and workshops with the IBM Redbooks weekly sf/subscribe?OpenForm
- Stay current on recent Redbooks publications with RSS Feeds:
  http://www.redbooks.ibm.com/rss.html

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this
