Hadoop And HP Vertica Analytics Platform


Business white paper

Make all your information matter

Hadoop and HP Vertica Analytics Platform

Table of contents

- Executive summary
- The Big Data problem
- Complementary analytics platforms
- The best of both worlds: HP Vertica Analytics Platform and Hadoop together
- Consider these real-world use cases
- Use cases with HBase and HP Vertica Analytics Platform
- Strengths and limitations of popular Hadoop components
- Key takeaways
- To learn more

Executive summary

HP Vertica Analytics Platform and Hadoop are highly complementary systems for Big Data analytics. The HP Vertica Analytics Platform is ideal for interactive real-time analytics, and the Hadoop open-source platform is well suited for batch-oriented data processing.

When used together, the HP Vertica Analytics Platform and Hadoop provide your organization with a powerful set of data analytics capabilities that can do far more than either platform could do on its own. This combination enables your enterprise to extract significantly higher levels of value from massive amounts of structured, unstructured, and semi-structured data.

The Big Data problem

Volume, velocity, and variety create complexity

In today’s information-driven world, your enterprise faces an onslaught of structured, unstructured, and semi-structured data. While conventional business systems continue to swell in size, storage environments are being hit with a hurricane of social media content, audio and video files, email, text messages, image files, documents, transactional information, and more.

In one case in point, large communications networks and their associated switches, billing systems, and service departments generate hundreds of millions of individual call detail records (CDRs) daily. These terabytes of dynamic customer data will continue to grow exponentially as carriers add new services and as IP-based traffic increases.

Already, the number of subscribers to mobile, fixed-line, and cable communications services is growing by millions of people every year. And the volume of CDRs, Internet Protocol Detail Records (IPDRs), subscriber profile information, network probe data, and machine-to-machine data that communications companies must store and analyze is expected to grow by 12 to 13% per year.[1]

To gain value from massive amounts of data, your enterprise needs powerful Big Data analytics tools. These tools go far beyond the capabilities of traditional relational database management systems (RDBMSs), which were designed for online transaction processing (OLTP) and structured data, not for the volume, velocity, and variety of a world of Big Data.

With these needs in mind, hundreds of organizations are deploying HP Vertica Analytics Platform for interactive real-time analytics and the Hadoop open-source platform for batch-oriented data processing. This combination of complementary data toolsets enables your enterprise to unify your structured, unstructured, and semi-structured data, and make all your information matter.

[1] Source: “Market trends: Big Data opportunities in vertical industries,” Gartner, July 2012.

Complementary analytics platforms

Two platforms purpose-built for Big Data

HP Vertica Analytics Platform and Hadoop are highly complementary analytics platforms. Both are modern, scalable, massively parallel processing (MPP) systems built for commodity hardware and low-cost processing of Big Data. While they have some overlapping capabilities, each of the platforms offers unique features that help your organization capitalize on the full range of your data.

HP Vertica Analytics Platform is a real-time analytics database platform purpose-built for Big Data. It consists of a massively parallel database and an extensible analytics framework optimized for fast data analysis, scaling from gigabytes to petabytes. Additionally, HP Vertica Analytics Platform supports standards like SQL, JDBC, ODBC, and R for data analysis. This standards-based versatility makes it easier for your organization to preserve your existing business intelligence (BI) investments.

The Hadoop Distributed File System (HDFS), in turn, is an open-source distributed file system that can serve as an effective storage ground for large amounts of data. Hadoop is extremely efficient at loading any type of data: structured, unstructured, or semi-structured. Hadoop is also well suited for batch processing where immediate interactive analytics are not required.

Complementary, not competitive

HP Vertica Analytics Platform is custom built for high-performance analytics. It is orders of magnitude more efficient than Hadoop in highly analytical use cases. In an internal benchmark comparison, Counting Triangles, the HP Vertica Analytics Platform was 40 times faster than a comparable Hadoop program and 22 times faster than a program written in Pig, a framework that provides a higher-level language to increase developer productivity in Hadoop.

This level of performance can significantly reduce the time required to extract knowledge from your data, creating more business opportunities to monetize your data. That point was underscored in a study that pitted parallel DBMSs against the Hadoop MapReduce framework on a variety of tasks.[2] The study, MapReduce and Parallel DBMSs: Friends or Foes?, reported results on three tasks, shown in Figure 1.

[Figure 1: Benchmark performance on a 100-node cluster, comparing Hadoop and HP Vertica on the Grep task, the Web log task, and the Join task]

Unlike Hadoop, HP Vertica Analytics Platform is a next-generation analytical database platform with standard SQL and ACID transaction support, combined with much more advanced analytics and processing capabilities.[3] Hadoop is not a database. HP Vertica Analytics Platform also supports popular ecosystems for business intelligence, ETL (extract, transform, load) data warehouses, and data management, including Cognos, MicroStrategy, Tableau, and others.

In general, Hadoop is well suited for long-running batch mode data processing and some analytics, while HP Vertica Analytics Platform is designed purposefully for interactive and real-time analytics as well as data processing.

[2] Stonebraker, M., Abadi, D., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., and Rasin, A. MapReduce and Parallel DBMSs: Friends or Foes? In Communications of the ACM, January 2010. cm2010.pdf

[3] For detailed information on the analytics and processing capabilities of HP Vertica Analytics Platform, visit: vertica.com/the-analytics-platform/.
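To make the Counting Triangles comparison mentioned above more concrete, here is a minimal sketch of how triangle counting can be written as a SQL self-join. It is an illustration only, not the benchmark code: it runs on Python's built-in sqlite3 module as a small stand-in for an analytic database, and the table name `edges`, its columns, and the function name `count_triangles` are assumptions made for this example.

```python
import sqlite3


def count_triangles(edges):
    """Count distinct triangles in an undirected graph given as (a, b) pairs.

    sqlite3 is only a stand-in here; table and column names are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
    # Store each undirected edge once, oriented low -> high, without duplicates
    # or self-loops, so every triangle a < b < c is matched exactly once.
    oriented = {(min(a, b), max(a, b)) for a, b in edges if a != b}
    conn.executemany("INSERT INTO edges VALUES (?, ?)", sorted(oriented))
    # Triangle a < b < c corresponds to the rows (a, b), (b, c), and (a, c).
    (count,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM edges e1
        JOIN edges e2 ON e1.dst = e2.src
        JOIN edges e3 ON e3.src = e1.src AND e3.dst = e2.dst
        """
    ).fetchone()
    conn.close()
    return count
```

The point of the sketch is that the whole computation fits in one declarative query, whereas the equivalent MapReduce program would typically need custom code for each join step.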

A few caveats about Hadoop

Consider total cost of ownership

Many technology teams, frustrated with the high costs and limitations of traditional row-oriented data warehouses and ETL infrastructures, see Hadoop as a potential replacement for their entire data warehouse infrastructure. Like many open-source solutions, Hadoop, at least at first glance, appears to be a solution for every problem. However, Hadoop has its limitations, including those that come with a document-oriented, batch-oriented system.

When you deploy Hadoop into production, maintenance requirements and programming demands can rapidly overwhelm any potential cost advantages. A Hadoop ecosystem typically includes a team of developers who specialize in massively parallel development and create custom code for queries, driving up your total cost of ownership.

Hadoop should be viewed as one step toward an enhanced analytic infrastructure. While it is effective for exploration, Hadoop, unlike HP Vertica Analytics Platform, is not optimized for analysis or performance. In addition, Hadoop creates multiple integration points that can result in a fragile flow of data. Adding new flows is often time-consuming and costly. What’s more, with Hadoop, analytic cycles (the timeframe from receiving raw data to making effective decisions) tend to be longer than they should be. None of the Hadoop tools solves the problem of real-time, ad hoc access to data for your analysts, who specialize in understanding the data, not just the code.

Given its limitations, a Hadoop-only deployment should be thought of as a short-term solution targeted at certain needs, such as exploratory analysis of unstructured data at a low initial cost for use in a lab setting when applying structure to semi-structured data (e.g., web logs).
Hadoop’s greatest strengths emerge when it is paired with a higher-end analytics platform that delivers the performance and speed needed to keep a business competitive.

In many use cases, HP Vertica Analytics Platform delivers higher performance with less hardware, with less administrative complexity, and with standard SQL and no custom coding, so you don’t need to be a developer to write a query. These factors can equate to a significantly lower total cost of ownership when you choose the HP Vertica Analytics Platform for certain use cases (see the use cases below).

An analytics hub

With the release of HP Vertica 6, HP took Big Data analytics to a new level of usability and performance. Your enterprise can now locate data where it makes sense and access it through your interfaces of choice, including standard SQL, business intelligence tools, or advanced analytics languages, such as R. This universal data access framework, combined with the platform’s massively parallel architecture, enables your organization to gain richer insights into all your data in a much shorter timeframe: minutes, not weeks or months.

The best of both worlds: HP Vertica Analytics Platform and Hadoop together

Using Hadoop and HP Vertica Analytics Platform together gives you more value than you could realize using either of the platforms separately. You get the best of both worlds. And HP makes it easy to connect the two platforms.

Connecting the platforms

To enable the integration of the two platforms, HP offers connectors that allow you to seamlessly move data back and forth between Hadoop and HP Vertica Analytics Platform. With the release of HP Vertica 6, HP helps your organization accelerate your Hadoop environment by using HP Vertica Analytics Platform analytics capabilities broadly across your Hadoop systems. With these connectors in place, your users can choose to load the data upfront or at query time, an invaluable capability for data scientists pursuing data exploration.

HP provides several ways to use Hadoop and Vertica in a complementary manner, with support for:

- MapReduce: If you choose to use MapReduce programming, HP provides a bi-directional Hadoop connector that acts as a source and sink for your MapReduce jobs.
- Sqoop: A JDBC connector also works with Sqoop, a tool that enables you to transfer bulk data between Hadoop and your databases.
- Hadoop Distributed File System (HDFS): HP provides an HDFS connector that allows you to directly load data from HDFS into HP Vertica Analytics Platform tables using the copy command. HP Vertica Analytics Platform also supports external tables that can directly load data from your HDFS per query. This allows you to use HP Vertica SQL and analytics directly on your HDFS data.

Joint use cases

You can leverage the relative strengths of the two platforms for several use cases.
In general, Hadoop is suitable for batch mode analysis and HP Vertica Analytics Platform for interactive analytics. Here are a few such use cases:

- Hadoop for ETL and HP Vertica Analytics Platform for analytics: Convert unstructured or semi-structured logs into a structured format (relational tuples) for analysis with HP Vertica Analytics Platform. In this scenario, Hadoop serves primarily as an ETL tool and HP Vertica Analytics Platform as the data analytics engine.
- HDFS for storage and HP Vertica Analytics Platform plus Hadoop for analytics: Run real-time analytics on HP Vertica Analytics Platform to capitalize on the speed and the performance of the analytics platform. Long-running and exploratory analytics run on Hadoop, relying on the fault tolerance of the Hadoop platform. This scenario enables you to load data from HDFS directly to HP Vertica Analytics Platform and provide HP Vertica SQL access to HDFS, again using Hadoop primarily for data storage (or a data “parking lot”) and HP Vertica Analytics Platform for fast analysis.
- HP Vertica Analytics Platform for storage and analytics and Hadoop as a multi-purpose tool: A less common use case is to use HP Vertica Analytics Platform primarily for data storage, and take advantage of Hadoop’s capabilities beyond MapReduce, such as scheduler and load balancing, data conversion tools for other formats (for example, STATA), and backup for HDFS via Sqoop.

Overlapping use cases

HP Vertica Analytics Platform handles a range of use cases just as well as or better than Hadoop. In many cases, HP Vertica Analytics Platform requires less hardware and less administrative complexity in delivering higher performance. Also, HP Vertica Analytics Platform uses standard SQL, as opposed to custom coding, to analyze data, contributing to an overall lower total cost of ownership.
That said, in some use cases, either HP Vertica Analytics Platform or Hadoop could be used effectively. Here are a few examples:

- Analyzing logs and machine data: Depending on your preference or development skills, you can use the HP Vertica SDK to write custom C or R code or use Hadoop or Java for log parsing and analyzing machine data. HP also makes parsers available on GitHub for web logs, tag clouds, and more. For any use case that does not require fault tolerance (for example, for long-running analysis), HP Vertica Analytics Platform is typically used.
- Ingesting XML, JSON, and Avro formats: Again, depending on your preference, you can ingest these formats in Hadoop or HP Vertica Analytics Platform. It all depends on where you primarily import and store the data and then perform the real-time analytics with HP Vertica Analytics Platform.
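The "ETL first, SQL analytics second" pattern described in the use cases above can be illustrated at toy scale. The sketch below parses semi-structured web log lines into relational tuples and then answers a question with plain SQL. It is not HP's parser or connector: the log layout (Common Log Format), the regular expression, the `hits` table, and the function names are all assumptions for this example, and Python's built-in sqlite3 module stands in for the analytic database.

```python
import re
import sqlite3

# Assumed input: Common Log Format lines such as
# 10.0.0.1 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """ETL step: turn one raw log line into a relational tuple, or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # skip lines that do not fit the assumed format
    size = 0 if match.group("size") == "-" else int(match.group("size"))
    return (
        match.group("host"),
        match.group("ts"),
        match.group("method"),
        match.group("path"),
        int(match.group("status")),
        size,
    )


def load_and_summarize(lines):
    """Load parsed tuples into a table, then analyze them with SQL."""
    rows = [row for row in map(parse_line, lines) if row is not None]
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hits (host TEXT, ts TEXT, method TEXT, "
        "path TEXT, status INTEGER, size INTEGER)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?)", rows)
    # Analytics step: successful requests and bytes served per path.
    result = conn.execute(
        "SELECT path, COUNT(*) AS n, SUM(size) AS bytes "
        "FROM hits WHERE status = 200 "
        "GROUP BY path ORDER BY n DESC, path"
    ).fetchall()
    conn.close()
    return result
```

Once the data is in tuple form, an analyst can change the question by editing the query alone, without touching the parsing code. In a real deployment the parsing step would run as a batch job and the table would live in the analytics platform.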

Consider these real-world use cases

Here are some real-world examples, based on how organizations are using the HP Vertica Analytics Platform and Hadoop today. These examples show how your organization could use HP Vertica Analytics Platform and Hadoop in a complementary manner.

Processing social video events

A social video company uses Hadoop for batch processing of logs and HP Vertica Analytics Platform for ETL, ad hoc analytics, and interactive dashboards. In addition, the company uses a key-value (KV) store for serving low-latency data needs. This architecture allows the company to collect and process hundreds of millions of events daily on a petabyte-scale infrastructure.

Accelerating drug discovery

A pharmaceutical company sought to analyze gene variants for improved drug targeting and discovery. The company found its solution in a combination of Hadoop and HP Vertica Analytics Platform, with a few additional supporting tools. It uses Hadoop to find the variants between a sample sequence and a reference genome, and uses HP Vertica Analytics Platform to run structured analysis on very large sets of data to determine oncology targets. In addition, the company uses HDFS for a raw data store and Hadoop/MapReduce for genomic algorithms that aren’t based on structured data.

Delivering digital consumer insights

A digital intelligence company uses HDFS to store raw input behavioral data, Hadoop/MapReduce to find conversions (regular-expressions processing) by determining what type of user clicked on a particular advertisement, and HP Vertica Analytics Platform to store and operationalize high-value business data. In addition, the company’s Big Data solution supports reporting and analytics via Tableau and the R programming language, and it uses custom ETL.
This combination of technologies helps the company achieve faster insights that are delivered more consistently with less administrative overhead and lower-cost, commodity hardware.

Enabling privacy assurance

A company focused on web privacy uses HDFS to collect user privacy reporting requests, MapReduce to process and structure the data into HP Vertica Analytics Platform (ETL), and the platform to analyze statistics for every third-party tag on a website in measuring site performance. Consumers benefit from a free browser plug-in that can tell them who is tracking them. Advertisers, in turn, can provide greater transparency to end users and better understand the impact of third-party tags on website performance.

Use cases with HBase and HP Vertica Analytics Platform

Analyzing Facebook data

You can use the capabilities of HP Vertica Analytics Platform and Hadoop in a complementary manner to gain insights from Facebook data. Say you want to look up the number of users who hit the “like” button for items appearing on a page on Facebook. HBase, the open-source, column-oriented database that relies on Hadoop, is well suited to this task: a single key-value lookup.

But what if you want to analyze all of the recent clicks across Facebook to identify the 20 fastest growing “likes” in the Facebook environment? While nothing in the Hadoop ecosystem lets you conduct analytics over hundreds of millions of rows in a database, HP Vertica Analytics Platform is well suited to the challenge. What’s more, HP Vertica Analytics Platform has a higher-level language that allows your users to express queries easily, while Hadoop typically requires a developer to do the same.

Understanding power usage trends

A power utility benefits from using the complementary capabilities of HP Vertica Analytics Platform and Hadoop to help its customers and engineers understand power usage trends. Hadoop is well suited to “personal analytics” applications that allow customers to retrieve information that provides insights into their power-usage trends over different time periods and different weather conditions. HBase is adept at pulling up small numbers of rows from database tables.

At the same time, utility engineers can run power-usage analytics over the entire body of customer usage data to help them gain insights into optimal configuration of the company’s electric grid. HP Vertica Analytics Platform is ideally suited for this undertaking. It is designed for analytics over petabytes of data.
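The difference between the two access patterns described above, a single key-value lookup versus an aggregate over every row, can be sketched in a few lines. This is an illustration under stated assumptions, not code for HBase or HP Vertica: a Python `Counter` stands in for the key-value store, the built-in sqlite3 module stands in for the analytic database, and the `likes` table, the two-period encoding, and the function names are hypothetical.

```python
import sqlite3
from collections import Counter


def build_like_counts(events):
    """Precompute total likes per item, as a serving store might hold them."""
    return Counter(item_id for item_id, _period in events)


def like_count_lookup(like_counts, item_id):
    """Serving-style access: one key in, one value out."""
    return like_counts.get(item_id, 0)


def fastest_growing(events, top_n=20):
    """Analytic-style access: scan all rows and rank items by growth.

    events holds (item_id, period) pairs, where period is 0 for the
    previous time window and 1 for the current one (an assumed encoding).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE likes (item_id TEXT, period INTEGER)")
    conn.executemany("INSERT INTO likes VALUES (?, ?)", events)
    # Growth = likes in the current window minus likes in the previous one.
    rows = conn.execute(
        """
        SELECT item_id,
               SUM(period = 1) - SUM(period = 0) AS growth
        FROM likes
        GROUP BY item_id
        ORDER BY growth DESC, item_id
        LIMIT ?
        """,
        (top_n,),
    ).fetchall()
    conn.close()
    return rows
```

The lookup touches one key regardless of data size, while the ranking query has to read every event, which is why the second workload benefits from an engine built for large scans and aggregation.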

Strengths and limitations of popular Hadoop components

In addition to HDFS and MapReduce, popular components of the Hadoop ecosystem include HBase, Hive, and Pig. Here is a look at these components, including their strengths and limitations, and how they can complement HP Vertica Analytics Platform.

HBase

Description:
- Open-source, column-oriented database, modeled after BigTable (from Google)

Strengths:
- Provides random access to Hadoop Distributed File System (HDFS) data
- Is independent from batch MapReduce

Limitations:
- No standard SQL support; your existing BI tools will not work
- Designed for workloads with simple key-value or range lookups, not complex analytics
- No support for ACID/Transactions; you cannot use HBase to replace existing database applications without custom coding
- Large hardware footprint and added complexity of the Hadoop stack

As noted in the use cases above, HBase and HP Vertica Analytics Platform can work in a complementary manner. You can use HBase as a serving platform and HP Vertica Analytics Platform for complex analytics and model development. For example, in an advertising use case, you might look up user profiles employed in ad-serving from HBase and use HP Vertica Analytics Platform to do analysis that generates the user profiles.

Hive

Description:
- A tool developed at Facebook to provide SQL-like language access (HQL) to Hadoop

Strengths:
- Provides a SQL-like interface to Hadoop

Limitations:
- Very limited subset of SQL, compared to HP Vertica Analytics Platform
- Fundamentally a Hadoop-based solution, and has similar properties to Hadoop in terms of performance, administration complexity, and a large hardware footprint

HP Vertica external tables over HDFS are a complete replacement for Hive. HP Vertica Analytics Platform offers much more complete SQL and analytic support and much better performance. What’s more, you have an easy migration path into HP Vertica Analytics Pla
