Big Data Analytics: Future Architectures, Skills And .


White Paper
Big Data Analytics: Future Architectures, Skills and Roadmaps for the CIO
September 2011
By Philip Carter
Sponsored by

Brave New World of Big Data

The ‘Big Data Era’ has arrived: multi-petabyte data warehouses, social media interactions, real-time sensory data feeds, geospatial information and other new data sources are presenting organisations with a range of challenges, but also significant opportunities. IDC believes that as CIOs start to adopt the new class of technologies required to process, discover and analyse these massive data sets that cannot be dealt with using traditional databases and architectures, it will become clear that the real value will be derived from the high-end analytics that can be performed on the increasing volume, velocity and variety of data that organisations are generating, in other words Big Data analytics.

One of the key differences between analytics in the traditional mode and what we are dealing with in the Big Data era is that we are gathering data that we may or may not need. From the perspective of analysis, this means ‘we don’t know what we don’t know’; hence the variables and models are likely to be entirely new, requiring a different infrastructure strategy and, perhaps most importantly, new skill sets.

The objective of this white paper is to explore the initial impact that Big Data is having on organisations, particularly on IT departments, which are being forced to re-assess architectures, delivery models and future roadmaps. It will explore the following areas in more detail:

Defining Big Data. This is not in the context of a quantity or threshold that qualifies data as Big Data (this is changing all the time, and will be applied differently depending on the vertical and market segment), but more in terms of a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high-speed capture, discovery and/or analysis.

Hadoop, MapReduce, Key Value Store? There is a lot of hype around the new technologies that are being used by the market to deal with the Big Data phenomenon. We will highlight some of these and their relative importance.

The Value of Big Data in Analytics. The bottom line here is that it is getting more complicated to process and analyse these large and growing data sets, and it essentially requires a re-assessment of the broader information management strategies for the majority of organisations that have started their business analytics journey.

Why Big Data Analytics is Important (and Different). Many have asked the question: what is new with this trend? This section will highlight the traditional use of business analytics in the old ‘pre-Big Data’ world, versus Big Data analytics in the ‘New World’. It will also look at the use cases that IDC expects to be most common across a variety of industries.

The Skill Factor: the Rise of the Data Scientist. With the raft of new technologies and organisational structures that need to be put in place as the Big Data phenomenon becomes a reality, there will be increasing demand for ‘data scientists’: the next-generation analytical professionals who are able to extract information from large data sets and present it as value-added content to non-data experts, and who also have the unique skill of understanding the new models that need to be put in place.

Mapping out the Big Data Analytics Journey. The Big Data analytics journey will be an iterative one, so it is important to map it out in the context of a broader framework. This section aims to do exactly that, and also provides some recommendations to CIOs as they embark on this exciting journey into the brave new world of Big Data analytics.

Situation Overview

The Rise of Business Analytics

Much has been written on how the amount of data in the world is exploding in volume. According to the recent IDC Digital Universe study, the amount of information created and replicated will surpass 1.8 zettabytes (1.8 trillion gigabytes) in 2011, growing by a factor of 9 in just five years.

Big Data is a dynamic that seemed to appear from almost nowhere.
But in reality, Big Data is not new; it is moving into the mainstream and getting a lot more attention. The growth of Big Data is being enabled by inexpensive storage, a proliferation of sensor and data capture technology, increasing connections to information via the cloud and virtualised storage infrastructure, as well as innovative software and analysis tools. It is no surprise then that business analytics as a technology area is rising on the radars of CIOs and line-of-business (LOB) executives. To validate this, in a recent survey of 5,722 end users in the US market, business analytics ranked in the top five IT initiatives of organisations. The key drivers for business analytics adoption remained conservative or defensive. The focus on cost control, customer retention and optimising operations is likely a reflection of the continued economic uncertainty. However, top drivers vary significantly by organisation size and industry. Similarly, IDC surveyed 693 European organisations in February 2011, where 51% of respondents said that BI and analytics are high-priority technologies. In emerging markets such as Asia/Pacific, the focus is very much on capturing the next wave of growth.

According to more than 1000 CIOs and LOB executives who were interviewed as part of the Asia/Pacific C-Suite Barometer in February 2011, business analytics was rated as the number one technology area that would enable their organisations to gain a competitive edge in the year ahead.

Figure 1: The Rise of Business Analytics
Q: You (CIO/CTO) mentioned ‘harnessing ICT to gain competitive advantage’. Which of the following technologies or solutions would be your leading choice to better harness ICT?
Top 5 (bar chart, axis 0 to 35% of respondents): Business intelligence/analytics; Network; Social media/online channel; Collaboration (including video, mobility); Cloud computing/services.
Source: IDC, 2011

With more businesses in Asia investing in IT to ride the hyper-growth wave in emerging markets, they are harnessing analytics-led solutions to gain better customer insights, manage risk and financial metrics more effectively, and at the same time strive for unique market differentiation. Historically, organisations have made significant investments in applications with the objective of automating business processes and capturing data to improve operational efficiency.
Many of these projects are still ongoing, but what is becoming increasingly clear to the senior management of these entities is that they (and their business managers) have not been able to get the right information (mainly due to poorly integrated systems and questionable data quality) at the right time (due to performance and scalability issues) to the right stakeholders within their organisations for the critical decision-making needed to drive the necessary business impact. And where they are unable to do this, the line of business is procuring and deploying its own solutions in a new wave of ‘shadow IT’ investments focusing on business analytics, thereby forcing CIOs to re-examine these issues with a specific focus on driving better IT-business alignment. These shifts are taking place even without the ‘Big Data’ dynamic in the picture, which, when added, creates the ‘perfect storm’ for Big Data analytics to take centre stage.
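As a quick check on the Digital Universe figures quoted earlier in this section, the gigabyte figure can be converted to zettabytes and the ninefold growth over five years turned into an implied annual rate. This is a back-of-envelope sketch: it assumes decimal (SI) units and smooth compound growth, neither of which is stated by the study, and the function names are illustrative.

```python
# Assumption: SI (decimal) units, not binary zebibytes/gibibytes.
ZETTABYTE_IN_BYTES = 10**21
GIGABYTE_IN_BYTES = 10**9


def gigabytes_to_zettabytes(gigabytes: float) -> float:
    """Convert a volume in gigabytes to zettabytes."""
    return gigabytes * GIGABYTE_IN_BYTES / ZETTABYTE_IN_BYTES


def implied_annual_growth(total_factor: float, years: int) -> float:
    """Compound annual growth rate implied by a total growth factor.

    Assumes the same growth rate every year, which is a simplification.
    """
    return total_factor ** (1 / years) - 1


volume_zb = gigabytes_to_zettabytes(1.8e12)  # 1.8 trillion gigabytes
annual_rate = implied_annual_growth(9, 5)    # factor of 9 in five years

print(f"{volume_zb:.1f} ZB, roughly {annual_rate:.0%} growth per year")
```

Under these assumptions, 1.8 trillion gigabytes is 1.8 zettabytes, and a factor of 9 over five years corresponds to growth of roughly 55% per year.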

A Note on Terminology: BI or Analytics?

We have some challenges when defining and using terminology for business analytics. Because the BI market is mature, many terms have been around for a long time and have either become obsolete or have been redefined over the years. For example, the term ‘BI’ itself is sometimes used in a narrow sense (only query, reporting, and analysis [QRA] technology) and at times in a broad sense to refer to the whole of what IDC calls business analytics (including data warehousing and analytic applications in addition to front-end tools). The term ‘analytics’ is relatively new and its meaning is often unclear: does it refer to advanced analytics including predictive analytics, optimisation and forecasting, or to analytic applications? In some submarkets, such as Web analytics, the term ‘analytics’ simply means a dashboard on top of some data.

For the purpose of this white paper, we interpret ‘BI’ to mean either QRA tools (its narrow definition) or ‘business analytics’ in IDC terminology (its broad definition). We interpret ‘analytics’ to mean either advanced analytics (data mining, statistics, optimisation and forecasting) or analytic applications (FPSM, CRM and marketing analytics, supply chain analytics, etc.).
Business Analytics is a combination of the above (and also includes data warehousing technologies), and this is highlighted by IDC’s Business Analytics Taxonomy for 2011 (see Figure 2 below):

Figure 2: IDC Business Analytics Taxonomy

Performance Management & Analytic Applications
- Financial Performance & Strategy Management: budgeting, planning, consolidation, profitability, strategy management
- CRM Analytic Applications: sales, customer service, contact centre, marketing, Web site analytics, price optimisation
- Supply Chain Analytic Applications: procurement, logistics, inventory, manufacturing
- Production Planning Analytic Applications: demand, supply, and production planning
- Services Operations Analytic Applications: financial services, education, government, healthcare, communications services, etc.
- Workforce Analytic Applications

Business Intelligence Tools
- Query, Reporting, and Analysis Tools: dashboards, production reporting, OLAP, ad-hoc query
- Advanced Analytics Tools: data mining and statistics
- Content Analysis Tools
- Spatial Information Analytics Tools

Data Warehouse Management Platform
- Data Warehouse Management
- Data Warehouse Generation: data extraction, transformation, loading; data quality

Source: IDC, 2011

Defining ‘Big Data’

Big Data is not so much about the content that is created, nor is it even about consumption. It is more about the analysis of the data and how that needs to be done. It is not really a ‘thing’, but instead a dynamic/activity that crosses many IT borders. IDC defines Big Data in this way:

“Big Data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high velocity capture, discovery and/or analysis.”

Figure 3: Defining ‘Big Data’
[Chart of data volumes against time. Labels: Unstructured Data (video, rich media, etc.); Semi-Structured Data (e.g. weblogs, social media feeds); Big, Complex, High Velocity & Wide Variety.]
Source: IDC, 2011

The Volume. Volume arises in two places, both mostly in the structured data realm. One is transactional data stores, linked to the ever-present electronic trail that individuals and businesses create in the wake of rapidly increasing online activity; sensory (machine-to-machine) data contributes here too. The other is existing data warehouses or data marts, which have over time grown to petabyte scale.

The Variety. The other aspect of this Big Data phenomenon is the need to analyse semi-structured and unstructured data. Text, video and other forms of media will require a completely different architecture and technologies to perform the required analysis. For example, if you look at the social media phenomenon, many marketing departments are looking at ways to do sentiment and brand analysis based on what is being posted on Facebook, Twitter and YouTube. This dynamic becomes more complex in Asia with local social media sites like RenRen in China and Nate in Korea.

The Velocity. There will also be demand to analyse this data on a more regular basis, for example taking into account all transactions rather than a sample to obtain a more complete view of risk on a trade in real time.

In summary, Big Data refers to data sets whose volume, variety, velocity and complexity make it impossible for current databases and architectures to store and manage them. IDC intentionally does not define Big Data as larger than a certain threshold (e.g. a number of terabytes), mainly because this threshold would be a moving target depending on the sector, and because it will obviously grow over time. More important is the value that organisations can derive from this phenomenon, and the resulting need to rethink their information strategies to extract that value.

Other Definitions: Hadoop, MapReduce, Key Value Store

With the focus on Big Data going mainstream, a range of new technologies has hit the market. The table below gives an overview of these technologies, with associated context (note that the list is not exhaustive).

Table 1: Big Data Technologies (Terminology)

Technology | Context
Big Table | Proprietary distributed database system built on the Google File System. Inspiration for HBase.
Cassandra | An open source (free) database management system designed to handle huge amounts of data on a distributed system. This system was originally developed at Facebook and is now managed as a project of the Apache Software Foundation.
Data Warehouse & Analytical Appliance | Consists of an integrated set of servers, storage, operating system(s), database, business intelligence, data mining and other software specifically pre-installed and pre-optimised for data warehousing.
Distributed System | Multiple computers, communicating through a network, used to solve a common computational problem. The problem is divided into multiple tasks, each of which is solved by one or more computers working in parallel. Benefits include an improved price:performance ratio, higher reliability and more scalability.
Google File System | Proprietary distributed file system developed by Google; part of the inspiration for Hadoop.
Hadoop | An open source (free) software framework for processing huge data sets on certain kinds of problems on a distributed system. Its development was inspired by Google’s MapReduce and Google File System. It was originally developed at Yahoo! and is now managed as a project of the Apache Software Foundation.
HBase | An open source (free) distributed, non-relational database modeled on Google’s Big Table. It was originally developed by Powerset and is now managed as a project by the Apache Software Foundation as part of Hadoop.
MapReduce | A software framework introduced by Google for processing huge data sets on certain kinds of problems on a distributed system. Also implemented in Hadoop.
Non-relational database / Key Value Store | A non-relational database is one that does not store data in tables (rows and columns), in contrast to a relational database. Key Value Stores allow for the management of schema-less (noSQL) entities.

Although some of these terms will be used throughout this white paper, the focus is not to examine them in too much detail because, as one IT executive recently mentioned, ‘to know the technology is one thing, but to apply it in the right environment is something entirely different’. The new technology needs to be tied back to business requirements as much as possible, rather than examined for its own sake. Having said that, most IT executives are not aware of the technologies and trends developing in this area, and where they are aware of them, their strategy is to put a couple of people in their enterprise architecture team to experiment with the new technologies (e.g. in-memory processing, Hadoop, MapReduce, Key Value Stores, etc.) that are being used to deal with the ‘Big Data’ phenomenon.
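To make the MapReduce entry in Table 1 more concrete, the following is a minimal single-process sketch of the programming model applied to a word count. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents: list[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence within one process."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data analytics", "big data skills"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, both steps can be spread across the computers of a distributed system, which is the property the table entry refers to.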

Big Data Analytics: The Old World vs. The New Era

Many have asked the question: what is new with this trend? This section highlights the traditional use of business analytics in the old ‘pre-Big Data’ world, versus Big Data analytics in the ‘Brave New World’. It will also look at the use cases that IDC expects to be most common across a variety of industries. The majority of IT organisations have progressed in terms of their infrastructure architectures over time: from predominantly mainframe-based environments in the 1980s, to a focus on client-server in the 1990s and the Web at the turn of the century, to what is now popularly known as ‘private cloud’. This supposed state of ‘nirvana’ constitutes a consolidated, virtualised set of infrastructure resources (server, storage and network) that can be self-provisioned in an automated fashion by business users, complete with SLAs whose security, performance, availability and cost profiles are transparent to all in the form of a service catalog. Very few organisations, if any, have achieved this state of infrastructure ‘nirvana’; most are still battling with a spaghetti-like tangle of compute resources in their datacenter. And now the external force of Big Data mentioned earlier is forcing CIOs to re-architect their infrastructure, particularly in the context of how analytics capabilities are deployed in an enterprise-wide fashion.

Below is an overview of the changes that IDC sees happening in the infrastructure world that are increasingly impacting the Big Data analytics world:

Table 2: Old World vs. New Era (Big Data Infrastructure)

Dimension | Old World | New Era
Tenancy | Infrastructure silos | Pooled resources
Architecture | Performance ‘tuned’ | Linear scalability (linked to distributed parallel processing and ‘in memory’ storage)
Delivery Model | On premise | Hybrid (with cloud bursting capabilities) and widespread use of the appliance

Based on IDC’s research in this space, here are three suggestions for CIOs in dealing with these issues:

Cloud Bursting. The private cloud journey will line up well with the enterprise-wide analytical requirements highlighted earlier, but CIOs need to ensure that workload assessments are conducted rigorously and that risk is mitigated where possible. Critical to this approach will be the evaluation of cloud bursting capabilities from external vendors (i.e. Infrastructure as a Service), particularly as organisations start to leverage more real-time analytics environments, to ensure that the use of infrastructure resources maps closely to demand and that there are no issues in terms of performance and availability.

Analytical Appliance. In terms of delivery models, IDC has seen significant performance benefits from analytical appliances for customers that are dealing with the impact of Big Data. In addition, since the software is optimised and pre-integrated with appliances, the deployment timeframes are typically shorter. As part of a recent global survey of CIOs, 10% of the respondents indicated that they will be looking at analytical appliances as a delivery model in 2011. IDC also believes that the demand for reference architectures will rise as CIOs look to integrate these appliances within existing data warehousing environments. In line with this increased adoption of the analytical appliance as a delivery model, IDC believes that IT departments will allocate less budget towards technical skills (i.e. installation, configuration and management), and more to the high-end analytical skills needed to help drive the necessary business impact across multiple functions.

Enterprise Architecture. Enterprise analytics needs an enterprise architecture that scales effectively with growth, and the rise of Big Data analytics means that this issue needs to be addressed more urgently. Organisations need to look at creating a ‘high performance analytical environment’ that leverages in-database analytics, parallel processing and in-memory storage to deal with the increased volume, velocity and variety of data. In particular, for dealing with unstructured data, more attention needs to be paid to Hadoop, an open source software framework managed as an Apache Software Foundation project that allows for the distributed processing of large data sets across clusters of computers. However, there will be an ongoing tension between global standards and local requirements, and the use of Hadoop would be a good example of this. Another would be the ability to process mixed workloads (e.g. analytical and operational) in the same infrastructure environment, such as the appliance that was mentioned earlier. CIOs need to consider ways in which they can deliver value in terms of solving specific business problems while at the same time being cognizant of global architecture standards and specifications. While certain global governance models will not allow for the usage of some of these technologies in a production environment, business expectations will force IT departments to re-assess the way the enterprise architecture agenda is applied at a local level.
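The cloud bursting suggestion above, where infrastructure use should map closely to demand, can be illustrated with a small placement rule: fill private capacity first and send only the overflow to an external provider. This is a hypothetical sketch; the names Placement and place_workload, the unit of capacity and the burst limit are all assumptions made for illustration, not part of any vendor's product.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """How one period's demand is split between the two environments."""
    private_units: int  # served from the private cloud
    burst_units: int    # overflow sent to the external IaaS provider


def place_workload(demand_units: int, private_capacity: int,
                   burst_limit: int) -> Placement:
    """Fill private capacity first, then burst the remainder externally.

    Raises RuntimeError if demand exceeds private capacity plus the agreed
    burst limit, which is where performance or availability would suffer.
    """
    if min(demand_units, private_capacity, burst_limit) < 0:
        raise ValueError("all quantities must be non-negative")
    private = min(demand_units, private_capacity)
    overflow = demand_units - private
    if overflow > burst_limit:
        raise RuntimeError("demand exceeds private capacity plus burst limit")
    return Placement(private_units=private, burst_units=overflow)


# Hypothetical hourly demand for an analytics workload, in arbitrary units.
hourly_demand = [40, 90, 130, 70]
plan = [place_workload(d, private_capacity=100, burst_limit=50)
        for d in hourly_demand]
for demand, placement in zip(hourly_demand, plan):
    print(demand, placement)
```

In this example only the third hour bursts, sending 30 units to the external provider. A workload assessment of the kind recommended above would decide which workloads are allowed to leave the private environment at all.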

The bottom line here is that it is getting more complicated to process and analyse these large, complex and growing data sets, and it essentially requires a re-assessment of the broader information management strategy for the majority of organisations that have started their business analytics journey. But the impact is potentially enormous. If you look at optimising the price on every item in a global retail chain or detecting fraud in real time, you get a sense of the type of problems that Big Data analytics can be used to solve.

Table 3: Old World vs. New Era (Big Data Analytics)

Dimension | Old World | New Era
Data Sets | Predefined | All-encompassing and iterative
Data Velocity | Batch | Proactive and dynamic (real-time where appropriate)
Data Analysis | Predominantly historic | Predictive, forecasting and optimisation

However, despite the clear potential of such analytics, it is important to understand that it will not necessarily be relevant or applicable to every use case. IDC believes that these use cases can be best mapped out across two of the Big Data dimensions, namely velocity and variety, as outlined below:

Figure 4: Potential Use Cases for Big Data Analytics
[Chart with Data Velocity on the vertical axis (up to real time) and Data Variety on the horizontal axis (up to unstructured).]
Real time:
- Credit & Market Risk in Banks
- Fraud Detection (Credit Card) & Financial Crimes (AML) in Banks (including Social Network Analysis)
- Event-based Marketing in Financial Services and Telecoms
- Markdown Optimisation in Retail
- Claims and Tax Fraud in Public Sector
Other use cases plotted:
- Predictive Maintenance in Aerospace
- Demand Forecasting in Manufacturing
- Traditional Data Warehousing
- Social Media Sentiment Analysis
- Disease Analysis on Electronic Health Records
- Text Mining
- Video d
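The real-time fraud detection use case in Figure 4 depends on keeping a history per entity (for example a card or account) and flagging behaviour that falls outside that entity's norm. The sketch below illustrates the idea with a simple z-score rule; the class name, the threshold of 3 standard deviations and the minimum history length are assumptions chosen for illustration, and production systems typically combine many predictive models rather than one rule.

```python
from collections import defaultdict
from statistics import mean, stdev


class EntityHistory:
    """Keeps past transaction amounts per entity and flags outliers."""

    def __init__(self, threshold: float = 3.0, min_history: int = 5):
        self.threshold = threshold      # z-score above which we flag (assumed)
        self.min_history = min_history  # observations needed before scoring
        self.amounts: dict[str, list[float]] = defaultdict(list)

    def is_outlier(self, entity_id: str, amount: float) -> bool:
        """Score the amount against the entity's history, then record it."""
        history = self.amounts[entity_id]
        flagged = False
        if len(history) >= self.min_history:
            spread = stdev(history)
            if spread > 0:
                z_score = abs(amount - mean(history)) / spread
                flagged = z_score > self.threshold
        # Simplification: flagged amounts are also kept in the history.
        history.append(amount)
        return flagged


detector = EntityHistory()
# Hypothetical point-of-sale amounts for a single card.
for amount in [20.0, 25.0, 22.0, 24.0, 21.0]:
    detector.is_outlier("card-1", amount)

typical = detector.is_outlier("card-1", 23.0)
suspicious = detector.is_outlier("card-1", 900.0)
print(typical, suspicious)
```

Here the 23.0 transaction sits within the card's usual range and is not flagged, while the 900.0 transaction is. Keeping such histories for every card, account, customer, terminal ID and IP address is what pushes this use case towards high velocity and large volume.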

A better sense of the potential impact of deploying Big Data analytics can be derived by exploring these use cases in more detail:

Real-time Fraud Detection in Banks. Involves the ability to detect, prevent and manage fraud across multiple products, lines of business and channels for a bank. This requires the ability to capture the history for different types of entities (e.g. card, account, customer, terminal ID or IP address) involved in transactions, improving accuracy in detecting customer behaviours that fall outside the norm during point-of-sale (POS) transactions. This information can be used by multiple predictive models for fraud detection and credit risk assessment.

Markdown Optimisation in Retail. The ability for retailers to optimise prices for a wide range of products in real time based on demand forecasting scenarios (that include the impact of promotions, seasonality and important calendar events) has a major impact on margins. These capabilities can also be augmented by social media sentiment analysis to ascertain customer demand for certain products on a more real-time basis.

Disease Analysis on Electronic Health Records. As healthcare services evolve, analysts can get hold of a patient’s entire medical history in electronic format. This will present a major opportunity for Big Data analytics. For example, in the case of a disease such as diabetes, the ability to correlate patient medical history with dietary data (potentially from market basket analysis in retail) and optimised exercise schedules will provide medical practitioners with insights that they had previously only dreamt of.

The Skill Factor

As highlighted earlier, IDC believes that the real value from Big Data will be derived from the high-end analytics that can be performed on the increasing volume, velocity and variety of data that organisations are generating.
In Asia (outside some of the MNCs, because this is mainly being driven out of the US and Europe), most organisations are not aware of the type and level of skills that are required. IDC also believes that this is linked to the general lack of awareness and skill available historically in the high-end analytics arena (regardless of the Big Data phenomenon).

High-end analytics will require new sets of skills in two key categories:

Technical skills. These are needed for the new class of technologies required to process, discover and analyse these massive data sets that cannot be dealt with using traditional databases and architectures (e.g. in-memory processing, Hadoop, MapReduce, Key Value Stores, etc.). Some of these technologies will be delivered as an appliance, and skills to better understand how the software interacts with the hardware to leverage the data will be required.

The new type of business analyst/statistician. One of the key differences between analytics in the ‘Old World’ and what we are dealing with in the Big Data era is that we are gathering data that we may or may not need. From the perspective of analysis, this means ‘we don’t know what we don’t know’, i.e. there is so much unstructured data that the variables and analytical models are likely to be entirely new. This means that there is a need to re-think the way analytical power users approach their work by creating a ‘Sandbox Mentality’ where discovery is always the starting point. Generally, a background in data mining and statistics would be a good starting point for this type of analysis. Moving forward, there will be increasing demand for ‘data scientists’: next-generation business analysts with strong statistical skills who are able to extract information from large data sets and present value to non-analytical experts, and who have the unique skill of understanding the new algorithms and analytical models that will have the most significant business impact in the short term. Globally, IDC is seeing a lot of interest in this more analytically inclined skill set. Roles and responsibilities have not yet been defined, which fits with the earlier point that ‘we don’t know what we don’t know’. It requires a very ‘out-of-the-box’ type of thinking and creativity in terms of the analytics that need to be done on these new data types and structures.

For example, if you look at the social media phenomenon (contributing to the semi-structured and unstructured data part of Big Data), many marketing departments are looking at ways to do sentiment and brand analysis based on what is being posted on Facebook, Twitter and YouTube (massive amounts, as you can expect). This dynamic becomes more complex in Asia with local social media sites like RenRen in China and Nate in Korea.
Currently, IT is not the first port of call for the chief marketing officer since it lacks the skills to understand what needs to be done (and in many cases is still trying to work out what role it should play in the policy or governance of the use of social media). So the make-up of the IT department needs to be re-assessed in terms of technical, business and relationship skills.

The maturity model below highlights how IDC sees these skills (both technical and business) mapping out in the context of the organisations that have adopted business analytics over time, with a view to how this could evolve in the era of Big Data analytics:

Figure 5: The Big Data Analytics Maturity Model

Each row lists four phases in order, from the Old World (1) to the New Era of Big Data Analytics (4).

Staff Skills (IT): (1) Little or no expertise in analytics, basic knowledge of BI tools; (2) Data warehouse team focused on performance, availability and security; (3) Advanced data modelers and stewards a key part of the IT department; (4) Business Analytics Competency Centre (BACC) that includes ‘data scientists’.

Staff Skills (Business/IT): (1) Functional knowledge of BI tools; (2) Few business analysts, limited usage of advanced analytics; (3) Savvy analytical modelers and statisticians utilised; (4) Complex problem solving integrated into Business Analytics Competency Centre (BACC).

Technology & Tools: (1) Simple historical BI reporting and dashboards; (2) Data warehouse implemented, broad usage of BI tools, limited analytical data marts; (3) In-database mining, and limited usage of parallel processing and analytical appliance; (4) Widespread adoption of appliance for multiple workloads, architecture and governance for emerging technologies.

Financial Impact: (1) No substantial financial impact, no ROI models in place; (2) Certain revenue generating KPIs in place with ROI clearly understood; (3) Significant revenue impact (measured and monitored on a regular basis); (4) Business strategy and competitive differentiation is based on analytics.

Data Governance: (1) Little or none (Skunk works); (2) Initial data warehouse model and architecture; (3) Data definitions and models standardised; (4) Clear master data management strategy.

Line of Business (LOB): (1) Frustrated; (2) Visible; (3) Aligned (including LOB executives); (4) Cross-departmental (with CEO formative

% of Customers (IDC Estimates): 20%, 65%

In terms of capturing and developing the right skills in the era of Big Data analytics, t
