
Building a high-performance data and AI organization

How data and analytics leaders are delivering business results with cloud data and AI platforms

Produced in partnership with Databricks

MIT Technology Review Insights

Preface

“Building a high-performance data and AI organization” is an MIT Technology Review Insights report sponsored by Databricks. To produce this report, MIT Technology Review Insights conducted a global survey of 351 chief data officers, chief analytics officers, chief information officers, and other senior technology executives. The respondents are evenly distributed among North America, Europe, and Asia-Pacific. There are 14 sectors represented in the sample and all respondents work in organizations earning $1 billion or more in annual revenue. The research also included a series of interviews with executives who have responsibility for their organizations’ data management, analytics, and related infrastructure. Denis McCauley was the author of the report, Francesca Fanshawe was the editor, and Nicola Crepaldi was the producer. The research is editorially independent, and the views expressed are those of MIT Technology Review Insights.

We would like to thank the following individuals for providing their time and insights:

Patrick Baginski, Senior Director Data Science, McDonald’s (United States)
Bob Darin, Chief Data Officer, CVS Health, and Chief Analytics Officer, CVS Pharmacy (United States)
Naveen Jayaraman, Vice President – Data, CRM & Analytics, L’Oréal (United States)
Michel Lutz, Group Chief Data Officer, Total (France)
Mainak Mazumdar, Chief Data and Research Officer, Nielsen (United States)
Andy McQuarrie, Chief Technology Officer, Hivery (Australia)
Sol Rashidi, Chief Analytics Officer, The Estée Lauder Companies (United States)
Ashwin Sinha, Chief Data and Analytics Officer, Macquarie Bank (Australia)
Don Vu, Chief Data Officer, Northwestern Mutual (United States)

Contents

01 Executive summary
02 Growth and complexity
    Databricks perspective: The rise of the lakehouse effect
03 Aligning and delivering on strategy
    Data high-achievers
    Nielsen: data transformation for a data-reliant business
04 Scaling analytics and machine learning
    A paradigm shift at CVS Health
    Barriers to scale
    Protecting return on investment
    Technology, democracy, and culture
05 Visions of the future
    A CDO wish-list for a new architecture
06 Conclusion

01 Executive summary

CxOs and boards recognize that their organization’s ability to generate actionable insights from data, often in real time, is of the highest strategic importance. If there were any doubts on this score, consumers’ accelerated flight to digital in this past crisis year has dispelled them. To help them become data driven, companies are deploying increasingly advanced cloud-based technologies, including analytics tools with machine learning (ML) capabilities. What these tools deliver, however, will be of limited value without abundant, high-quality, and easily accessible data.

In this context, effective data management is one of the foundations of a data-driven organization. But managing data in an enterprise is highly complex. As new data technologies come on stream, the burden of legacy systems and data silos grows, unless they can be integrated or ring-fenced. Fragmentation of architecture is a headache for many a chief data officer (CDO), due not just to silos but also to the variety of on-premise and cloud-based tools many organizations use. Along with poor data quality, these issues combine to deprive organizations’ data platforms (and the machine learning and analytics models they support) of the speed and scale needed to deliver the desired business results.

To understand how data management and the technologies it relies on are evolving amid such challenges, MIT Technology Review Insights surveyed 351 CDOs, chief analytics officers (CAOs; we refer to these and CDOs as “data leaders” at various points in the report) as well as chief information officers (CIOs), chief technology officers (CTOs), and other senior technology leaders. We also conducted in-depth interviews with several other senior technology leaders. Following are the key findings of this research:

Just 13% of organizations excel at delivering on their data strategy. This select group of “high-achievers” delivers measurable business results across the enterprise. They are succeeding thanks to their attention to the foundations of sound data management and architecture, which enable them to “democratize” data and derive value from machine learning. The foundations ensure reduced data duplication, easy access to relevant data, the ability to process large amounts of data at high speeds, and improved data quality. The high-achievers are also advanced cloud adopters, with 74% running half or more of their data services or infrastructure in a cloud environment.

Technology-enabled collaboration is creating a working data culture. The CDOs interviewed for the study ascribe great importance to democratizing analytics and ML capabilities. Pushing these to the edge with advanced data technologies will help end-users to make more informed business decisions, the hallmark of a strong data culture. This is only possible with a modern data architecture. One CDO sums it up by saying that successful data management is achieved when the right users have access to the right data to quickly generate insights that drive business value.

ML’s business impact is limited by difficulties managing its end-to-end lifecycle. Scaling ML use cases is exceedingly complex for many organizations. The most significant challenge, according to 55% of respondents, is the lack of a central place to store and discover ML models. That absence, along with error-prone hand-offs between data science and production and a lack of skilled ML resources (both cited by 39% of respondents), suggests severe difficulties in making collaboration between ML, data, and business-user teams a reality.

Enterprises seek cloud-native platforms that support data management, analytics, and machine learning. Organizations’ top data priorities over the next two years fall into three areas, all supported by wider adoption of cloud platforms: improving data management, enhancing data analytics and ML, and expanding the use of all types of enterprise data, including streaming and unstructured data. For “low-achievers” (organizations having difficulty delivering on data strategy), improving data management overshadows all other priorities, cited by 59% of this group. Most high-achievers, by contrast (53%), are focused on advancing their ML use cases.

Open standards are the top requirement of future data architecture strategies. If respondents could build a new data architecture for their business, the most critical advantage over the existing architecture would be a greater embrace of open-source standards and open data formats. Data leaders now realize the value of open-source standards to accelerate innovation and enable choice in leveraging best-of-breed third-party tools. Stronger security and governance, not surprisingly, are also near the top of respondents’ list of requirements.

02 Growth and complexity

The pace of change in how organizations manage their data has been both breathtaking and frustrating. Once viewed by senior management as a byproduct of operations, data is now regarded as a supreme driver of business value. The volumes of data generated continue to grow at a rapid pace across structured, semi-structured, and unstructured data types that businesses are now able to store and need to analyze.

Whereas not long ago organizations relied on a few technology giants to meet their needs for data infrastructure and tools, enterprise customers today are spoiled for choice from among hundreds of providers in a vast data ecosystem. These players continuously develop new analytics tools, now often powered by machine learning, that parse data at unprecedented speed, depth, and sophistication. Ever-expanding clouds provide organizations with vast space to store, and enormous power to crunch, their data, and in an increasingly cost-efficient manner. Last but not least, new roles and structures have emerged at different levels (witness the rise of chief data officers (CDOs) and chief analytics officers (CAOs), among others) to channel the organization’s data capabilities toward creating new business value aligned with its strategic objectives.

“It used to be difficult and costly to me to get data about many elements of our customer experience,” says Bob Darin, chief data officer of CVS Health (and chief analytics officer of CVS Pharmacy). “Now I can get insights about our customers, about our supply chain, about how people work that I just couldn’t capture before. We have all the tools to analyze that data at scale, and the cost of those tools is coming down. This allows us to develop insights at a great scale and integrate them, so they are part of our patient and customer workflows, enabling us to provide a more personalized and relevant experience for our customers.”

Cloud, once considered an optional technology environment, is today the foundation for modernizing data management, providing ever greater storage and computing power at declining cost. Among the companies in our survey, 63% use cloud services or infrastructure widely in their data architecture. Of these, just over one-third (34%) operate multiple clouds.

Nevertheless, frustrations abound with data management. As enterprises seek to upgrade their data platforms, many remain saddled by legacy, on-premise silos that resist easy integration, incur high costs, or cause problems owing to data duplication and poor quality. This creates a good deal of complexity when it comes to data infrastructure. The cloud, for all its game-changing impact, can also increase complexity as organizations continue to store their data with multiple providers to hedge vendor lock-in, meet regional needs, or optimize for best-of-breed solutions. And data architectures have evolved in a relatively short space of time so that organizations may simultaneously be using on-premise databases, data warehouses, data lakes, or other emerging data architectures along with different cloud-based tools performing configuration, governance, or other functions.

“Architectures have gotten really complicated, but only because we tend to over-complicate them,” says Sol Rashidi, chief analytics officer at The Estée Lauder Companies. “We do this because we lose sight of what matters most. We too often bring in the latest and greatest in technology and platforms, thinking they will solve the problem. But unless the business is ready to leverage the tools, has the maturity to extract the insights, and processes and logic are agreed upon, we’re only adding to the spaghetti architecture.”

If organizations are unable to manage the complexity, the consequences are usually a combination of missed opportunities (in the failure of ML use cases to deliver returns, for example), higher costs (such as from administering and supporting multiple overlapping systems), difficulty meeting the growing regulatory requirements on data, and, ultimately, considerable exposure to competition.

Nevertheless, as our research makes clear, enthusiasm and optimism outweigh any sense of frustration among data and technology leaders when it comes to their present and future ability to manage data effectively for their business.

Partner perspective

Databricks perspective: The rise of the lakehouse effect

Every company feels the pull to become a data company, and they are placing increasing importance on AI to deliver on the tremendous business potential it can offer. But, as indicated in this report, only 13% of organizations today are succeeding at delivering on their enterprise data strategy. Data and analytics leaders attribute much of their success to having a solid handle on data management basics. So why do so many others struggle?

The challenge starts with the data architecture. The research suggests organizations need to build four different stacks to handle all of their data workloads: business analytics, data engineering, streaming, and ML. All four of these stacks require very different technologies and, unfortunately, they sometimes don’t work well together.

The technology ecosystem across data warehouses and data lakes further complicates the architecture. It ends up being expensive and resource-intensive to manage. That complexity impacts data teams. Data and organizational silos can accidentally slow communication, hinder innovation, and create different goals amongst the teams. The result is multiple copies of data, no consistent security/governance model, closed systems, and less productive data teams.

Meanwhile, ML remains an elusive goal. With the emergence of lakehouse architecture, organizations are no longer bound by the confines and complexity of legacy architectures. By combining the performance, reliability, and governance of data warehouses with the scalability, low cost, and workload flexibility of the data lake, lakehouse architecture provides a flexible, high-performance design for diverse data applications, including real-time streaming, batch processing, SQL analytics, data science, and ML.

At Databricks, we bring the lakehouse architecture to life through the Databricks Lakehouse Platform. The key enabler behind this innovation is Delta Lake. Delta Lake is at the core of the platform, and it creates curated data lakes that add reliability, performance, and governance from data warehouses directly to the existing data lake.

The Databricks Lakehouse Platform excels in 3 ways:

It’s simple: Data only needs to exist once to support all workloads on one common platform.
It’s open: Based on open source and open standards, it’s easy to work with existing tools and avoid proprietary formats.
It’s collaborative: Data engineers, analysts, and data scientists can work together and more efficiently.

The cost savings, efficiencies, and productivity gains offered by the Databricks Lakehouse Platform are already making a bottom-line impact on enterprises in every industry and geography. Freed from overly complex architecture, Databricks provides one common cloud-based data foundation for all data and workloads across all major cloud providers. Organizations get a better grasp on enterprise-wide data management. Data and analytics leaders can foster a data-driven culture that focuses on adding value by relieving the daily grind of planning and all its complexities, with predictive maintenance.

From video streaming analytics to customer lifetime value, and from disease prevention to finding life on Mars, data is part of the solution. Understanding data is the key that opens the doors.

[Figure: Evolution of data architectures, from the data warehouse (late 1980s) to the data lake (2011) to the lakehouse (2020), each built on structured, semi-structured, and unstructured data. Source: Databricks]

03 Aligning and delivering on strategy

Amid a global economic downturn of a scale not seen for nearly a century, businesses might be expected to be reining in their ambitions and focusing on the bottom line. Many of those represented in our survey, however, appear growth oriented. When asked about the most important business objectives they have set for their enterprise data strategy over the next two years, more respondents stress top-line growth, in the form of expanded sales and service channels (cited by 45%), than point to improved efficiency (43%). Following closely (at 42%) is improving innovation and reducing time to market of new or improved products.

[Figure 1: Companies’ most important business objectives for enterprise data strategy over the next two years (top responses; % of respondents), shown for Total, North America, Europe, and Asia-Pacific. Objectives charted: expand sales and services channels; improve operational efficiency; improve innovation and reduce time to market; improve maintenance of physical assets; enter new product or service markets; improve ESG. Source: MIT Technology Review Insights survey, 2021]

A look at the surveyed firms’ principal data initiatives over the next two years suggests a substantial degree of alignment with a growth-oriented business strategy. It also reflects their recognition of the urgency of improving data management in order to support those business objectives. The wider adoption of cloud-native platforms will underpin this and other initiatives.

The most frequently cited priority is achieving better data management by improving data quality and processing, mentioned by 48% of respondents. (That figure is 74% among those working in oil and gas companies and 67% in consumer products firms.) Such efforts are critical to enabling growth-oriented efforts, such as those driven by ML, to move ahead at speed. For Hivery, an Australia-based retail technology firm whose products are powered by artificial intelligence (AI), the quality of its customers’ data matters even more than its ability to ingest large volumes. In fact, says Andy McQuarrie, Hivery’s chief technology officer, the cleaner its customers’ data, the fewer ingestion problems it encounters.

The other top data priorities of the surveyed firms (increasing the adoption of cloud platforms, cited by 43%; enhancing data analytics, 43%; and expanding the application of ML, 42%), if met, will provide data teams with additional capacity, power, and scale to, among other things, quickly tap new sales and service opportunities and support new data product development. They also, of course, fully support the goal of improving operational efficiency. Another priority (cited by 38%) is expanding the use of streaming, unstructured, and other varieties of data.

[Figure 2: Companies’ most important enterprise-wide data strategy initiatives over the next two years (top responses; % of respondents), shown for Total, North America, Europe, and Asia-Pacific. Initiatives charted: improve data quality and processing; increase adoption of cloud platforms; enhance data analytics; expand application of ML; expand usage of all data (e.g. streaming and unstructured data). Source: MIT Technology Review Insights survey, 2021]

That data strategy should be closely aligned with the overall business objectives seems self-evident today, but the importance of alignment has not always been clear. According to Don Vu, chief data officer of US financial services firm Northwestern Mutual, alignment of data and business strategy has become much tighter at many companies as CDOs have exerted their influence, and data responsibilities have been brought together in streamlined organizational structures. At his firm, says Vu, “people knew that alignment was important, but that became crystallized as our teams dug deeper into how we’re actually going to deliver on the various business strategy initiatives. The link to business strategy from notions such as trust in essential sources of truth, or democratizing the use of data, became much clearer.”

Data high-achievers

Not many large enterprises excel at data management. This is reflected in the survey, where only 13% of respondents rate their organization’s performance highly when it comes to delivering on data strategy, scoring it at the top end (9-10) of a 1-10 scale. These data “high-achievers” deliver with measurable business impact across multiple business units, say their executives. They are contrasted with a similarly sized group of “low-achievers” (12% of the sample), whose data performance is rated at 6 or lower on the scale.

[Figure 3a: The extent to which organizations are successfully delivering on the enterprise data strategy (self-assessed rating on a 1-10 scale). Source: MIT Technology Review Insights survey, 2021]

Figure 3b: High-achievers: respondents rating their organizations 9 or 10 on their delivery of enterprise data strategy, with measurable business impact across multiple business units (total, regions and selected industries)

Total: 13%
North America: 15%
Asia-Pacific: 13%
Europe: 11%
Financial services: 21%
Government/public sector: 20%
Life sciences & health care: 17%
Oil and gas: 17%
Automotive & transportation: 16%
Telecom: 16%

Source: MIT Technology Review Insights survey, 2021

Large gaps separate these two groups in certain attributes as well as intentions. For example, cloud features more prominently in the data architecture of high-achievers: 74% of this group run at least half of their data services or infrastructure in a cloud environment, compared with 60% of low-achievers who do the same. When it comes to data priorities, most low-achievers (59%) are focused on improving data management (data quality and processing) over the next two years, while high-achievers’ most frequently cited initiative (by 53%) is expanding the application of ML.

Far from taking the basics for granted, the high-achievers attribute their success to their close attention to the foundations of sound data management. These include the reduction of data duplication, ease of data access for enterprise end-users (a hallmark of data “democratization”), and the processing of large data volumes at high speeds.

Figure 4: The main success factors enabling “high-achiever” organizations to deliver on their data strategy initiatives (top responses; % of respondents)

Data duplication reduced: 47%
Ease of data access: 38%
Fast processing of large amounts of data: 36%
Data quality improved: 31%
Easy collaboration across cross-functional teams on all analytical use cases: 20%
Ability to do analytics on all data wherever it resides: 20%

Source: MIT Technology Review Insights survey, 2021

Duplication of data in large organizations happens at multiple levels such as data warehouses, operational systems, reports, dashboards, and desktop tools. This has significant cost, risk management, and reliability implications, says Ashwin Sinha, chief data and analytics officer at Macquarie Bank. Data duplication also impacts the ability to scale and make effective use of machine learning across the organization.

Asked what’s holding back their progress, the largest percentage of low-achievers point to limited scalability of their data management platform. Other often-cited impediments are slow processing of large data volumes and difficulties in facilitating collaboration. Achieving scale, speed, and collaboration, as we will see, are challenges for organizations right across the span of data operations.

Figure 5: The main challenges keeping “low-achiever” organizations from delivering on their data strategy initiatives (top responses; % of respondents)

Data management platform does not easily scale: 44%
Slow processing of large amounts of data: 39%
Hard for cross-functional teams to collaborate on all analytics use cases: 29%
High data duplication: 22%
Complex and fragmented tools for ML: 20%

Source: MIT Technology Review Insights survey, 2021

Nielsen: data transformation for a data-reliant business

It is hard to overestimate the importance of sound data management to Nielsen, one of a few century-old organizations in which data has been central to the business model from day one. Nielsen’s panels tell consumer goods companies what products customers are buying and how behaviors are changing. The panels also advise such companies on where and when they should place their television advertising.

Now in his second year as the firm’s chief data officer and fifth as chief research officer, Mainak Mazumdar has presided over a transformation of its data management and infrastructure. “Just a few years ago,” he recalls, “we struggled with fragmentation (lots of data in silos and tribal knowledge needed to access it), a lack of metadata and very little governance, all while data volumes were growing by petabytes each day.”

Mazumdar paints a different picture today: “Now we’re able to scale quickly from having 20-30 specialists on a platform to 300-plus. We’re on a cloud platform with a data lake where data is curated, labelled, defined, has metastores, and is consolidated. We’ve built our own analytics engine. In fact, much of what was done by software engineers in the past is now done by my team, deploying directly into production.” The changes, says Mazumdar, have reduced his team’s cycle time by 50%. “The speed of our models is now about 50x. What used to take 20 minutes we can now do in a minute or less. At the same time, we are ingesting and processing massive amounts of data, which are easily accessible to data science. It’s a huge change.”

An example of how Nielsen has used these capabilities to enable growth is the roll-out of a new ratings product in the company’s roughly 200 local markets in the US. Crunching large amounts of data from TV set-top boxes, the “high recognition deep-learning model” enables Nielsen to predict for customers not only what a viewer is likely to watch at any given time, but also who in each household is doing the watching, something not previously possible. “We could not have rolled out this product without the changes made to how we manage data, and we could not have ingested such volumes of data,” says Mazumdar. “It’s with that ingestion of ever greater volumes of data that the models, and the product, get better and better.”

04 Scaling analytics and machine learning

Business leaders know that their company’s ability to keep pace with and anticipate demand, to manage competitive pressures, to innovate effectively, and to operate efficiently is coming to rest on their mastery of analytics and ML. Organizations in virtually every industry are busy developing analytics and ML use cases that will deliver greater business impact. For most large enterprises, a wide portfolio of use cases in production and at scale is no longer a nice-to-have but a must-have. CDOs and their teams are increasingly judged on their contributions to delivering such cases.

Many organizations struggle with this, and particularly with achieving the scale needed to generate a sizeable impact. According to Sol Rashidi of Estée Lauder, one reason is over-ambition: “Too often companies want to skip crawling and walking with ML and go straight to running, without having mastered the basics.” For other CDOs, such as Don Vu of Northwestern Mutual, the key challenges lie in selecting the right use cases to deploy into production. Without business user input, he says, the probability rises of selecting cases that do not clearly map to a business objective.

Figure 6: The main difficulties companies encounter in scaling ML use cases (top responses; % of respondents)

No central place to store and discover ML models: 55%
Numerous types of deployments and error-prone hand-offs between data science and production: 39%
Lack of ML expertise: 39%
A plethora of tools and frameworks: 32%
Hard to explain and govern ML models: 28%
Outdated models because of infrequently refreshed data: 27%
Access to relevant quality data: 11%

Source: MIT Technology Review Insights survey, 2021

A paradigm shift at CVS Health

Pharmacies have always played an essential role in societies, but arguably never more so than during the past year. Pharmacy chains such as CVS, America’s largest by revenue, are using ever more advanced data capabilities to ensure that their customers are up to date on their medications and use of other health services, lest underlying conditions lead to more serious health consequences. Bob Darin is leading many of those efforts as chief data officer of CVS Health and as chief analytics officer of its retail pharmacy business.

The company has long used data systems to prod customers to stay current with their medications. This has involved, for example, patient outreach through phone calls and texts, prompts at the pharmacy counter, or recommendations for the patient to talk with their health-care provider about specific follow-up or medication reviews. In recent years, those initiatives have become embed
