Big Data Fundamentals - Pearsoncmg

Big Data Fundamentals
Concepts, Drivers & Techniques

Thomas Erl, Wajid Khattak, and Paul Buhler

BOSTON, COLUMBUS, INDIANAPOLIS, NEW YORK, SAN FRANCISCO, AMSTERDAM, CAPE TOWN, DUBAI, LONDON, MADRID, MILAN, MUNICH, PARIS, MONTREAL, TORONTO, DELHI, MEXICO CITY, SAO PAULO, SYDNEY, HONG KONG, SEOUL, SINGAPORE, TAIPEI, TOKYO

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at corpsales@pearsoned.com or (800) 382-3419.

For government sales inquiries, please contact governmentsales@pearsoned.com.

For questions about sales outside the U.S., please [...]

Visit us on the Web: informit.com

Library of Congress Control Number: 2015953680

Copyright 2016 Arcitura Education Inc.

All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearsoned.com/permissions/.

ISBN-13: 978-0-13-429107-9
ISBN-10: 0-13-429107-7

Text printed in the United States on recycled paper at RR Donnelley in Crawfordsville, Indiana.

First printing: December 2015

Editor-in-Chief: Mark Taub
Senior Acquisitions Editor: Trina MacDonald
Managing Editor: Kristy Hart
Senior Project Editor: Betsy Gratner
Copyeditors: Natalie Gitt, Alexandra Kropova
Debbie Williams
Senior Indexer: Cheryl Lenser
Publishing Coordinator: Olivia Basegio
Cover Designer: Thomas Erl
Photos: Thomas Erl
Educational Content Development: Arcitura Education Inc.
Compositor: Bumpy Design
Graphics: Jasper Paladino

To my family and friends.
-- Thomas Erl

I dedicate this book to my daughters Hadia and Areesha, my wife Natasha, and my parents.
-- Wajid Khattak

I thank my wife and family for their patience and for putting up with my busyness over the years. I appreciate all the students and colleagues I have had the privilege of teaching and learning from.
John 3:16, 2 Peter 1:5-8
-- Paul Buhler, PhD

Contents at a Glance

PART I: THE FUNDAMENTALS OF BIG DATA
  CHAPTER 1: Understanding Big Data
  CHAPTER 2: Business Motivations and Drivers for Big Data Adoption
  CHAPTER 3: Big Data Adoption and Planning Considerations
  CHAPTER 4: Enterprise Technologies and Big Data Business Intelligence

PART II: STORING AND ANALYZING BIG DATA
  CHAPTER 5: Big Data Storage Concepts
  CHAPTER 6: Big Data Processing Concepts
  CHAPTER 7: Big Data Storage Technology
  CHAPTER 8: Big Data Analysis Techniques

APPENDIX A: Case Study Conclusion
About the Authors
Index

Contents

Acknowledgments
Reader Services

PART I: THE FUNDAMENTALS OF BIG DATA

CHAPTER 1: Understanding Big Data
  Concepts and Terminology
    Datasets
    Data Analysis
    Data Analytics
      Descriptive Analytics
      Diagnostic Analytics
      Predictive Analytics
      Prescriptive Analytics
    Business Intelligence (BI)
    Key Performance Indicators (KPI)
  Big Data Characteristics
    Volume
    Velocity
    Variety
    Veracity
    Value
  Different Types of Data
    Structured Data
    Unstructured Data
    Semi-structured Data
    Metadata
  Case Study Background
    History
    Technical Infrastructure and Automation Environment
    Business Goals and Obstacles
  Case Study Example
    Identifying Data Characteristics
      Volume
      Velocity
      Variety
      Veracity
      Value
    Identifying Types of Data

CHAPTER 2: Business Motivations and Drivers for Big Data Adoption
  Marketplace Dynamics
  Business Architecture
  Business Process Management
  Information and Communications Technology
    Data Analytics and Data Science
    Digitization
    Affordable Technology and Commodity Hardware
    Social Media
    Hyper-Connected Communities and Devices
    Cloud Computing
  Internet of Everything (IoE)
  Case Study Example

CHAPTER 3: Big Data Adoption and Planning Considerations
  Organization Prerequisites
  Data Procurement
  Privacy
  Security
  Provenance
  Limited Realtime Support
  Distinct Performance Challenges
  Distinct Governance Requirements
  Distinct Methodology
  Clouds
  Big Data Analytics Lifecycle
    Business Case Evaluation
    Data Identification
    Data Acquisition and Filtering
    Data Extraction
    Data Validation and Cleansing
    Data Aggregation and Representation
    Data Analysis
    Data Visualization
    Utilization of Analysis Results
  Case Study Example
    Big Data Analytics Lifecycle
    Business Case Evaluation
    Data Identification
    Data Acquisition and Filtering
    Data Extraction
    Data Validation and Cleansing
    Data Aggregation and Representation
    Data Analysis
    Data Visualization
    Utilization of Analysis Results

CHAPTER 4: Enterprise Technologies and Big Data Business Intelligence
  Online Transaction Processing (OLTP)
  Online Analytical Processing (OLAP)
  Extract Transform Load (ETL)
  Data Warehouses
  Data Marts
  Traditional BI
    Ad-hoc Reports
    Dashboards
  Big Data BI
    Traditional Data Visualization
    Data Visualization for Big Data
  Case Study Example
    Enterprise Technology
    Big Data Business Intelligence

PART II: STORING AND ANALYZING BIG DATA

CHAPTER 5: Big Data Storage Concepts
  Clusters
  File Systems and Distributed File Systems
  NoSQL
  Sharding
  Replication
    Mas

decision to adopt Big Data must take into account many business and technology considerations. This underscores the fact that Big Data opens an enterprise to external data influences that must be governed and managed. Likewise, the Big Data analytics lifecycle