Big Data Management and Security

Transcription

Big Data Management and Security
Audit Concerns and Business Risks
Tami Frankenfield
Sr. Director, Analytics and Enterprise Data
Mercury Insurance

What is Big Data?
The four Vs: Velocity, Volume, Variety, Value

The Big Data Journey
Big Data is the next step in the evolution of analytics to answer critical and often highly complex business questions. However, that journey seldom starts with technology and requires a broad approach to realize the desired value. The journey typically progresses through stages:
- Data Management: establish initial processes and standards
- Reporting: report on standard business processes
- Business Intelligence: focus on what happened and, more importantly, why it happened
- Modeling and Predicting: leverage information for forecasting and predictive purposes
- "Big Data": leverage large volumes of multi-structured data for advanced data mining and predictive purposes
- "Fast Data": analyze streams of real-time data, identify significant events, and alert other systems

Implications of Big Data?
Enterprises face the challenge and opportunity of storing and analyzing Big Data:
- Handling more than 10 TB of data
- Data with a changing structure, or no structure at all
- Very high-throughput systems: for example, globally popular websites with millions of concurrent users and thousands of queries per second
- Business requirements that differ from the relational database model: for example, swapping ACID (Atomicity, Consistency, Isolation, Durability) for BASE (Basically Available, Soft State, Eventually Consistent)
- Processing of machine-learning queries that are inefficient or impossible to express in SQL

"Shift thinking from the old world where data was scarce, to a world where business leaders demonstrate data fluency" - Forrester
"Information governance focus needs to shift away from more concrete, black and white issues centered on 'truth', toward more fluid shades of gray centered on 'trust.'" - Gartner
"Enterprises can leverage the data influx to glean new insights – Big Data represents a largely untapped source of customer, product, and market intelligence" – IBM CIO Study
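The ACID-for-BASE trade-off above can be sketched in a few lines: replicas acknowledge writes immediately ("basically available") and converge later via a background sync ("eventually consistent"). This is a minimal in-process illustration, not any particular database's API; all class and key names are made up.

```python
# Minimal sketch of BASE semantics: a write lands on one replica
# first (soft state), and a background anti-entropy pass brings
# the remaining replicas up to date (eventual consistency).

class Replica:
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # writes not yet propagated everywhere

    def write(self, key, value):
        # Acknowledge after updating just one replica.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def anti_entropy(self):
        # Background sync that converges all replicas.
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("policy:42", "active")
stale = store.replicas[2].read("policy:42")  # None: not yet propagated
store.anti_entropy()
fresh = store.replicas[2].read("policy:42")  # "active" after convergence
```

An ACID system would instead block the write until all replicas (or a quorum) agreed, trading availability and throughput for the stale-read window shown here.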

Taking a Look at the Big Data Ecosystem
Big Data is supported and moved forward by a number of capabilities throughout the ecosystem. In many cases, vendors and resources play multiple roles and are continuing to evolve their technologies and talent to meet changing market demands. The ecosystem spans:
- Big Data Analytics
- BI / Data Visualization
- Big Data Integration
- Appliances
- Big Data File and Database Management
- Stream Processing and Analysis

Big Data Storage and Management
A Hadoop-based solution is designed to leverage distributed storage and a parallel processing framework (MapReduce) to address the big data problem. Hadoop is an Apache Foundation open source project.

Apache Hadoop ecosystem components:
- HDFS (storage layer)
- HBase (distributed database)
- MapReduce (job scheduling and processing engine)
- Hive (SQL)
- Pig (data flow)
- Flume, Sqoop (data integration)
- ZooKeeper (coordination)
- Oozie (workflow and scheduling)
- UI framework and SDK
- Connectors to a traditional data warehouse (DW), traditional databases, advanced analytics tools, and MPP or in-memory solutions
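The MapReduce framework named above can be sketched in-process: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group. Real Hadoop distributes these phases across the cluster; the function names and the word-count task here are illustrative only.

```python
# In-process sketch of the MapReduce programming model:
# map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(documents):
    # Emit (word, 1) for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big value", "fast data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'value': 1, 'fast': 1}
```

In Hadoop, the shuffle step is what moves data across the network between mapper and reducer nodes, which is why job design focuses on minimizing intermediate output.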

Big Data Storage and Management
The need for Big Data storage and management has resulted in a wide array of solutions, spanning advanced relational databases, non-relational databases, and file systems. The choice of solution is primarily dictated by the use case and the underlying data type.
- Relational databases are evolving to address the need for structured Big Data storage and management.
- Non-relational databases have been developed to address the need for semi-structured and unstructured data storage and management.
- Hadoop HDFS is a widely used distributed file system designed for Big Data processing.

Big Data Security Scope
Big Data security should address four main requirements – perimeter security and authentication, authorization and access, data protection, and audit and reporting. Centralized administration and coordinated enforcement of security policies should be considered, with applicability across environments (production, research, development).
- Perimeter Security & Authentication: required for guarding access to the system, its data, and its services. Authentication makes sure the user is who he claims to be. Two levels of authentication need to be in place – perimeter and intra-cluster. (Kerberos; LDAP/Active Directory and BU security tool integration)
- Authorization & Access: required to manage access and control over data, resources, and services. Authorization can be enforced at varying levels of granularity and in compliance with existing enterprise security standards. (File and directory permissions, ACLs; role-based access controls)
- Data Protection: required to control unauthorized access to sensitive data, either at rest or in motion. Data protection should be considered at the field, file, and network level, and appropriate methods should be adopted. (Encryption at rest and in motion; data masking and tokenization)
- Audit & Reporting: required to maintain and report activity on the system. Auditing is necessary for managing security compliance and other requirements like security forensics. (Audit data; audit reporting)

Big Data Security, Access, Control and Encryption
Security, access control, and encryption should be integrated across the major components of the Big Data landscape, with a central security/access control UI providing:
- Ability to define roles
- Ability to add/remove users
- Ability to assign roles to users
- Ability to scale across platforms
Access is restricted at the granularity each component supports – folder and file, database and table, column, and column family (CF) – backed by LDAP/Active Directory, with restricted access and encrypted data throughout.
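The UI capabilities listed above (define roles, add/remove users, assign roles) reduce to a small role-based access control model. This is an illustrative sketch only – a real deployment would back these operations with LDAP/Active Directory and a platform tool rather than in-memory dictionaries, and all names here are made up.

```python
# Minimal role-based access control sketch: roles map to
# (resource, action) permissions, and users are granted roles.

class AccessControl:
    def __init__(self):
        self.roles = {}        # role -> set of (resource, action)
        self.assignments = {}  # user -> set of roles

    def define_role(self, role, permissions):
        self.roles[role] = set(permissions)

    def add_user(self, user):
        self.assignments.setdefault(user, set())

    def remove_user(self, user):
        self.assignments.pop(user, None)

    def assign_role(self, user, role):
        self.assignments.setdefault(user, set()).add(role)

    def is_allowed(self, user, resource, action):
        # A user is allowed if any of their roles grants the permission.
        return any((resource, action) in self.roles.get(role, set())
                   for role in self.assignments.get(user, set()))

ac = AccessControl()
ac.define_role("analyst", [("hive:claims.summary", "read")])
ac.add_user("alice")
ac.assign_role("alice", "analyst")
print(ac.is_allowed("alice", "hive:claims.summary", "read"))  # True
print(ac.is_allowed("alice", "hdfs:/raw/claims", "write"))    # False
```

Keeping the role definitions in one place is what lets the same model scale across platforms: the Hive, HDFS, and HBase enforcement points all consult the same role-to-permission mapping.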

Security, Access, Control and Encryption Details
Guidelines & Considerations

Encryption / Anonymization
- Data should be natively encrypted during ingestion into Hadoop, regardless of whether the data is loaded into HDFS, Hive, or HBase
- Encryption key management should be maintained at the Hadoop admin level, thereby properly maintaining the sanctity of the encryption

Levels of granularity in relation to data access and security
- HDFS: folder- and file-level access control
- Hive: table- and column-level access control
- HBase: table- and column-family-level access control

Security implementation protocol
- Security/data access controls should be maintained at the lowest level of detail within a Hadoop cluster
- The overhead of security/data access controls should be minimal on any CRUD operation

Manageability / scalability
- A GUI to create and maintain roles, users, etc., to enable security or data access for all areas of Hadoop
- Ability to export/share the information held in the GUI across other applications/platforms
- The same GUI interface or application should be able to scale across multiple platforms (Hadoop and non-Hadoop)

Key Terms Defined
In Hadoop, Kerberos currently provides two aspects of security:
- Authentication – this feature can be enabled by mapping UNIX-level Kerberos IDs to Hadoop. In a mature environment, Kerberos is linked/mapped to the organization's Active Directory or LDAP system. Maintenance of the mapping is typically complicated
- Authorization – the mapping done at the authentication level is leveraged for authorization, and users can be authorized to access data at the HDFS folder level

Point of View
Large organizations should adopt a more scalable solution with finer-grained access control and encryption/anonymization of data. Select a tool that is architecturally highly scalable and covers the following areas:
- Levels of granularity in relation to data access and security
- Security implementation protocol
- Manageability / scalability
- Encryption / Anonymization
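The anonymization guideline above can be illustrated with deterministic tokenization: sensitive fields are replaced at ingestion with keyed tokens, so tokenized columns can still be grouped and joined, but raw values never land in HDFS/Hive/HBase. This is a sketch under stated assumptions – the key name, record layout, and field names are all hypothetical, and key custody would sit with the Hadoop admin, as noted above.

```python
# Sketch of deterministic tokenization at ingestion time:
# same input + same secret key -> same token.
import hmac
import hashlib

SECRET_KEY = b"managed-by-hadoop-admin"  # placeholder; never hard-code in practice

def tokenize(value: str) -> str:
    # Keyed hash (HMAC-SHA256), truncated for readability.
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask_record(record: dict, sensitive_fields: set) -> dict:
    # Replace only the sensitive fields; leave the rest intact.
    return {k: tokenize(v) if k in sensitive_fields else v
            for k, v in record.items()}

record = {"policy_id": "P-1001", "ssn": "123-45-6789", "state": "CA"}
masked = mask_record(record, {"ssn"})
print(masked["state"])                    # CA (untouched)
print(masked["ssn"] != record["ssn"])     # True (raw value is gone)
```

Because tokenization is deterministic, two ingested records with the same SSN still produce the same token, so analytic joins keep working without exposing the raw value.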

Multi-Tenancy Details
Guidelines & Considerations
Multi-tenancy in Hadoop primarily needs to address two key things:
1. Resource sharing between multiple applications, making sure no application impacts the cluster because of heavy usage
2. Data security/auditing – users and applications of one application should not be able to access the HDFS data of other applications

- Vendors vary in their support for a POSIX file system: MapR provides it by default; IBM BigInsights and Pivotal provide POSIX-compliant add-on packages (General Parallel File System and Isilon, respectively)
- POSIX-compliant file systems simplify initial migration into the Hadoop distribution. The relative advantage over other Hadoop distributions decreases as the load approaches steady state
- Access Control Lists (ACLs) facilitate multi-tenancy by ensuring only certain groups and users can run jobs
- With the evolution of YARN, the capacity scheduler can be used to set a minimum guaranteed resource per application as a percentage of available RAM (a percentage of CPU will be possible in the future)
  - This is a more efficient way of sharing resources between different groups within an organization. Before YARN, resources in Hadoop were available only as a number of MapReduce slots, so although multi-tenancy was possible, it was not very efficient

Key Terms Defined
- A POSIX-compliant file system provides high availability of Hadoop by fully distributing name node data, efficiently removing the single point of failure. It enables HDFS federation
- MapR volume – a collection of directories that contains data for a single business unit, application, or user group. Policies may be applied to a volume to enforce security or ensure availability
- YARN decouples workload management from resource management, enabling multi-tenancy

Point of View
- Multi-tenancy begins with a multi-user-capable platform. Further, in a clustered computing environment, it involves multi-tenancy at a data (file system) level, a workload (jobs and tasks) level, and a systems (node and host) level
- The recent advancements of YARN and Hadoop 2.0 are quickly closing the feature gap between proprietary file systems and HDFS
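The minimum-guarantee mechanism described above is configured in YARN's capacity-scheduler.xml. The sketch below assumes two tenant queues whose names and percentages are purely illustrative; the property names follow the capacity scheduler's `yarn.scheduler.capacity.*` convention.

```xml
<!-- Illustrative capacity-scheduler.xml fragment: two tenant queues
     with guaranteed shares of cluster resources. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>analytics,ingest</value>
  </property>
  <property>
    <!-- Minimum guaranteed share for the analytics tenant -->
    <name>yarn.scheduler.capacity.root.analytics.capacity</name>
    <value>70</value>
  </property>
  <property>
    <!-- Minimum guaranteed share for the ingest tenant -->
    <name>yarn.scheduler.capacity.root.ingest.capacity</name>
    <value>30</value>
  </property>
  <property>
    <!-- Cap so one tenant cannot starve the other -->
    <name>yarn.scheduler.capacity.root.ingest.maximum-capacity</name>
    <value>50</value>
  </property>
</configuration>
```

The per-queue capacities must sum to 100 at each level of the queue hierarchy; the optional maximum-capacity allows elastic use of idle resources up to a ceiling.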

Big Data Security Options and Recommendations

Perimeter Security & Authentication
Background:
- Hadoop supports Kerberos as a third-party authentication mechanism
- Used for intra-service authentication (NameNode–DataNode, Task Tracker, Oozie, etc.) and end user–service authentication (Hue, file browser, CLI, etc.)
- Ticket-based authentication; operates within a realm (the Hadoop cluster) and across realms as well
- Users and services rely on a third-party Kerberos server to authenticate each other
- Offers one-way and two-way trust options
- Offers LDAP integration options
Recommendations:
- Kerberos is highly recommended as it supports authentication mechanisms throughout the cluster
- Use Apache Knox for perimeter authentication to Hive, HDFS, HBase, etc.
- Manage user groups in the POSIX layer (initially at least); resolving user groups from LDAP involves a corporate IT dependency
- Hadoop security implementation is not easy – take a phased approach:
  1. Configure Kerberos for Hadoop
  2. Provision the initial set of users/user groups in the POSIX layer
  3. Integrate LDAP/single sign-on (Kerberos, Hue, Ambari)
  4. Prepare the final list of users/user groups and provision them in the POSIX layer, matching LDAP principals
  5. Optionally configure the Knox gateway for perimeter authentication for external systems (with LDAP or SSO)

Authorization & Access
Background:
- HDFS permission bits
- HDFS ACLs
- YARN ACLs
- Access control for Hive/HBase with Apache Knox
Recommendations:
- Authorization control with Apache Knox for column/row access restrictions for users
- Optionally configure Accumulo if cell-level restrictions are required for HBase/Hive

Data Protection
Background:
- Hadoop supports data encryption in motion – HTTPS (web clients, REST), RPC encryption (API, Java frameworks), and the data transfer protocol (NameNode–DataNode)
- No options are available for data encryption at rest at the moment
Recommendations:
- Configure data-in-motion wire encryption
- Agree on data encryption algorithms for data that needs to be exported outside the lake within the Zurich network
- Optionally implement data-at-rest encryption

Audit & Reporting
Background:
- HDFS auditing – centrally available via Ranger (XA Secure)
- Perimeter access – available via Knox
- Job auditing – admin console, Job Tracker logs
Recommendations:
- Monitor your YARN capacity – key to enhancing multi-tenancy
- Set up a security compliance process
- Set up a user administration process/auditing
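The authorization layers above (permission bits plus ACLs) compose in a way worth making concrete: POSIX-style owner/group/other bits cover the common case, and named ACL entries extend access to additional users or groups. The sketch below is a simplified model for illustration, not HDFS's actual implementation, and the user, group, and path names are made up.

```python
# Simplified model of permission bits + ACL entries on an HDFS folder.

def allowed(user, groups, entry, action):
    """entry: dict with owner, group, perms ('rwx' strings), and acl,
    a list of (principal, perms) pairs for extra named users/groups."""
    # Named ACL entries grant access to principals beyond owner/group.
    for principal, perms in entry.get("acl", []):
        if principal == user or principal in groups:
            return action in perms
    if user == entry["owner"]:
        return action in entry["perms"]["owner"]
    if entry["group"] in groups:
        return action in entry["perms"]["group"]
    return action in entry["perms"]["other"]

folder = {
    "owner": "etl",
    "group": "dataeng",
    "perms": {"owner": "rwx", "group": "r-x", "other": "---"},
    "acl": [("analyst", "r--")],  # extra read-only grant for a named group
}

print(allowed("etl", {"dataeng"}, folder, "w"))    # True: owner can write
print(allowed("ana", {"analyst"}, folder, "r"))    # True: via ACL entry
print(allowed("bob", {"marketing"}, folder, "w"))  # False: falls to 'other'
```

This is why ACLs matter for multi-tenancy: without the named entry, granting analysts read access would require either loosening "other" bits for everyone or churning group ownership.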
