DATAGUISE WHITE PAPER
SECURING HADOOP: DISCOVERING AND SECURING SENSITIVE DATA IN HADOOP DATA STORES


OVERVIEW
The rapid expansion of corporate data being transferred or collected and stored in Hadoop HDFS is creating a critical problem for Chief Information Security Officers, compliance professionals, and IT staff responsible for data management and security. Frequently, the people responsible for securing corporate data are not even aware that Hadoop has been installed and is in use within the company. Dataguise DG for Hadoop scans data stores, locates sensitive content, and then applies remedies such as data masking and encryption to ensure compliance with industry regulations, such as HIPAA and PCI, as well as internal corporate data governance policies.

BIG DATA EXPLOSION
Petabytes of data - structured, semi-structured and unstructured - are accumulating and propagating across your business. A good portion of this data comes from external sources and from customer interaction channels such as web sites, call center records, log files, Facebook, and Twitter. To mine these large volumes and varieties of data in a cost-efficient way, companies are adopting new technologies such as Hadoop.

TRADITIONAL DATA WAREHOUSES
What about traditional data warehouses? While they offer many advantages for decision support, traditional data warehouses are hugely expensive and require that the schema be decided well in advance, taking away the flexibility of deciding how to slice and dice data as new methods of analysis emerge. Because Hadoop can be set up and expanded rapidly using commodity hardware, and schema may be defined at the time of reading the data, it is becoming the new platform of choice for processing and analyzing big data.

"BUT WE DON'T HAVE HADOOP"
Chief Information Security Officers (CISOs), CIOs, and others involved in corporate information security will often say that their organization does not have any Hadoop clusters. They rely on processes in place to ensure that any software installed in the enterprise has gone through an extensive approval and procurement process before it is implemented. They are therefore often surprised to learn that Hadoop is already installed and running. This happens because Hadoop is a free download, available directly from the Apache website or from one of the leading distributors of Hadoop such as Cloudera, MapR, IBM (InfoSphere BigInsights), and Hortonworks. It is very simple for any number of employees to create a Hadoop installation and be up and running very quickly.
Even if Hadoop is only being used in a sandbox (isolated from the production systems) or in a test environment, corporate data being stored in Hadoop must still adhere to the same rigorous corporate standards in place for the data infrastructure, or the company risks the consequences of failing a compliance audit, or even worse - a data breach.

FINDING SENSITIVE DATA IN HADOOP
Sensitive data includes items such as taxpayer IDs, employee names, addresses, credit card numbers, and many more. Data theft prevention, sound governance practices, and the need to satisfy corporate data compliance requirements for industry regulations such as PCI, HIPAA, and PII make it imperative that organizations implement the necessary processes to identify and protect sensitive information. The first step in securing Hadoop is to search for and locate sensitive information, and determine the volume and types of data that are at risk. The challenge is that many search products are designed to work only with structured data using basic regular expressions. Scanning data in Hadoop requires a sophisticated discovery tool that can scan large volumes of both structured and unstructured data, and do it rapidly.

THE NEED TO PROTECT SENSITIVE DATA
Once it has been determined that Hadoop is in the corporate infrastructure and contains sensitive information, CIOs and CISOs should be very nervous about potential exposure. Because Hadoop has so far been used mostly by social media companies, and is only now being adopted by financial, health-care, and other security-conscious enterprises, options for data protection in Hadoop have been limited. Whereas there are numerous options for legacy databases and structured data stores, Hadoop poses a new challenge for companies that need to maintain compliance. The same type of "strong" protection in use for traditional data stores is needed for the Hadoop environment as well.

"It doesn't take a clairvoyant - or in this case, a research analyst - to see that 'big data' is becoming (if it isn't already, perhaps) a major buzzword in security circles. Much of the securing of big data will need to be handled by thoroughly understanding the data and its usage patterns. Having the ability to identify, control access to, and - where possible - mask sensitive data in big data environments based on policy is an important part of the overall approach."
- Ramon Krikken, Research VP, Security and Risk Management Strategies Analyst, Gartner

CHOOSE MASKING OR ENCRYPTION
Whether sensitive data was stored in Hadoop intentionally or unintentionally, once it is discovered and documented there are two main approaches to remediation: encryption and masking. Encryption is typically used when access to the sensitive content is needed for analytical purposes; the encrypted data can be decrypted by an authorized user at the time of use. Masking is used to protect sensitive data when there is no need for the actual sensitive content, as masking replaces sensitive data with realistic (but not real) data. Optionally, consistency may be maintained to retain the statistical distribution of the data. Although there are some similarities between data masking and encryption, they differ in usage, technology, and deployment strategies. Encryption can conceal private data and decrypt it based on encryption keys. Data masking, on the other hand, conceals private data, but masked values cannot be reversed.

THE DATAGUISE SOLUTION
Dataguise specializes in sensitive data protection in large repositories. We began with relational databases, expanded to shared file systems

and Microsoft SharePoint, and now we are bringing our enterprise-class expertise to secure Hadoop. Bringing together experienced technology professionals from database, security, and search specialties, we combine the best of these disciplines to secure Hadoop in the enterprise. Dataguise's core product for Hadoop - DG for Hadoop - combines sensitive data discovery, user and event reports, and options for both encryption and masking to provide the most comprehensive data security solution in the market today.

INTRODUCING DGSECURE
The purpose of DG for Hadoop is simple yet crucial - to detect and protect sensitive data in Hadoop implementations. As part of the Dataguise DgSecure product line, DG for Hadoop is the ideal solution to help ensure that compliance standards are met while reaping the benefits of using Hadoop to manage large amounts of structured and unstructured data. To accommodate various usage patterns, DG for Hadoop supports detection and protection at the source before moving data to Hadoop, in flight while moving data to Hadoop, and at rest after moving unprotected data to HDFS. Just-in-time protection is provided through incremental scans of newly added data in HDFS. Once sensitive data is located and identified, either masking or encryption can be chosen as a protection method, based on the specific requirements of the organization or the purpose of storing and managing the data.

HOW IT WORKS
DG for Hadoop gives the Chief Security Officer and other entities responsible for conforming to industry regulations and corporate Governance, Risk, and Compliance (GRC) the ability to define policies. These policies define what data is considered sensitive, based on a combination of pre-built data types and custom data types that the user can add. The policies also allow the sensitive data types to be grouped in alignment with regulations, and allow remedial actions to be specified. This provides guidance for those handling the data on what to do.
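For illustration, a policy of this shape can be sketched in a few lines. The structure below is purely hypothetical - the field names and the `actions_for` helper are assumptions for this sketch, not the actual DG for Hadoop schema - but it shows how pre-built and user-added custom data types might be grouped under a regulation, each tied to a remedial action.

```python
# Hypothetical policy definition: groups sensitive data types under a
# regulation (here PCI) and attaches a remedial action to each type.
# All names and fields are illustrative, not the product's real schema.
PCI_POLICY = {
    "name": "PCI-DSS",
    "data_types": [
        # Pre-built data type: credit card numbers, masked on discovery.
        {"type": "credit_card", "builtin": True, "action": "mask"},
        # Custom data type added by the user: an internal account-ID format.
        {"type": "acct_id", "builtin": False,
         "pattern": r"\bACCT-\d{8}\b", "action": "encrypt"},
    ],
}

def actions_for(policy):
    """Map each sensitive data type in a policy to its remedial action."""
    return {dt["type"]: dt["action"] for dt in policy["data_types"]}
```

Keeping the action alongside the data type is what lets a scan both find sensitive content and know, without human intervention, what to do about it.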
All of the details of the data repositories and the actions taken are fully logged, so, on the back end of this process, the Chief Security Officer and others can track risk profiles through the dashboard, which also provides actionable details, and can use reports to audit actions to ensure that the right people have the right access to the right data.

DEFINE POLICY
The first step is to search for any sensitive data in Hadoop data stores located on premises or in the cloud. A user operating under the corporate policy guidance (PCI, PII, HIPAA, etc.) creates a task definition against one or more files and directories, or combinations of them, and executes the job. DG for Hadoop scans all the targeted data stores to find data that meet those criteria and takes appropriate, task-specified remediation actions.

Additional search features:
- Custom data types - add custom expressions to augment the built-in capabilities
- Columnar searches - to allow for searches of structured data
- Incremental scans - to quickly search only the new data that came in since the last scan
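To make the discovery step concrete, here is a minimal sketch of what separates a sophisticated scanner from a plain regular-expression search: a validity check layered on top of pattern matching to cut false positives. The data types and patterns are assumptions chosen for illustration; a real discovery tool ships far more of them and scans at much larger scale.

```python
import re

# Illustrative patterns for two sensitive data types (not the product's
# built-in definitions).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def luhn_ok(number: str) -> bool:
    """Luhn checksum: filters out digit runs that merely look like
    card numbers, reducing false positives from a bare regex."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0

def scan(text: str):
    """Return (data_type, match) pairs found in a chunk of text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            if name == "credit_card" and not luhn_ok(m.group()):
                continue  # looked like a card number but failed Luhn
            hits.append((name, m.group()))
    return hits
```

For example, `scan("SSN 123-45-6789, card 4111 1111 1111 1111")` reports both values, while a 16-digit string that fails the Luhn check is ignored. An incremental scan would simply run `scan` only over files added since the last pass.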

ANALYZE
After collecting the information about sensitive data, DG for Hadoop delivers users risk assessment analytics in the form of easy-to-interpret graphical summaries. Users can then evaluate their compliance exposure profiles and decide on the most appropriate remediation policies to implement.

REMEDIATION
DG for Hadoop provides three main options for remediation:
1) Notification (search only) - whenever new data has been ingested into Hadoop, DG for Hadoop processes the content and informs the designated users of the presence of sensitive data
2) Search and mask - as part of locating sensitive data, masking can also be executed based on a predefined policy - in flight as data gets into Hadoop, or in Hadoop once the data is there
3) Search and encrypt - as part of locating sensitive data, DG for Hadoop can optionally encrypt the entire row or just the specific fields - in flight or in Hadoop HDFS

DASHBOARD AND REPORTING
DG for Hadoop provides top-level summary (directory) and in-depth (file-level) detailed information about where sensitive content resides and the remediation method(s) applied, highlighting the gaps in protection and providing actionable data for appropriate follow-ups.

DG FOR HADOOP - SOLUTION FOR ALL USAGE PATTERNS
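The masking option above hinges on the "consistent masking" idea mentioned earlier: the same real value always maps to the same realistic-looking fake value, so joins and statistical distributions survive, yet (unlike encryption, which uses a managed key so authorized users can decrypt) the mapping cannot be reversed. A minimal sketch of that idea, using a keyed hash - the key name and function are assumptions, and this is not Dataguise's actual masking algorithm:

```python
import hmac
import hashlib

MASKING_KEY = b"example-masking-key"  # illustrative secret, not a real key

def mask_card(card: str) -> str:
    """Consistent, irreversible masking of a card number: identical
    inputs always yield the identical fake value, but there is no key
    that turns the masked value back into the original."""
    digest = hmac.new(MASKING_KEY, card.encode(), hashlib.sha256).hexdigest()
    # Derive 16 fake digits from the digest; force a Visa-style '4'
    # lead digit so the masked value still looks realistic.
    digits = "4" + "".join(str(int(c, 16) % 10) for c in digest[:15])
    return " ".join(digits[i:i + 4] for i in range(0, 16, 4))
```

Because the output depends only on the secret and the input, every scan pass masks a given card number the same way; to get reversibility instead, the remediation policy would select encryption for that data type.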

BENEFITS
DG for Hadoop provides a unique and important solution for enabling data security in Hadoop. Tangible benefits include the ability to conform to regulatory requirements, avoid the risk of failing a compliance audit, and ensure that valuable corporate data is safe from security breaches. Implementing DG for Hadoop enables organizations to:

Simplify Data Compliance Management in Hadoop
Eliminates the need to build custom applications or patch together disparate tools to search for and protect sensitive data.

Improve Operational Efficiencies
Less staff time is required to administer custom Hadoop security controls and custom reports, or to move data to databases or other data stores for remediation.

Reduce Regulatory Compliance Costs
One tool can now take care of tasks that previously took multiple software products and costly consulting hours to achieve.

Automate Compliance Assessment and Enforcement
The DG for Hadoop dashboard and reports summarize sensitive data content with actionable details of exposure risks.

CONCLUSION
Protecting sensitive data in Hadoop is critical as volumes of data continue to expand in the enterprise and Hadoop continues to become the technology of choice at an increasing rate. An effective data protection strategy must start with finding all of the sensitive data, putting in place the proper remediation policies, and monitoring data flow to ensure that the established procedures are followed. DG for Hadoop is the leading solution for securing Hadoop effectively and quickly, and for ensuring adherence to sound data governance practices across the entire big data environment.

ABOUT DATAGUISE
Dataguise helps organizations safely leverage their enterprise data with a comprehensive risk-based data protection solution. By automatically locating sensitive data, transparently protecting it with high-performance masking, encryption, or quarantine, and providing enterprise security intelligence to managers responsible for regulatory compliance and governance, Dataguise improves data risk management and operational efficiencies while reducing regulatory compliance costs.

For more information, call 510-824-1036 or visit www.dataguise.com. Dataguise, Inc. 2012. All rights reserved.
