Best Practices for Hadoop: A Guide From SAS Customers


Contents

Introduction
Planning
   Tip 1: Know What Result You Want Before You Start
   Tip 2: Identify and Onboard Users
   Tip 3: Solidify Support and Engage Active Participants
Assessing and Mapping the Data
Supporting the Infrastructure
Educating and Training Users
Migrating Data
Converting Programs
Proving the Value
Moving to Production
Post-Implementation: Putting Strategies Into Practice
   Enabling Data for Use
   Securing the Environment
   Dealing With Users Who Are Resistant to Change
   Entering POV/Production Phase in a Multitenant Hadoop Environment
Conclusion
Learn More

Introduction

As Hadoop becomes commonplace, many organizations have come to learn what "the Hadoop skills gap" really means. Often, organizations don't have the resources or skills needed to effectively adopt and manage Hadoop. The market clearly understands the benefits of Hadoop – the struggle now occurs at a tactical level that involves how to go from initial planning to production.

While every organization is different, and their approaches to Hadoop vary, any organization can learn from the trials, tribulations and accomplishments of others. The journey to begin using Hadoop involves technical, procedural and process challenges. But those who plan well can minimize disruptions and realize significant gains.

Many organizations have used SAS to help ease the transition to Hadoop. This paper highlights best practices identified by SAS customers using Hadoop. It serves as a guide for others looking to make Hadoop part of their organization.

Planning

Tip 1: Know What Result You Want Before You Start

The first step toward success with Hadoop is knowing what objective your organization hopes to achieve by deploying it. Below are six common objectives SAS customers have identified, along with related considerations:

• The desire to use Hadoop to process data as fast as current storage mechanisms and processing techniques allow. A common proof-of-value (POV) challenge is to identify where more transformative processes need to be applied to both data structures and processing.

• The need to use Hadoop as warm storage. This includes migrating data from backup or external systems into a central store. A good example would be moving data from mainframe backups to Hadoop storage. The business driver here is data accessibility.

• The ability to keep all your data in one Hadoop environment. Doing this involves moving data from various sources into Hadoop and then using Hadoop as the source for data access.

• Being able to process against the data stored in Hadoop. If the data moves into Hadoop, then the data processing is expected to move as well. Some users have worked on transforming SAS code into scoring models and DS2 execution for in-database processing (a sketch of this follows the list).

• Reduced costs. Be sure to ask what overall cost savings you're expecting with Hadoop versus your current data storage and access methods. Many users find that bigger savings come from the results of using all the data stored in Hadoop.

• Desire to get ahead of competitors. There are disruptive advantages to the analytically useful information you can get out of Hadoop. Consider this benefit as you make your plans.
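For the in-database processing objective above, the following is a minimal sketch of DS2 scoring pushed into Hadoop, assuming SAS/ACCESS Interface to Hadoop and the SAS In-Database Code Accelerator (with the SAS Embedded Process installed on the cluster) are available. The libref, server, table and column names (hdp, transactions, amount, visits) and the scoring formula are placeholders, not examples from this paper.

   /* Hypothetical connection; server, port and schema are placeholders. */
   libname hdp hadoop server="hive.example.com" port=10000 schema=sales;

   proc ds2 ds2accel=yes;                      /* request in-Hadoop execution where supported  */
      thread score_th / overwrite=yes;
         dcl double score;
         method run();
            set hdp.transactions;              /* placeholder Hive table                       */
            score = 0.42*amount + 1.7*visits;  /* illustrative scoring logic, not a real model */
         end;
      endthread;

      data hdp.transactions_scored (overwrite=yes);
         dcl thread score_th t;
         method run();
            set from t;                        /* scored rows are written back into Hadoop     */
         end;
      enddata;
      run;
   quit;

Without the Code Accelerator, the same DS2 program still runs, but on the SAS server after the rows are extracted from Hadoop, which forfeits the in-database benefit described above.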

Tip 2: Identify and Onboard Users

To succeed with Hadoop, it's important to follow appropriate processes to identify members of the user community and give them access to the Hadoop environment. User identification and onboarding processes need to be in place ahead of time. Many organizations have found these practices helpful:

• Identify the data first. To identify the user, you must also identify the data he or she needs to access in Hadoop. This might include data stored in SAS data sets, relational database management systems (RDBMS) and other data storage locations. Rather than requiring users to provide details, SAS can help you do a programmatic assessment to identify data usage.

• Create a secure environment. Once you have identified the data that's needed, you must create a secure environment within the Hadoop ecosystem. There are some concerns about the security of data fields, data at rest and data extracted from Hadoop. It will save time in the long run if you have a detailed plan and a team in charge of implementation before onboarding users or data.

Tip 3: Solidify Support and Engage Active Participants

Many organizations fail with Hadoop due to undersupported or understaffed projects. SAS customers have identified several best practices that have helped them be more effective with these efforts:

• Executive sponsorship. You need both executive sponsorship and technical leadership. The technical lead should be at a high enough level in the organization to engage both IT and the business unit as contributors to the decision-making process.

• Time commitments. It's important to have time commitments from the user community that has a vested interest in your SAS and Hadoop implementation. These users should understand their data and know how SAS interacts with the data you intend to put into Hadoop.

• A vision. Set a vision, develop a plan, assign tasks and create a timeline based on your success criteria. You need to put some structure around your punch list to avoid inevitable scope creep.

Assessing and Mapping the Data

Data assessment and mapping might be one of the least funded and most problematic issues for Hadoop. Don't underestimate the importance of developing a comprehensive data strategy before using Hadoop. Below are some questions you should ask, and strategies you can follow, to help with the data identification and ingestion process:

• Do you plan to load SAS data into Hadoop? If so, consider the following example:

   o Is a 10-numeric-column, 500,000-row SAS data set considered "big" data? For Hadoop, the answer is no. Your SAS data set, which is roughly 40 MB, would be represented as a single data split in the Hadoop environment. (In other words, single-threaded processing occurs on one piece of data in Hadoop.) Given this fact, a SAS process running against this SAS data set would be the best performer. Note that Hadoop data splits are typically 128 MB and up, so your SAS data set in Hadoop should be many multiples of the data split size before you consider Hadoop for data storage and processing. (A rough sizing sketch follows this list.)

   o How do you plan on processing the SAS data you've loaded into Hadoop? If the process is read-only, evaluate the type of Hive table you have created. This would include column types, Hadoop storage format and access patterns.

• Are you planning to load data from a DBMS into Hadoop? If so, and if the DBMS uses a complex data model, consider how Hadoop is going to interact with that model. Otherwise, your ability to port and efficiently process using HiveQL might not work; in that case, you might need to convert the data model into one that can be processed in Hadoop. The way you map the needed data should mirror the processes you plan to run on it.

• Is data cleansing part of your assessment? If not, it should be. As you load data from external sources into Hadoop, consider adding cleansing operations as part of the process. With Hadoop, you will find it's much easier to cleanse on the way in than to try to change data after it's in place.

• How are you planning to refresh the data you're loading? Incremental refreshes might be difficult to implement given that Hadoop has yet to become fully ACID (atomicity, consistency, isolation and durability) compliant.

• How are you planning to access the data in Hadoop? Hadoop is at its best when data is processed in large chunks, not individual records. If you primarily need to process individual records, Hadoop is probably not the ideal platform to use.

• What type of storage format do you plan to use for your Hadoop data? Although it's widely used, text might not be the best option from a performance standpoint. If your organization plans on accessing the same data using components such as Hive and Impala at the same time, Apache ORC might be a more sensible choice. If Impala is your only data access tool, Parquet might provide the best performance.

• Do you plan to compress your data in Hadoop? Several compression options are available; evaluate the pros and cons of each based on your needs before you make your final decision. And remember that some storage formats like ORC already have built-in compression; knowing this might simplify your decision-making process.

• How are you planning to secure your data? By default, Hadoop is a nonsecure environment. Keep in mind that too much security can pose performance issues (this tends to be the case with Apache Knox, for example).

• Have you considered encryption zones? Do you plan on creating pockets of data for specific users or divisions within your organization?

• Do you plan to implement a data archival process to phase out old data while ingesting new data? Where will the old data be archived? How "old" is old?

• Have you thought about disaster recovery scenarios? Can your Hadoop data pool be rebuilt using other data sources? If not, do you have a backup/recovery strategy in place?
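As a companion to the sizing question earlier in this list, here is a rough back-of-the-envelope calculation in Base SAS. The 8-bytes-per-numeric-column figure and the 128 MB split size are common defaults used as assumptions here; compression and storage format will change the real numbers.

   /* Back-of-the-envelope check: how many HDFS splits would this data set span? */
   data _null_;
      num_cols = 10;                         /* numeric columns, 8 bytes each (assumed)  */
      rows     = 500000;                     /* observation count                        */
      split_mb = 128;                        /* typical HDFS block/split size in MB      */
      size_mb  = num_cols * 8 * rows / 1e6;  /* uncompressed size estimate, decimal MB   */
      splits   = max(1, ceil(size_mb / split_mb));
      put 'Estimated size (MB):   ' size_mb 8.1;
      put 'Estimated data splits: ' splits;
   run;

One split means one map task, so a data set this size gains nothing from Hadoop's parallelism, which is the point the sizing question above makes.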

Supporting the Infrastructure

With any Hadoop implementation, you'll need to consider how much investment will be needed for hardware, networking bandwidth and software. But SAS customers have found that having access to dedicated experts and administrators is also vital. Make sure you have the following experts in place to support the infrastructure:

• A SAS administrator – one who understands SAS system requirements for Hadoop, SAS metadata, SAS In-Database, SAS/ACCESS software, performance and tuning.

• A Hadoop administrator – one who understands Hadoop security, SQL, Hadoop cluster performance, tuning and monitoring.

• Network/security expertise – someone who can assist with user security concerns and configurations as a precursor to enabling security in Hadoop. For example, this expert could address Kerberos, interactions between users and Hadoop, security guidance for establishing Hadoop best practices, and help with Kerberos ticket generation or troubleshooting.

• Hardware/operating system expertise – someone who can help with UNIX or Linux issues, options, installation and patches to meet both SAS and Hadoop system requirements.

• A technical project manager – someone who can provide technical leadership for users. This should include securing resources from the above experts and providing user support. SAS customers have seen better results in cases where the project manager had a deep understanding of the data and processing goals.

Educating and Training Users

SAS customers have employed several different types of user training. Following are the top three:

• Functional training for experienced SAS programmers whose data is moving to Hadoop. It's important to provide this training at the right time. Our customers have done best when the training was followed closely by execution against the new environment. In cases where there was a large gap of time between the education process and execution, results were not as good.

• Best practices for users as part of the education process. These practices include SQL optimization, SAS execution strategies and coding efficiencies specific to certain user environments. SAS customers have obtained good results by injecting best practices into user executions against SAS and Hadoop environments.

• Peer-to-peer training during the knowledge transfer process. In this scenario, a group of power users experiments with implementations in Hadoop. These experiments then result in best practices and/or mentoring for other users in the same department or organization.

Migrating Data

The problem with data migration is not loading data into Hadoop; it's determining how you're going to manage the data after the fact. To overcome these issues, many organizations have developed a level of sophistication around data organization within their Hadoop environments. This includes having processes that help zone, stratify or layer the Hadoop data store to create an environment that assures data access. Migrating data into this environment is just the start of the journey to having accessible, usable data. Table 1 shows examples of how organizations have handled data ingestion issues with Hadoop.

Source data: SAS
   Requirement: Migrate to Hadoop.
   In Hadoop: Store in SAS Scalable Performance Data Engine format, SASHDAT, a Hive table or an HDFS file (TXT).
   Processing plan: Access this data from SAS, which requires all SAS metadata to be stored with the data. This would include formats, informats, labels and so on.

Source data: DBMS
   Requirement: Migrate to Hadoop and develop an update strategy to keep migrated data current. If the data requires transformation to be processed in Hadoop, then the procedure must be preserved. Note that transformation of tabular data into a Hadoop-consumable form might be required for complex data models.
   In Hadoop: Recognize that data is to be joined to other tables migrated from the DBMS, and consider storage size in Hadoop, along with resource requirements. Also consider Apache ORC storage type, data partitioning and other Hadoop constructs.
   Processing plan: Perform scoring or processing in Hadoop, with the potential final processing on a SAS server.

Source data: Stream
   Requirement: Capture web log or other raw data in Hadoop. Build a data processing and organization plan to ready the data for analytics. Maintain original data for auditability.
   In Hadoop: Zone the data so the data of record is preserved in a highly compressed form. As data is migrated to other zones in the cluster, it will be made available for user consumption.
   Processing plan: Preprocess data in Hadoop to cleanse, organize and prepare the data for analytics. The transformation processes are recorded and preserved for auditability. Once the data is placed in accessible zones, the user community can process against it.

Table 1. Requirements and processes to consider related to ingesting data into Hadoop.
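Building on the SAS row of Table 1, here is a minimal sketch of landing a SAS data set in a Hive table and then pushing a simple summary down to the cluster, assuming SAS/ACCESS Interface to Hadoop is licensed and configured. The librefs, server, schema, data set and column names (src, hdp, customer_master, region) are placeholders, not examples from the paper.

   /* Hypothetical connection and library names; adjust for your environment.  */
   libname src "/sasdata/marketing";          /* existing SAS library (placeholder)    */
   libname hdp hadoop server="hive.example.com" port=10000 schema=marketing;

   /* Land the SAS data set as a Hive table stored as ORC.                     */
   /* DBCREATE_TABLE_OPTS appends the STORED AS clause to the generated DDL.   */
   data hdp.customer_master (dbcreate_table_opts='stored as orc');
      set src.customer_master;
   run;

   /* Let eligible Base procedures generate HiveQL so the work runs in Hadoop. */
   options sqlgeneration=dbms;

   proc freq data=hdp.customer_master;
      tables region;                          /* hypothetical column                   */
   run;

Depending on how much SAS-specific metadata (formats, informats, labels) must travel with the data, the SPD Engine or SASHDAT storage listed in Table 1 may be a better fit than a plain Hive table. Whether a given procedure actually pushes its work into Hive also depends on the procedure, its options and the SAS release, so check the SAS log (for example, with the SASTRACE option) to confirm where processing ran.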

Converting Programs

Once the data is loaded, processed and organized in the Hadoop environment, it's time to convert SAS programs or processes to use it. The code conversion process could be as simple as using new LIBNAME statements or as complex as a code rewrite for in-database processing. The table below provides guidance during code conversion.

Requirement: Access the data in Hadoop from my SAS job.
   Action: Some data might not have been moved into Hadoop, so code conversion might first require a code review. After the analysis process, all SAS code components whose data has been moved to Hadoop must point to Hadoop. This is done by changing SAS LIBNAME statements, and potentially PROC SQL code as well.
   Potential pros: Time to working with Hadoop data is shortened.

Requirement: Modify or write SAS code to enable it to execute inside the Hadoop environment.
   Potential pros: Performance gain from in-database execution.

Requirement: Develop and execute a scoring model inside Hadoop.
   Potential pros: Minimal SAS code change. Quick data validation of the data migrated to Hadoop. Quick identification of performance issues. Reduced network impact outside of the Hadoop cluster. Reduced SAS storage required for job execution. Ability to score significantly larger sets of data without data extraction.

Requirement: Run SAS procedures in Hadoop.
   Action: Enable SAS in-database procedures with "options sqlgeneration=dbms;". This enables PROC FREQ, PROC REPORT, PROC SORT, PROC SUMMARY, PROC MEANS, PROC TABULATE and PROC TRANSPOSE to run advanced HiveQL in Hadoop.
   Potential pros: Improves performance in the listed Base SAS procedures.

Table 2. Program conversion considerations.

Proving the Value

SAS has worked on POVs that have specific planned activities, as well as some that are more dynamic. Consider the following ways to add value to your POV:

• Many organizations have dynamic POVs where the use cases have not been fully developed. They know they want to process data in Hadoop, but are not fully aware of what they want to do with the data in Hadoop afterward. Dealing with a mass of data and time constraints is also problematic. If this type of POV is required, consider specific SAS processes against generated data or identified user data as a case study. The dynamic POV can then shift away from specific user processing to POV processes. This POV could deliver on requirements on tight time schedules.

• Many SAS customers have planned POVs where data, programs, events and checkpoints have been defined. But even with careful planning, the data loading, data organization in Hadoop and tests have caused delays. Experience with planned POVs proves that it's best to test early. For example, once the data is loaded, the interaction with the data can be tested ahead of user programs. User programs can be reviewed and problems can be identified before running against Hadoop. The plan and timeline for many companies are very rigid, but the organized support and the POV for Hadoop need to be flexible.

So, what is the best approach to complete the POV and move on? Many SAS customers have found that investing technical resources in the design and process is critical. If you're considering moving straight from a POV to production, put the investment in the POV so you can develop processes that can be applied to production. We have observed that those who rush from POV to production often run into issues and time delays that should have been identified in the POV.

Moving to Production

A production environment can be the finish line or the start of the race. Many organizations struggle with production environments because they've missed some fundamental concepts. Below are some best practices that SAS customers have found helpful when they were ready to move to production:

• Consider the size of the production environment. For a multitenant Hadoop and SAS environment, how was the sizing done? What is the impact of the users on the Hadoop environment? Who is going to help identify and resolve these issues?

• Are the user onboarding and security processes complete? As different user groups transfer to the production system, the onboarding process must be production quality. The impact of the data and the processing requirements of new groups might require production upgrades.

• Have you planned for data updates? Specifically, how will data in a production environment be created and maintained? Do you have a plan for users' data ingestion needs and requirements?

• What is your disaster recovery plan? How is your data maintained at both on- and off-site locations, and what are your disaster recovery processes?
