Using Hadoop: Best Practices

Transcription

Using Hadoop: Best Practices
Casey Stella
March 14, 2012

Table of Contents
- Introduction
- Background
- Using Hadoop Professionally
  - Indexing
  - Performance
- Staying Sane
  - Testing
  - Debugging
- State of Big Data and Hadoop
- Conclusion

Introduction
- Hi, I'm Casey
  - I work at Explorys
  - I work with Hadoop and the Hadoop ecosystem daily
- I'm going to talk about some of the best practices that I've seen
  - Some of these are common knowledge
  - Some of these don't show up until you've been up 'til 3 AM debugging a problem.
- These are my opinions and not necessarily the opinions of my employer.

The Lay of the Land – The Bad
- There are two APIs; prefer the mapred package
  - The mapreduce and the mapred packages
  - mapred is deprecated, but still preferred
  - Hortonworks just kind of screwed up
- The Pipes interface is really poorly implemented and very slow
- HDFS currently has a single point of failure

The Lay of the Land – The Good
- Hortonworks is actively working on Map-Reduce v2
  - This means other distributed computing models
  - Included in 0.23
- HDFS is dramatically faster in 0.23
  - Socket communication is made more efficient
  - Smarter checksumming

Indexing
- Hadoop is a batch processing system, but you need realtime access
- Options are:
  - Roll your own (Jimmy Lin talks about how one might serve up inverted indices in Chapter 3)
  - Use an open source indexing infrastructure, like Katta
  - Serve them directly from HDFS with an on-disk index, aka Hadoop MapFiles
  - Serve them through HBase or Cassandra
  - If data permits, push them to a database
- Katta can serve up both Lucene indices and MapFiles
- Indexing is hard; be careful.

Performance Considerations
- Setup and teardown cost, so keep the HDFS block size large
- Mappers, Reducers and Combiners have memory constraints
- Transmission costs dearly:
  - Use Snappy, LZO, or (soon) LZ4 compression at every phase
  - Serialize your objects tightly (e.g. not using Java Serialization)
  - Key/values emitted from the map phase had better be linear with a small constant, preferably below 1
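The "serialize your objects tightly" point is easy to see in plain Java, with no Hadoop dependency. The sketch below compares Java Serialization against a hand-rolled field-by-field encoding for the same record; the `Record` class and both helper methods are hypothetical names for illustration, not anything from the Hadoop API.

```java
import java.io.*;

// Hypothetical record: an id and a count, as a mapper might emit.
class Record implements Serializable {
    int id;
    long count;
    Record(int id, long count) { this.id = id; this.count = count; }
}

public class TightSerialization {
    // Java Serialization: carries stream headers and class metadata
    // in addition to the field values.
    static byte[] javaSerialize(Record r) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
                oos.writeObject(r);
            }
            return buf.toByteArray();
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    // Hand-rolled encoding: just the two fields, 4 + 8 = 12 bytes.
    static byte[] tightSerialize(Record r) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream dos = new DataOutputStream(buf)) {
                dos.writeInt(r.id);
                dos.writeLong(r.count);
            }
            return buf.toByteArray();
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        Record r = new Record(42, 1000L);
        System.out.println("Java Serialization: " + javaSerialize(r).length + " bytes");
        System.out.println("Tight encoding:     " + tightSerialize(r).length + " bytes");
    }
}
```

When every map output record crosses the network during the shuffle, that per-record overhead is exactly the "transmission costs dearly" problem above.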

Performance Considerations (continued)
- Strategies:
  - Intelligent use of the combiners
  - Use local aggregation in the mapper to emit a more complex value (you already know this)
  - Ensure that all components of your keys are necessary in the sorting logic. If any are not, push them into the value.
- Profile with JobConf.setProfileEnabled(boolean)
- Use Hadoop Vaidya (…/docs/current/vaidya.html in the Hadoop documentation)
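The local-aggregation strategy above can be sketched without Hadoop: instead of emitting (word, 1) for every token, a mapper buffers partial counts in a map-side table and emits each pair once. This is a minimal plain-Java sketch of the idea, not a Hadoop Mapper; the class and method names are hypothetical.

```java
import java.util.*;

// Sketch of in-mapper combining for word count: accumulate counts
// locally and emit once per distinct word, rather than once per token.
public class LocalAggregation {
    public static Map<String, Long> mapWithLocalAggregation(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1L, Long::sum);  // local partial sum
                }
            }
        }
        // In a real mapper, these pairs would be emitted during cleanup/close,
        // shrinking the map output from one record per token to one per word.
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b a");
        System.out.println(mapWithLocalAggregation(lines));
    }
}
```

The trade-off is the memory constraint noted earlier: the local table must fit in the mapper's heap, so real implementations flush it when it grows past a bound.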

Unit/Integration Testing Methodologies
- First off, do it.
- Unit test individual mappers, reducers, combiners and partitioners
  - Actual unit tests. This will help debugging, I promise.
  - Design components so that dependencies can be injected via polymorphism when testing
- Minimally verify that keys:
  - Can be serialized and deserialized
  - hashCode() is sensible (remember: the hashCode() for an enum is not stable across different JVM instances)
  - compareTo() is reflexive, symmetric and jives with equals()
- Integration test via single user mode Hadoop
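The key checks above (round-trip serialization, compareTo/equals agreement, a value-based hashCode) can be unit-tested in plain Java. The sketch below uses a hypothetical composite key shaped like a Hadoop WritableComparable but with no Hadoop dependency, so the contract checks run standalone; all names here are illustrative.

```java
import java.io.*;

// Hypothetical composite key for a (group, timestamp) sort order.
class CompositeKey implements Comparable<CompositeKey> {
    String group;
    long timestamp;

    CompositeKey(String group, long timestamp) {
        this.group = group;
        this.timestamp = timestamp;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(group);
        out.writeLong(timestamp);
    }

    static CompositeKey read(DataInput in) throws IOException {
        return new CompositeKey(in.readUTF(), in.readLong());
    }

    @Override public int compareTo(CompositeKey o) {
        int c = group.compareTo(o.group);
        return c != 0 ? c : Long.compare(timestamp, o.timestamp);
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof CompositeKey)) return false;
        CompositeKey k = (CompositeKey) o;
        return group.equals(k.group) && timestamp == k.timestamp;
    }

    // Stable across JVMs: derived from field values, not Object identity
    // (which is what makes raw enum hashCode() unsafe as a key).
    @Override public int hashCode() {
        return group.hashCode() * 31 + Long.hashCode(timestamp);
    }
}

public class KeyContractTest {
    static void check(boolean ok, String msg) {
        if (!ok) throw new AssertionError(msg);
    }

    public static void main(String[] args) throws IOException {
        CompositeKey k = new CompositeKey("patients", 1331683200L);

        // Round-trip: serialize then deserialize yields an equal key.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        k.write(new DataOutputStream(buf));
        CompositeKey back = CompositeKey.read(
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        check(k.equals(back) && k.hashCode() == back.hashCode(), "round-trip");

        // compareTo jives with equals(): equal keys compare as 0, and ordering holds.
        check(k.compareTo(back) == 0, "compareTo vs equals");
        check(k.compareTo(new CompositeKey("patients", 1L)) > 0, "ordering");
        System.out.println("key contract checks passed");
    }
}
```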

Quality Assurance Testing
- The output of processing large amounts of data is often large
- Verify statistical properties:
  - If statistical tests fit within Map Reduce, then use MR
  - If not, then sample the dataset with MR and verify with R, Python or whatever.
  - Do outlier analysis and thresholding-based QA
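A minimal form of the thresholding-based QA above: compute the mean and standard deviation of some sampled job metric and flag values more than k standard deviations out. This is a plain-Java sketch under assumed inputs (the metric name and data are made up for illustration), not a prescription for which statistic to use.

```java
import java.util.*;

public class ThresholdQA {
    // Flag values more than k standard deviations from the sample mean.
    public static List<Double> outliers(double[] sample, double k) {
        double mean = 0;
        for (double v : sample) mean += v;
        mean /= sample.length;

        double var = 0;
        for (double v : sample) var += (v - mean) * (v - mean);
        double stddev = Math.sqrt(var / sample.length);

        List<Double> flagged = new ArrayList<>();
        for (double v : sample) {
            if (Math.abs(v - mean) > k * stddev) flagged.add(v);
        }
        return flagged;
    }

    public static void main(String[] args) {
        // Hypothetical metric: record counts per output partition.
        double[] recordsPerPartition = {100, 98, 103, 101, 99, 500};
        System.out.println(outliers(recordsPerPartition, 2.0)); // [500.0]
    }
}
```

In practice the sample itself would come out of an MR job, and the statistical check runs on the (small) sample in R, Python, or a snippet like this.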

Debugging Methodologies
- Better to catch it at the unit test level
- If you can't, I suggest the following technique:
  - Run an investigatory map reduce job to find the data causing the issue: a single point if you're lucky; if not, a random sample using reservoir sampling
  - Take the data and integrate it into a unit test.
- DO NOT:
  - Use print statements to debug unless you're sure of the scope.
  - Use counters where the group or name count grows more than a fixed amount.
- DO:
  - Use a single counter in the actual job if the job doesn't finish
  - Use a map reduce job that outputs suspect input data into HDFS
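Reservoir sampling, mentioned above for pulling a debuggable random subset of suspect records, is small enough to sketch directly. This is the classic Algorithm R in plain Java; the names and the use of a seeded Random are illustrative, not part of any Hadoop API.

```java
import java.util.*;

public class ReservoirSample {
    // Algorithm R: keep a uniform random sample of k items from a stream
    // whose total length is not known in advance (e.g. mapper input).
    public static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);          // fill the reservoir first
            } else {
                // Replace a current member with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) reservoir.set((int) j, item);
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> suspectRecords = new ArrayList<>();
        for (int i = 0; i < 1000; i++) suspectRecords.add(i);
        System.out.println(sample(suspectRecords.iterator(), 5, new Random(12)));
    }
}
```

In the investigatory job, each mapper would keep such a reservoir of records that trigger the bug and emit it at the end; the sampled records then become unit-test fixtures.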

Hadoop Opinions
- "We're about a year behind Google" – Doug Cutting, Hadoop World
- Giraph and Mahout are just not there yet
- HBase is getting there (Facebook is dragging HBase into being serious)
- Zookeeper is the real deal
- Cassandra is cool, but eventual consistency is too hard to seriously consider.

Big Data
- We kind of went overboard w.r.t. Map Reduce:
  - Easier than MPI, but really not as flexible.
  - Bringing distributed computing to the masses. Meh, maybe the masses don't need it.
  - M.R. v2 opens up a broader horizon
- Data analysis is hard and often requires specialized skills
  - Enter a new breed: the data scientist
  - Stats + Computer Science + Domain knowledge
  - Often not a software engineer

Conclusion
- Thanks for your attention
- Follow me on twitter: @casey_stella
- Find me at …
- P.S. If you dig this stuff, come work with me.
