Kappa Architecture: Our Experience

Kappa Architecture: Our Experience
December 2010

Who am I
- CDO at ASPgems
- Former President of Hispalinux (Spanish LUG)
- Author of “La Pastilla Roja”, the first Spanish book about Free Software

Menu
- A little context about Kappa Architecture
- What Kappa Architecture is
- What Kappa Architecture is not
- How we implement it
- Real use cases with KA

A little context
On July 2, 2014, Jay Kreps coined the term Kappa Architecture in an article for O’Reilly Radar.

Who is Jay Kreps
Jay has been involved in lots of projects.
Author of the essay “The Log: What every software engineer should know about real-time data's unifying abstraction”
uld-know-about-real-time-datas-unifying

Jay Kreps
Author of the book “I Heart Logs”

Jay Kreps
Involved with projects such as:
- Apache Kafka
- Apache Samza
- Voldemort
- Azkaban
Ex-LinkedIn
Now co-founder and CEO of Confluent

Lambda Architecture
[Figure: what a Lambda Architecture looks like]

Lambda Architecture
The batch layer provides the following functionality:
- managing the master dataset, an immutable, append-only set of raw data;
- pre-computing arbitrary query functions, called batch views.
-architecture

Lambda Architecture
Serving layer:
This layer indexes the batch views so that they can be queried ad hoc with low latency.
Speed layer:
This layer accommodates all requests that are subject to low-latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only.

Lambda Architecture
- Batch layer datasets can be stored in a distributed filesystem, while MapReduce can be used to create batch views that are fed to the serving layer.
- The serving layer can be implemented using NoSQL technologies such as HBase, Apache Druid, etc.
- Querying can be implemented by technologies such as Apache Drill or Impala.
- The speed layer can be realized with data streaming technologies such as Apache Storm or Spark
bda-architecture
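
To make the three layers concrete, here is a minimal in-memory sketch in Python. It is not how MapReduce, HBase or Storm are actually programmed; the event shape, the page-view counting and all function names are assumptions chosen only to show that the batch path and the speed path are separate pieces of code whose results are merged at query time.

```python
from collections import Counter

# Hypothetical page-view events: (timestamp, page). Illustrative data only.
master_dataset = [(1, "home"), (2, "home"), (3, "about")]  # immutable, append-only raw data
recent_events = [(4, "home"), (5, "pricing")]              # not yet covered by a batch run


def build_batch_view(events):
    """Batch layer: recompute the view from the whole master dataset."""
    return Counter(page for _, page in events)


def update_realtime_view(view, event):
    """Speed layer: incremental update for one recent event."""
    _, page = event
    view[page] += 1
    return view


def query(page, batch_view, realtime_view):
    """Serving side: merge the batch view and the real-time view."""
    return batch_view[page] + realtime_view[page]


batch_view = build_batch_view(master_dataset)
realtime_view = Counter()
for event in recent_events:
    update_realtime_view(realtime_view, event)

print(query("home", batch_view, realtime_view))
```

Note that the same counting logic is written twice, once as a full recomputation and once as an incremental update. In a real deployment those two versions live in different frameworks.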

Pros of Lambda Architecture
- It retains the input data unchanged.
- It encourages modeling data transformations as a series of data states derived from the original input.
- It takes into account the problem of reprocessing data. This happens all the time: the code will change, and you will need to reprocess all the information. There are lots of reasons for this, and you will need to live with it.

Cons of Lambda Architecture
- Maintaining code that needs to produce the same result in two complex distributed systems is painful.
- The code for MapReduce is very different from the code for Storm/Apache Spark.
- It is not only about different code; it is also about debugging and interaction with other products (Hive, Oozie, Cascading, etc.).
- In the end, it is a problem of different and diverging programming paradigms.

So what is Kappa Architecture
Jay Kreps's proposal is simple:
- Use Kafka (or another system) that lets you retain the full log of the data you need to reprocess.
- When you want to reprocess, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct its output to a new output table.

So what is Kappa Architecture, part II
- When the second job has caught up, switch the application to read from the new table.
- Stop the old version of the job, and delete the old output table.
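
The reprocessing procedure can be sketched in plain Python. This is a simulation under stated assumptions, not Kafka or Spark code: a list stands in for the retained log, dictionaries stand in for the output tables, and the event fields, job logic and table names (`output_v1`, `output_v2`) are hypothetical.

```python
# The retained log: an append-only list standing in for a Kafka topic.
log = [
    {"user": "ana", "amount": 10},
    {"user": "bob", "amount": 5},
    {"user": "ana", "amount": 7},
]


def job_v1(event, table):
    """Old version of the job: total amount per user."""
    table[event["user"]] = table.get(event["user"], 0) + event["amount"]


def job_v2(event, table):
    """New version of the job: (event count, total amount) per user."""
    count, total = table.get(event["user"], (0, 0))
    table[event["user"]] = (count + 1, total + event["amount"])


def run_job(job, log, table, start_offset=0):
    """Process the log from start_offset and return the next offset to read."""
    for event in log[start_offset:]:
        job(event, table)
    return len(log)


tables = {"output_v1": {}}
serving = "output_v1"  # the table the application currently reads
offset_v1 = run_job(job_v1, log, tables["output_v1"])

# The code changed: start a second instance from the beginning of the log,
# writing to a new output table.
tables["output_v2"] = {}
offset_v2 = run_job(job_v2, log, tables["output_v2"], start_offset=0)

# New data keeps arriving while both versions are running.
log.append({"user": "bob", "amount": 3})
offset_v1 = run_job(job_v1, log, tables["output_v1"], offset_v1)
offset_v2 = run_job(job_v2, log, tables["output_v2"], offset_v2)

# The second job has caught up: switch the application, then drop the old table.
if offset_v2 == len(log):
    serving = "output_v2"
    del tables["output_v1"]

print(serving, tables[serving])
```

The old table is only deleted after the switch, so until that point the application can keep reading the old results.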

So what is Kappa Architecture
This architecture looks something like this:
[Figure: diagram of the Kappa Architecture]

So what is Kappa Architecture
- The first benefit is that you need to reprocess only when you change the code.
- You can check whether the new version works correctly and, if not, revert to the old output table.
- You can mirror a Kafka topic to HDFS, so you are not limited by the Kafka retention configuration.
- You have only one codebase to maintain, on a single framework.
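
The check-and-revert step might look like the following Python sketch. The acceptance rule and the table contents are hypothetical; a real project would define its own validation. The point is only that the old output table stays available, so reverting means continuing to read from it.

```python
def choose_output_table(old_table, new_table, is_valid):
    """Return which table the application should read: "new" if the
    reprocessed output passes the check, otherwise "old" (the rollback)."""
    return "new" if is_valid(old_table, new_table) else "old"


def same_users_and_no_negative_totals(old_table, new_table):
    """Hypothetical acceptance check: same keys, no negative totals."""
    return set(old_table) == set(new_table) and all(
        total >= 0 for total in new_table.values()
    )


old_output = {"ana": 17, "bob": 8}
good_new_output = {"ana": 17.0, "bob": 8.0}
bad_new_output = {"ana": 17.0}  # a bug in the new code dropped a user

good_choice = choose_output_table(
    old_output, good_new_output, same_users_and_no_negative_totals
)
bad_choice = choose_output_table(
    old_output, bad_new_output, same_users_and_no_negative_totals
)
print(good_choice, bad_choice)
```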

So what is Kappa Architecture
The real advantage is not about efficiency at all (for example, you will need extra temporary storage while reprocessing). It is about allowing your team to develop, test, debug, and operate their systems on top of a single processing framework.

What is not Kappa Architecture
- It is not a silver bullet that solves every Big Data problem.
- It is not a prescribed list of technologies. You can implement it with your favorite frameworks.
- It is not a rigid set of rules, but it helps to keep complex projects simple.

How we use Kappa Architecture
We start working on projects with a complex structure, like the one LinkedIn had at an early stage. That is very common.

How we use Kappa Architecture
We try to refactor the data flows to fit a Kappa Architecture.

How we use Kappa Architecture
- We use Kafka as the stream data platform.
- Instead of Samza, we feel more comfortable with Spark Streaming.
- At ASPgems we chose Apache Spark as our analytics engine, and not only for Spark Streaming.

How we use Kappa Architecture
- In the end, Kappa Architecture is a design pattern for us.
- We use/clone this pattern in almost all our projects.
- We have projects of every size, data volume, and speed requirement, and they all fit the Kappa Architecture.

Use Cases

Telefónica - MSS
- We use KA to calculate near-real-time KPIs and SLAs related to the managed security system.
- We simplified the flow of the input data.
- Kafka is the streaming data platform.
- As MPP we use Cassandra.
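
As an illustration of what a near-real-time KPI over a stream can look like, here is a small Python sketch that counts high-severity events per fixed time window. It is not the Telefónica implementation: the event shape `(timestamp_seconds, severity)`, the 60-second window and the KPI itself are assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size


def window_start(timestamp):
    """Map a timestamp to the start of its fixed (tumbling) window."""
    return timestamp - (timestamp % WINDOW_SECONDS)


def high_severity_per_window(events):
    """Count high-severity events per window.

    events: iterable of (timestamp_seconds, severity) tuples (hypothetical shape).
    """
    counts = defaultdict(int)
    for timestamp, severity in events:
        if severity == "high":
            counts[window_start(timestamp)] += 1
    return dict(counts)


events = [(5, "high"), (30, "low"), (59, "high"), (61, "high"), (125, "low")]
kpis = high_severity_per_window(events)
print(kpis)
```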

IoT - OBD II
- One of our clients installs on-board devices in its customers' cars.
- We implemented an API to get all the information in real time and inject it into Kafka.
- The business rules are implemented in a CEP running on Apache Spark Streaming.
- As MPP we use Elasticsearch.
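
A CEP-style business rule over device readings can be sketched in plain Python. The rule below (an alert when a vehicle reports several consecutive readings above an RPM limit) is invented for illustration, as are the thresholds and the `(vehicle_id, rpm)` reading shape; it does not describe the client's actual rules or the Spark Streaming API.

```python
from collections import defaultdict

RPM_LIMIT = 4000          # hypothetical threshold
CONSECUTIVE_READINGS = 3  # hypothetical number of readings that triggers an alert


def detect_over_revving(readings):
    """Return vehicle ids that reach CONSECUTIVE_READINGS readings in a row
    above RPM_LIMIT.

    readings: iterable of (vehicle_id, rpm) tuples in arrival order.
    """
    streak = defaultdict(int)  # per-vehicle count of consecutive high readings
    alerts = []
    for vehicle_id, rpm in readings:
        if rpm > RPM_LIMIT:
            streak[vehicle_id] += 1
            if streak[vehicle_id] == CONSECUTIVE_READINGS:
                alerts.append(vehicle_id)
        else:
            streak[vehicle_id] = 0  # the pattern is broken, start over
    return alerts


readings = [
    ("car-1", 4500), ("car-2", 4200),
    ("car-1", 4600), ("car-2", 3000),
    ("car-1", 4700), ("car-2", 4300),
]
alerts = detect_over_revving(readings)
print(alerts)
```

The rule keeps a small amount of state per vehicle, which is the part a streaming engine would have to manage across batches.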

Questions

Thank you
Juantomás García
juantomas@aspgems.com
@juantomas
December 2010
