Kafka: The Definitive Guide - Confluent


Compliments of Confluent

Kafka: The Definitive Guide
Real-Time Data and Stream Processing at Scale

Neha Narkhede, Gwen Shapira & Todd Palino

Get Started With Apache Kafka Today

CONFLUENT OPEN SOURCE
A 100% open source Apache Kafka distribution for building robust streaming applications.

Connectors / Clients / Schema Registry / REST Proxy
- Thoroughly tested and quality assured
- Additional client support, including Python, C/C++ and .NET
- Easy upgrade path to Confluent Enterprise

Start today at confluent.io/download

Kafka: The Definitive Guide
Real-Time Data and Stream Processing at Scale

Neha Narkhede, Gwen Shapira, and Todd Palino

Beijing  Boston  Farnham  Sebastopol  Tokyo

Kafka: The Definitive Guide
by Neha Narkhede, Gwen Shapira, and Todd Palino

Copyright 2017 Neha Narkhede, Gwen Shapira, Todd Palino. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Shannon Cutt
Production Editor: Shiny Kalapurakkel
Copyeditor: Christina Edwards
Proofreader: Amanda Kersey
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2017: First Edition

Revision History for the First Edition
2017-07-07: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491936160 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Kafka: The Definitive Guide, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-99065-0
[LSI]

Table of Contents

Foreword
Preface

1. Meet Kafka
    Publish/Subscribe Messaging
        How It Starts
        Individual Queue Systems
    Enter Kafka
        Messages and Batches
        Schemas
        Topics and Partitions
        Producers and Consumers
        Brokers and Clusters
        Multiple Clusters
    Why Kafka?
        Multiple Producers
        Multiple Consumers
        Disk-Based Retention
        Scalable
        High Performance
    The Data Ecosystem
        Use Cases
    Kafka’s Origin
        LinkedIn’s Problem
        The Birth of Kafka
        Open Source
        The Name
    Getting Started with Kafka

2. Installing Kafka
    First Things First
        Choosing an Operating System
        Installing Java
        Installing Zookeeper
    Installing a Kafka Broker
    Broker Configuration
        General Broker
        Topic Defaults
    Hardware Selection
        Disk Throughput
        Disk Capacity
        Memory
        Networking
        CPU
    Kafka in the Cloud
    Kafka Clusters
        How Many Brokers?
        Broker Configuration
        OS Tuning
    Production Concerns
        Garbage Collector Options
        Datacenter Layout
        Colocating Applications on Zookeeper
    Summary

3. Kafka Producers: Writing Messages to Kafka
    Producer Overview
    Constructing a Kafka Producer
    Sending a Message to Kafka
        Sending a Message Synchronously
        Sending a Message Asynchronously
    Configuring Producers
    Serializers
        Custom Serializers
        Serializing Using Apache Avro
        Using Avro Records with Kafka
    Partitions
    Old Producer APIs
    Summary

4. Kafka Consumers: Reading Data from Kafka
    Kafka Consumer Concepts
        Consumers and Consumer Groups
        Consumer Groups and Partition Rebalance
    Creating a Kafka Consumer
    Subscribing to Topics
    The Poll Loop
    Configuring Consumers
    Commits and Offsets
        Automatic Commit
        Commit Current Offset
        Asynchronous Commit
        Combining Synchronous and Asynchronous Commits
        Commit Specified Offset
        Rebalance Listeners
        Consuming Records with Specific Offsets
        But How Do We Exit?
    Deserializers
    Standalone Consumer: Why and How to Use a Consumer Without a Group
    Older Consumer APIs
    Summary

5. Kafka Internals
    Cluster Membership
    The Controller
    Replication
    Request Processing
        Produce Requests
        Fetch Requests
        Other Requests
    Physical Storage
        Partition Allocation
        File Management
        File Format
        Indexes
        Compaction
        How Compaction Works
        Deleted Events
        When Are Topics Compacted?
    Summary

6. Reliable Data Delivery
    Reliability Guarantees
    Replication
    Broker Configuration
        Replication Factor
        Unclean Leader Election
        Minimum In-Sync Replicas
    Using Producers in a Reliable System
        Send Acknowledgments
        Configuring Producer Retries
        Additional Error Handling
    Using Consumers in a Reliable System
        Important Consumer Configuration Properties for Reliable Processing
        Explicitly Committing Offsets in Consumers
    Validating System Reliability
        Validating Configuration
        Validating Applications
        Monitoring Reliability in Production
    Summary

7. Building Data Pipelines
    Considerations When Building Data Pipelines
        Timeliness
        Reliability
        High and Varying Throughput
        Data Formats
        Transformations
        Security
        Failure Handling
        Coupling and Agility
    When to Use Kafka Connect Versus Producer and Consumer
    Kafka Connect
        Running Connect
        Connector Example: File Source and File Sink
        Connector Example: MySQL to Elasticsearch
        A Deeper Look at Connect
    Alternatives to Kafka Connect
        Ingest Frameworks for Other Datastores
        GUI-Based ETL Tools
        Stream-Processing Frameworks
    Summary

8. Cross-Cluster Data Mirroring
    Use Cases of Cross-Cluster Mirroring
    Multicluster Architectures
        Some Realities of Cross-Datacenter Communication
        Hub-and-Spokes Architecture
        Active-Active Architecture
        Active-Standby Architecture
        Stretch Clusters
    Apache Kafka’s MirrorMaker
        How to Configure
        Deploying MirrorMaker in Production
        Tuning MirrorMaker
    Other Cross-Cluster Mirroring Solutions
        Uber uReplicator
        Confluent’s Replicator
    Summary

9. Administering Kafka
    Topic Operations
        Creating a New Topic
        Adding Partitions
        Deleting a Topic
        Listing All Topics in a Cluster
        Describing Topic Details
    Consumer Groups
        List and Describe Groups
        Delete Group
        Offset Management
    Dynamic Configuration Changes
        Overriding Topic Configuration Defaults
        Overriding Client Configuration Defaults
        Describing Configuration Overrides
        Removing Configuration Overrides
    Partition Management
        Preferred Replica Election
        Changing a Partition’s Replicas
        Changing Replication Factor
        Dumping Log Segments
        Replica Verification
    Consuming and Producing
        Console Consumer
        Console Producer
    Client ACLs
    Unsafe Operations
        Moving the Cluster Controller
        Killing a Partition Move
        Removing Topics to Be Deleted
        Deleting Topics Manually
    Summary

10. Monitoring Kafka
    Metric Basics
        Where Are the Metrics?
        Internal or External Measurements
        Application Health Checks
        Metric Coverage
    Kafka Broker Metrics
        Under-Replicated Partitions
        Broker Metrics
        Topic and Partition Metrics
        JVM Monitoring
        OS Monitoring
        Logging
    Client Monitoring
        Producer Metrics
        Consumer Metrics
        Quotas
        Lag Monitoring
        End-to-End Monitoring
    Summary

11. Stream Processing
    What Is Stream Processing?
    Stream-Processing Concepts
        Time
        State
        Stream-Table Duality
        Time Windows
    Stream-Processing Design Patterns
        Single-Event Processing
        Processing with Local State
        Multiphase Processing/Repartitioning
        Processing with External Lookup: Stream-Table Join
        Streaming Join
        Out-of-Sequence Events
        Reprocessing
    Kafka Streams by Example
        Word Count
        Stock Market Statistics
        Click Stream Enrichment
    Kafka Streams: Architecture Overview
        Building a Topology
        Scaling the Topology
        Surviving Failures
    Stream Processing Use Cases
    How to Choose a Stream-Processing Framework
    Summary

A. Installing Kafka on Other Operating Systems

Index

Foreword

It’s an exciting time for Apache Kafka. Kafka is being used by tens of thousands of organizations, including over a third of the Fortune 500 companies. It’s among the fastest growing open source projects and has spawned an immense ecosystem around it. It’s at the heart of a movement towards managing and processing streams of data.

So where did Kafka come from? Why did we build it? And what exactly is it?

Kafka got its start as an internal infrastructure system we built at LinkedIn. Our observation was really simple: there were lots of databases and other systems built to store data, but what was missing in our architecture was something that would help us to handle the continuous flow of data. Prior to building Kafka, we experimented with all kinds of off-the-shelf options, from messaging systems to log aggregation and ETL tools, but none of them gave us what we wanted.

We eventually decided to build something from scratch. Our idea was that instead of focusing on holding piles of data like our relational databases, key-value stores, search indexes, or caches, we would focus on treating data as a continually evolving and ever-growing stream, and build a data system—and indeed a data architecture—oriented around that idea.

This idea turned out to be even more broadly applicable than we expected. Though Kafka got its start powering real-time applications and data flow behind the scenes of a social network, you can now see it at the heart of next-generation architectures in every industry imaginable. Big retailers are re-working their fundamental business processes around continuous data streams; car companies are collecting and processing real-time data streams from internet-connected cars; and banks are rethinking their fundamental processes and systems around Kafka as well.

So what is this Kafka thing all about? How does it compare to the systems you already know and use?

We’ve come to think of Kafka as a streaming platform: a system that lets you publish and subscribe to streams of data, store them, and process them, and that is exactly what Apache Kafka is built to be. Getting used to this way of thinking about data might be a little different than what you’re used to, but it turns out to be an incredibly powerful abstraction for building applications and architectures. Kafka is often compared to a couple of existing technology categories: enterprise messaging systems, big data systems like Hadoop, and data integration or ETL tools. Each of these comparisons has some validity but also falls a little short.

Kafka is like a messaging system in that it lets you publish and subscribe to streams of messages. In this way, it is similar to products like ActiveMQ, RabbitMQ, IBM’s MQSeries, and other products. But even with these similarities, Kafka has a number of core differences from traditional messaging systems that make it another kind of animal entirely. Here are the big three differences: first, it works as a modern distributed system that runs as a cluster and can scale to handle all the applications in even the most massive of companies. Rather than running dozens of individual messaging brokers, hand wired to different apps, this lets you have a central platform that can scale elastically to handle all the streams of data in a company. Secondly, Kafka is a true storage system built to store data for as long as you might like. This has huge advantages in using it as a connecting layer as it provides real delivery guarantees—its data is replicated, persistent, and can be kept around as long as you like. Finally, the world of stream processing raises the level of abstraction quite significantly. Messaging systems mostly just hand out messages. The stream processing capabilities in Kafka let you compute derived streams and datasets dynamically off of your streams with far less code. These differences make Kafka enough of its own thing that it doesn’t really make sense to think of it as “yet another queue.”

Another view on Kafka—and one of our motivating lenses in designing and building it—was to think of it as a kind of real-time version of Hadoop. Hadoop lets you store and periodically process file data at a very large scale. Kafka lets you store and continuously process streams of data, also at a large scale. At a technical level, there are definitely similarities, and many people see the emerging area of stream processing as a superset of the kind of batch processing people have done with Hadoop and its various processing layers. What this comparison misses is that the use cases that continuous, low-latency processing opens up are quite different from those that naturally fall on a batch processing system. Whereas Hadoop and big data targeted analytics applications, often in the data warehousing space, the low latency nature of Kafka makes it applicable for the kind of core applications that directly power a business. This makes sense: events in a business are happening all the time and the ability to react to them as they occur makes it much easier to build services that directly power the operation of the business, feed back into customer experiences, and so on.

The final area Kafka gets compared to is ETL or data integration tools. After all, these tools move data around, and Kafka moves data around. There is some validity to this as well, but I think the core difference is that Kafka has inverted the problem. Rather than a tool for scraping data out of one system and inserting it into another, Kafka is a platform oriented around real-time streams of events. This means that not only can it connect off-the-shelf applications and data systems, it can power custom applications built to trigger off of these same data streams. We think this architecture centered around streams of events is a really important thing. In some ways these flows of data are the most central aspect of a modern digital company, as important as the cash flows you’d see in a financial statement.

The ability to combine these three areas—to bring all the streams of data together across all the use cases—is what makes the idea of a streaming platform so appealing to people.

Still, all of this is a bit different, and learning how to think and build applications oriented around continuous streams of data is quite a mindshift if you are coming from the world of request/response style applications and relational databases. This book is absolutely the best way to learn about Kafka, from internals to APIs, written by some of the people who know it best. I hope you enjoy reading it as much as I have!

— Jay Kreps
Cofounder and CEO at Confluent

Preface

The greatest compliment you can give an author of a technical book is “This is the book I wish I had when I got started with this subject.” This is the goal we set for ourselves when we started writing this book. We looked back at our experience writing Kafka, running Kafka in production, and helping many companies use Kafka to build software architectures and manage their data pipelines and we asked ourselves, “What are the most useful things we can share with new users to take them from beginners to experts?” This book is a reflection of the work we do every day: run Apache Kafka and help others use it in the best ways.

We included what we believe you need to know in order to successfully run Apache Kafka in production and build robust and performant applications on top of it. We highlighted the popular use cases: message bus for event-driven microservices, stream-processing applications, and large-scale data pipelines. We also focused on making the book general and comprehensive enough so it will be useful to anyone using Kafka, no matter the use case or architecture. We cover practical matters such as how to install and configure Kafka and how to use the Kafka APIs, and we also dedicated space to Kafka’s design principles and reliability guarantees, and explore several of Kafka’s delightful architecture details: the replication protocol, controller, and storage layer. We believe that knowledge of Kafka’s design and internals is not only a fun read for those interested in distributed systems, but it is also incredibly useful for those who are seeking to make informed decisions when they deploy Kafka in production and design applications that use Kafka. The better you understand how Kafka works, the more you can make informed decisions regarding the many trade-offs that are involved in engineering.

One of the problems in software engineering is that there is always more than one way to do anything. Platforms such as Apache Kafka provide plenty of flexibility, which is great for experts but makes for a steep learning curve for beginners. Very often, Apache Kafka tells you how to use a feature but not why you should or shouldn’t use it. Whenever possible, we try to clarify the existing choices, the trade-offs involved, and when you should and shouldn’t use the different options presented by Apache Kafka.

Who Should Read This Book

Kafka: The Definitive Guide was written for software engineers who develop applications that use Kafka’s APIs and for production engineers (also called SREs, devops, or sysadmins) who install, configure, tune, and monitor Kafka in production. We also wrote the book with data architects and data engineers in mind—those responsible for designing and building an organization’s entire data infrastructure. Some of the chapters, especially chapters 3, 4, and 11, are geared toward Java developers. Those chapters assume that the reader is familiar with the basics of the Java programming language, including topics such as exception handling and concurrency. Other chapters, especially chapters 2, 8, 9, and 10, assume the reader has some experience running Linux and some familiarity with storage and network configuration in Linux. The rest of the book discusses Kafka and software architectures in more general terms and does not assume special knowledge.

Another category of people who may find this book interesting are the managers and architects who don’t work directly with Kafka but work with the people who do. It is just as important that they understand the guarantees that Kafka provides and the trade-offs that their employees and coworkers will need to make while building Kafka-based systems. The book can provide ammunition to managers who would like to get their staff trained in Apache Kafka or ensure that their teams know what they need to know.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Kafka: The Definitive Guide by Neha Narkhede, Gwen Shapira, and Todd Palino (O’Reilly). Copyright 2017 Neha Narkhede, Gwen Shapira, and Todd Palino, 978-1-491-93616-0.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://oreil.ly/2tVmYjk.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

We would like to thank the many contributors to Apache Kafka and its ecosystem. Without their work, this book would not exist. Special thanks to Jay Kreps, Neha Narkhede, and Jun Rao, as well as their colleagues and the leadership at LinkedIn, for cocreating Kafka and contributing it to the Apache Software Foundation.

Many people provided valuable feedback on early versions of the book and we appreciate their time and expertise: Apurva Mehta, Arseniy Tashoyan, Dylan Scott, Ewen Cheslack-Postava, Grant Henke, Ismael Juma, James Cheng, Jason Gustafson, Jeff Holoman, Joel Koshy, Jonathan Seidman, Matthias Sax, Michael Noll, Paolo Castagna, and Jesse Anderson. We also want to thank the many readers who left comments and feedback via the rough-cuts feedback site.

Many reviewers helped us out and greatly improved the quality of this book, so any mistakes left are our own.

We’d like to thank our O’Reilly editor Shannon Cutt for her encouragement and patience, and for being far more on top of things than we were. Working with O’Reilly is a great experience for an author—the support they provide, from tools to book signings, is unparalleled. We are grateful to everyone involved in making this happen and we appreciate their choice to work with us.

And we’d like to thank our managers and colleagues for enabling and encouraging us while writing the book.

Gwen wants to thank her husband, Omer Shapira, for his support and patience during the many months spent writing yet another book; her cats, Luke and Lea, for being cuddly; and her dad, Lior Shapira, for teaching her to always say yes to opportunities, even when it seems daunting.

Todd would be nowhere without his wife, Marcy, and daughters, Bella and Kaylee, behind him all the way. Their support for all the extra time writing, and long hours running to clear his head, keeps him going.

CHAPTER 1
Meet Kafka

Every enterprise is powered by data. We take information in, analyze it, manipulate it, and create more as output. Every application creates data, whether it is log messages, metrics, user activity, outgoing messages, or something else. Every byte of data has a story to tell, something of importance that will inform the next thing to be done. In order to know what that is, we need to get the data from where it is created to where it can be analyzed. We see this every day on websites like Amazon, where our clicks on items of interest to us are turned into recommendations that are shown to us a little later.

The faster we can do this, the more agile and responsive our organizations can be. The less effort we spend on moving data around, the more we can focus on the core business at hand. This is why the pipeline is a critical component in the data-driven enterprise. How we move the data becomes nearly as important as the data itself.

    Any time scientists disagree, it’s because we have insufficient data. Then we can agree on what kind of data to get; we get the data; and the data solves the problem. Either I’m right, or you’re right, or we’re both wrong. And we move on.
    —Neil deGrasse Tyson

Publish/Subscribe Messaging

Before discussing the specifics of Apache Kafka, it is important for us to understand the concept of publish/subscribe messaging and why it is important. Publish/subscribe messaging is a pattern that is characterized by the sender (publisher) of a piece of data (message) not specifically directing it to a receiver. Instead, the publisher classifies the message somehow, and that receiver (subscriber) subscribes to receive certain classes of messages. Pub/sub systems often have a broker, a central point where messages are published, to facilitate this.
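The roles in the pattern can be sketched with a toy in-memory broker (this is an illustration of the pub/sub idea only, not how Kafka is implemented; all class and topic names here are hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// A toy pub/sub broker: publishers never address receivers directly.
// They publish a message under a named class of messages (a "topic"),
// and the broker delivers it to every subscriber of that class.
public class ToyBroker {
    private final Map<String, List<Consumer<String>>> subscribers =
            new ConcurrentHashMap<>();

    // A subscriber registers interest in one class of messages.
    public void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>())
                   .add(handler);
    }

    // A publisher hands the message to the broker; it does not know
    // who (if anyone) will receive it.
    public void publish(String topic, String message) {
        subscribers.getOrDefault(topic, List.of())
                   .forEach(handler -> handler.accept(message));
    }

    public static void main(String[] args) {
        ToyBroker broker = new ToyBroker();
        // Two independent subscribers to the same class of messages:
        broker.subscribe("metrics", m -> System.out.println("dashboard got: " + m));
        broker.subscribe("metrics", m -> System.out.println("analyzer got: " + m));
        // The publisher names only the topic, never a receiver:
        broker.publish("metrics", "cpu.load=0.75");
    }
}
```

Both subscribers receive the message even though the publisher knows nothing about either of them; that decoupling is the essence of the pattern, and adding a third consumer requires no change to the publisher.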

How It Starts

Many use cases for publish/subscribe start out the same way: with a simple message queue or interprocess communication channel. For example, you create an application that needs to send monitoring information somewhere, so you write in a direct connection from your application to an app that displays your metrics on a dashboard, and push metrics over that connection, as seen in Figure 1-1.

Figure 1-1. A single, direct metrics publisher

This is a simple solution to a simple problem that works when you are getting started with monitoring. Before long, you decide you would like to analyze your metrics over a longer term, and that doesn’t work well in the dashboard. You start a new service that can receive metrics, store them, and analyze them. In order to support this, you modify your application to write metrics to both systems. By now you have three more applications that are generating metrics, and they all make the same connections to these two services. Your coworker thinks it would be a good idea to do active polling of the services for alerting as well, so you add a server on each of the applications to provide metrics on request. After a while, you have more applications that are using those servers to get individual metrics and use them for various purposes. This architecture can look much like Figure 1-2, with connections that are even harder to trace.

Figure 1-2. Many metrics publishers, using direct connections

The technical debt built up here is obvious, so you decide to pay some of it back. You set up a single application that receives metrics from all the applications out there, and provide a server to query those metrics for any system that needs them. This reduces the complexity of the architecture to something similar to Figure 1-3. Congratulations, you have built a publish-subscribe messaging system!

Figure 1-3. A metrics publish/subscribe system

Individual Queue Systems

At the same time that you have been waging this war with metrics, one of your coworkers has been doing similar work with log messages. Another has been working on tracking user behavior on the frontend website and providing that information to developers who are working on machine learning, as well as creating some reports for management. You have all followed a similar path of building out systems that decouple the publishers of the information from the subscribers t
