Resource Utilization Comparison Of Cassandra And Elasticsearch


URI: bth-18665

Resource utilization comparison of Cassandra and Elasticsearch

Nizar Selander
September 2019

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the bachelor’s degree in software engineering. The thesis is equivalent to 10 weeks of full-time studies.

Contact Information:

Author:
Nizar Selander, nizar.selander@gmail.com

External advisors:
Ruwan Lakmal Silva, ruwan.lakmal.silva@ericsson.com
Pär Karlsson, par.a.karlsson@ericsson.com

University advisors:
Krzysztof Wnuk, krzysztof.wnuk@bth.se
Conny Johansson, conny.johansson@bth.se

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden
Internet : www.bth.se
Phone : 46 455 38 50 00
Fax : 46 455 38 50 57

I. Abstract

Elasticsearch and Cassandra are two of the most widely used databases today, with Elasticsearch showing a more recent resurgence due to its full text search feature, akin to that of a search engine, which contrasts with the conventional query language-based methods used to perform data searching and retrieval operations. The demand for more powerful and better performing, yet more feature-rich and flexible, databases has been ever growing. This project attempts to study how the two databases perform under a specific workload of 2,000,000 fixed-size logs, in an environment where the two can be compared while keeping the results of the experiment meaningful for the production environment for which they are intended.

A total of three benchmarks were carried out: an Elasticsearch deployment using the default configuration and two Cassandra deployments, one using the default configuration and one using a modified configuration that reflects a configuration currently running in production for the task at hand.

The benchmarks showed notable performance differences in terms of CPU, memory and disk space usage. Elasticsearch showed the best performance overall, using significantly less memory and disk space as well as, to some degree, less CPU. However, the benchmarks were done with a very specific set of configurations, data set and workload. Those limitations should be considered when comparing the benchmark results.

Keywords: Databases, Benchmark, Performance, Kubernetes, Cassandra, Elasticsearch.

II. Acknowledgment

I would like to thank my incredible supervisors Ruwan Lakmal Silva, Pär Karlsson and Stefan Wallin at Ericsson as well as Krzysztof Wnuk and Conny Johansson at BTH for all the guidance and support they have given me and for affording me this valuable opportunity. I would also like to thank my family for their support in making my achievements possible.

III. Contents

I. Abstract
II. Acknowledgment
III. Contents
1. Introduction
  1.1. Context
  1.2. Problem
  1.3. Target Group
  1.4. Delimitations
  1.5. Research questions
2. Background
  2.1. History and Overview
  2.2. Flat model
  2.3. Relational
  2.4. Post-Relational
3. Environment
  3.1. App
  3.2. Stream processing
  3.3. Log stashing
  3.4. Storage
  3.5. Management
4. Experiment
  4.1. Method
  4.2. Design
  4.3. Data replication & consistency
  4.4. Deployment
5. Results
  5.1. CPU usage
    5.1.1. Cassandra default configuration: CPU usage
    5.1.2. Cassandra modified configuration: CPU usage
    5.1.3. Elasticsearch default configuration: CPU usage
    5.1.4. All deployments: Total CPU usage
  5.2. Memory
    5.2.1. Cassandra default configuration: Memory usage
    5.2.2. Cassandra modified configuration: Memory usage
    5.2.3. Elasticsearch default configuration: Memory usage
    5.2.4. All deployments: Total memory usage
  5.3. Disk space
    5.3.1. Cassandra default configuration: Commit disk usage
    5.3.2. Cassandra default configuration: Data disk usage
    5.3.3. Cassandra modified configuration: Commit disk usage
    5.3.4. Cassandra modified configuration: Data disk usage
    5.3.5. Elasticsearch default configuration: Data disk usage
    5.3.6. All deployments: Total disk usage
6. Analysis
  6.1. CPU utilization analysis
    6.1.1. CPU usage vs logs output rate
    6.1.2. CPU usage vs logs output delta
    6.1.3. CPU usage vs Producer and Logstasher CPU usage
  6.2. Memory utilization analysis
    6.2.1. Memory usage vs logs output rate
    6.2.2. Memory usage vs Producer & Logstasher CPU usage
  6.3. Disk space utilization analysis
    6.3.1. Disk space usage vs logs output rate
7. Conclusion
8. Concluding remarks
  8.1. Summary
  8.2. Limitations
  8.3. Future work
9. References

1. Introduction

Ever since humanity developed written communication and evolved from oral cultures into ones capable of storing and preserving their knowledge, people have been able to record information vital to their identity, beliefs, culture and trade, to name a few. Although the fundamental requirements of physical information storage, such as reliability, security, accessibility and cost, have been left largely unchanged at the most basic level, the challenges we face today in meeting those demands are far more complex [1]. This is largely due to the scale and speed at which we operate, and to ever-growing technology and how it continues to affect the world around us and how we interact with it.

As more and more of the processes that underpin our infrastructure and business world are being digitalized, many of them have found a home in distributed cloud computing platforms. With that, the demand for more efficient, more reliable and better use case specific technologies for handling, managing and processing such operations has increased, as is evident from the number of choices available for consideration.

The topic of this thesis project, a study into the resource utilization of Cassandra [1] and Elasticsearch [2], is one such attempt at comparing two available technologies, and seeks to provide empirical and relevant information to aid in the decision making between the two databases in a real-world scenario.

1.1. Context

The purpose of this research is to study and examine how two databases perform compared to each other in terms of resource utilization. Cassandra, one of the two technologies, is currently used in production at Ericsson to store operation and transaction logs which are produced by applications in a Kubernetes system. Elasticsearch, an alternative solution, is being considered as a potential replacement. By benchmarking the two and measuring their resource utilization in terms of CPU, disk and memory usage, the experiment will provide insight into how they compare and will allow for a cost-benefit analysis to be conducted on whether Elasticsearch is a suitable alternative for the use case at hand.

1.2. Problem

Cassandra, a wide column storage database, is currently used in one of the products offered by Ericsson for storing logs. While it handles the task at hand just fine, an alternative database, Elasticsearch, offers features which are interesting and useful for the product in which it is being considered for implementation. Full text search is one such feature: it allows for searching through records stored in the database similar to how one would use a search engine. This is largely due to its document-based data model, which allows for greater operational flexibility. The question, however, is at what cost such a feature would come. How would Elasticsearch perform compared to the currently implemented database in terms of resource utilization? Those are the questions which this thesis aims to answer. By putting the two in a controlled environment in which they are benchmarked under the same workload, the data will help us understand how the two utilize resources in terms of CPU, disk and memory usage.

1.3. Target Group

This research is aimed at organizations, groups and individuals with an interest in finding information about how Elasticsearch would perform compared to Cassandra in terms of resource utilization in a real-world use case. The study could also be of interest to those looking to explore how such a problem could be tackled, as well as those who want to further their knowledge in this field.

1.4. Delimitations

The study deals with a very specific deployment, configuration, workload, data set, database versions and dependencies. It is important to keep this in mind when drawing conclusions from the results in this thesis.

1.5. Research questions

RQ.1: How would Cassandra and Elasticsearch perform in terms of resource utilization under a heavy workload?

RQ.2: What factors influence their resource utilization?

2. Background

This study focuses on a cost-benefit analysis of the performance of the Cassandra database and Elasticsearch in a real-world problem, in co-operation with Ericsson. The goal is to establish how the two technologies perform relative to each other in a data storing task in a Kubernetes environment. To begin with, this chapter looks at a brief history of the development of databases, and how we got here, to familiarize ourselves with key concepts.

2.1. History and Overview

Although databases are strongly associated with computers and digital information, humans stored and cataloged information long before the computer era. A recently uncovered Sumerian medical tablet that dates to 2400 BC and lists 15 prescriptions used by a pharmacist [2] gives us a glimpse of how far back the practice dates in our history and of its fascinating progression. The concepts and philosophies used to build and improve those systems have both formed and helped guide the development of databases to where they are today.

Techno
