Performance Comparison Between MinIO And Amazon S3 For Starburst Presto SQL

Transcription

Performance comparison between MinIO and Amazon S3 for Starburst Presto SQL
July 2019

MinIO is a high-performance object storage server designed for AI and ML workloads. It is fully compatible with the Amazon S3 API.

Machine learning, big-data analytics, and other AI workloads have traditionally used the map-reduce model of computing, where data is local to the compute jobs. Modern computing environments have adopted a cloud-native architecture in which storage and compute are disaggregated. This enables computing to become stateless, elastic, and scalable independently of storage. Object storage has become the de facto standard for this architecture.

Presto is a distributed query engine that can analyze billions of records at very high speeds by distributing computational tasks across multiple servers. It is used by hyperscalers like Facebook, Airbnb and Dropbox.

The performance of the underlying storage is critically important at this scale.

This document describes the benchmarking tests and the results of measuring the performance of the Starburst Presto distribution with MinIO Object Storage.

[Figure: Starburst Presto with MinIO Object Storage. Query Time versus Query #.]

High Performance Object Storage

1. Benchmark Environment

1.1 Hardware

For the purpose of this benchmark, MinIO utilized AWS bare-metal, storage-optimized instances with local NVMe drives and 100 GbE networking. The nodes running Presto (c5n.18xlarge) have the highest available bandwidth to the S3 service.

                 # Nodes   AWS Instance type   CPU   MEM      Storage       Network
Presto Master    1         i3en.xlarge         4     32 GB    1 x 2500 GB   Up to 25 Gbps
Presto Worker    8         c5n.18xlarge        72    192 GB   EBS           100 Gbps
MinIO Server     8         i3en.24xlarge       96    768 GB   8 x 7500 GB   100 …

[Figure: architecture diagram. Presto worker nodes (c5n.18xlarge) access the MinIO storage nodes through the S3 API over a 100 GbE network.]

1.2 Software

Property    Value
Presto      Starburst Distribution: …
Benchmark   TPC-H Benchmark Scaling Factor: 1000
Server OS   CentOS Linux 7 (Core)

TPC-H benchmark

TPC Benchmark H consists of a set of business queries designed to exercise system functionalities in a manner representative of complex business analysis applications. These queries portray the activity of a wholesale supplier and add necessary context to the components of the benchmark.

Dataset

TPC Benchmark H provides its own dataset comprising eight tables that represent a complex business environment. The tables are interrelated, facilitating complex queries across multiple tables.

The size of the dataset is variable and is chosen based on the underlying storage system. It is determined by a scaling factor (SF): a scaling factor of 1 produces a dataset of approximately 1 GB, a scaling factor of 100 produces approximately 100 GB, and so on. The scaling factor is passed to a dataset generation tool to generate the data.

This benchmark used scaling factor 1000. A summary detailing the dataset is presented below:

Table      # Records        # Records (SF: 1000)
Customer   150,000 * SF     150,000,000
Orders     1,500,000 * SF   1,500,000,000
Lineitem   6,000,000 * SF   6,000,000,000
Supplier   10,000 * SF      10,000,000
Part       200,000 * SF     200,000,000
PartSupp   800,000 * SF     800,000,000
Region     5                5
Nation     25               25
Total                       8.66 Billion

The data was converted to ORC (Optimized Row Columnar) format and stored in a MinIO bucket. Converting to this format automatically compresses the data, which shrank the data size to 273 GB.
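The relationship between the scaling factor and the record counts in the table can be checked with a short calculation. The following is a minimal sketch in Python; the names `SCALED_TABLES`, `FIXED_TABLES` and `record_counts` are illustrative and are not part of any TPC-H tool. The per-table base counts are taken from the table in this section.

```python
# Base row counts per unit of scaling factor, as listed in the dataset table.
SCALED_TABLES = {
    "Customer": 150_000,
    "Orders": 1_500_000,
    "Lineitem": 6_000_000,
    "Supplier": 10_000,
    "Part": 200_000,
    "PartSupp": 800_000,
}

# Region and Nation have a fixed size and do not grow with the scaling factor.
FIXED_TABLES = {"Region": 5, "Nation": 25}


def record_counts(scaling_factor: int) -> dict:
    """Return the number of records per table for a given scaling factor."""
    counts = {name: base * scaling_factor for name, base in SCALED_TABLES.items()}
    counts.update(FIXED_TABLES)
    return counts


if __name__ == "__main__":
    counts = record_counts(1000)
    for name, n in counts.items():
        print(f"{name:<10}{n:>16,}")
    print(f"{'Total':<10}{sum(counts.values()):>16,}")
```

At scaling factor 1000 the total comes to 8,660,000,030 records, which matches the 8.66 billion reported in the table.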

Starburst Presto Performance Tuning

Starburst Presto was configured to utilize 1 TB of aggregate memory across 8 worker nodes using the following settings:

    cat etc/config.properties | grep 'query.max-memory'
    query.max-memory=1024GB
    query.max-memory-per-node=128GB

The JVM configuration was updated to complement the above settings:

    cat etc/jvm.properties | grep 'Xmx'
    -Xmx192G

The Hive connector was used to connect Starburst Presto to MinIO. The Hive connector was tuned using the following settings:

    …ed false

In addition to the above optimizations, Starburst Presto was configured to run queries only on the higher-capacity worker nodes and never on the master node. This was achieved with the following setting:

    cat etc/config.properties | grep …ator
    …ator false
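The memory settings above are consistent with each other: 8 workers at 128 GB per node give 1024 GB of aggregate query memory. The following Python sketch checks that arithmetic. The helper names `parse_size_gb` and `cluster_memory_ok` are hypothetical and are not part of Presto, and the sketch assumes binary units (1 TB = 1024 GB).

```python
import re

# Conversion factors to GB, assuming binary units (an assumption of this sketch).
UNITS_IN_GB = {"MB": 1 / 1024, "GB": 1, "TB": 1024}


def parse_size_gb(text: str) -> float:
    """Parse a size string such as '128GB' and return its value in GB."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(MB|GB|TB)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = match.groups()
    return float(value) * UNITS_IN_GB[unit]


def cluster_memory_ok(max_memory: str, max_memory_per_node: str, workers: int) -> bool:
    """Check that the cluster-wide limit fits within the per-node limits combined."""
    return parse_size_gb(max_memory) <= parse_size_gb(max_memory_per_node) * workers


if __name__ == "__main__":
    # Values from the configuration shown in this section.
    print(cluster_memory_ok("1024GB", "128GB", workers=8))
```

With the values from this benchmark the check passes; with only 7 workers the per-node limits would add up to 896 GB and the same cluster-wide limit could not be reached.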

2. Benchmark Results

The time taken for each of the 22 TPC-H queries is presented below:

Query #   Query Execution Time (seconds), Presto w/ …
…         …
…         57.3
19        26.1
20        206.4
21        148.2
22        7.7
Average   47.4

The values presented above serve as a reference point for benchmarking future versions of MinIO and other applications serving similar use cases.

3. Comparing MinIO to Amazon S3

The same benchmark tests were run against data stored in Amazon S3, using the same hardware for Starburst Presto. It should be noted that MinIO is strictly consistent, whereas Amazon S3 is only eventually consistent. The performance was largely the same, with some queries slower than on MinIO and others faster. A graph summarizing the query times for Starburst Presto workloads on MinIO and S3 is presented below:

[Figure: Starburst Presto with MinIO Object Storage. Query Time versus Query # for two series, Presto w/ MinIO and Presto w/ S3.]

4. Results

The results show that the performance difference between Starburst Presto backed by S3 and Starburst Presto backed by MinIO is negligible, making MinIO an attractive alternative for large-scale, high-performance, data-intensive workloads in private cloud environments.
