Alibaba Cloud Computing Ltd.


Alibaba Cloud Computing Ltd.
TPC Benchmark DS
Full Disclosure Report
for
Alibaba Cloud E-MapReduce
(with 41 Alibaba Cloud Elastic Compute Service Servers)
using
E-MapReduce 3.21.2
and
CentOS Linux Release 7.4

Second Edition (first edition released on September 16, 2019)
April 4, 2021

Second Edition – April, 2021

Alibaba Cloud and the Alibaba Cloud Logo are trademarks of Alibaba Group and/or its affiliates in the U.S. and other countries.

The Alibaba Cloud products, services or features identified in this document may not yet be available or may not be available in all areas and may be subject to change without notice. Consult your local Alibaba Cloud business contact for information on the products or services available in your area. You can find additional information via Alibaba Cloud’s international website at https://www.alibabacloud.com/. Actual performance and environmental costs of Alibaba Cloud products will vary depending on individual customer configurations and conditions.

Alibaba Cloud E-MapReduce Full Disclosure Report, TPC-DS 3.0.0

Table of Contents

Abstract
Preface
    TPC Benchmark™ DS Overview
General Items
    0.1 Test Sponsor
    0.2 Parameter Settings
    0.3 Configuration Diagrams
Clause 2: Logical Database Design Related Items
    2.1 Database Definition Statements
    2.2 Physical Organization
    2.3 Horizontal Partitioning
    2.4 Replication
Clause 3: Scaling and Database Population
    3.1 Initial Cardinality of Tables
    3.2 Distribution of Tables and Logs Across Media
    3.3 Mapping of Database Partitions/Replications
    3.4 Implementation of RAID
    3.5 DBGEN Modifications
    3.6 Database Load time
    3.7 Data Storage Ratio
    3.8 Database Load Mechanism Details and Illustration
    3.9 Qualification Database Configuration
Clause 4 and 5: Query and Data Maintenance Related Items
    4.1 Query Language
    4.2 Verifying Method of Random Number Generation
    4.3 Generating Values for Substitution Parameters
    4.4 Query Text and Output Data from Qualification Database
    4.5 Query Substitution Parameters and Seeds Used
    4.6 Refresh Setting
    4.7 Source Code of Refresh Functions
    4.8 Staging Area
Clause 6: Data Persistence Properties Related Items
Clause 7: Performance Metrics and Execution Rules Related Items
    7.1 System Activity
    7.2 Test Steps
    7.3 Timing Intervals for Each Query and Refresh Function
    7.4 Throughput Test Result
    7.5 Time for Each Stream
    7.6 Time for Each Refresh Function
    7.7 Performance Metrics
Clause 8: SUT and Driver Implementation Related Items
    8.1 Driver
    8.2 Implementation Specific Layer (ISL)
    8.3 Profile-Directed Optimization
Clause 9: Pricing Related Items
    9.1 Hardware and Software Used
    9.2 Availability Date
    9.3 Country-Specific Pricing
Clause 11: Audit Related Items
    Auditor’s Information and Attestation Letter
Supporting Files Index
Appendix A: Purchase Page of Creating Alibaba Cloud E-MapReduce Cluster with 1-Year Subscription
Appendix B: Third Party Price Quotes

Abstract

This document contains the methodology and results of the TPC Benchmark DS (TPC-DS) test conducted in conformance with the requirements of the TPC-DS Standard Specification, Revision 3.0.0.

The test was conducted at a Scale Factor of 100000GB with 41 Alibaba Cloud Elastic Compute Service Servers running E-MapReduce 3.21.2 on CentOS Linux Release 7.4.

Measured Configuration
    Company Name:       Alibaba Cloud Computing Ltd.
    Cluster Node:       Alibaba Cloud Elastic Compute Service Server
    Database Software:  Alibaba Cloud E-MapReduce 3.21.2
    Operating System:   CentOS Linux Release 7.4

TPC Benchmark DS Metrics
    Total System Cost:   2,604,064.68 USD
    TPC-DS Throughput:   14,861,137 QphDS@100000GB
    Price/Performance:   175.23 USD/kQphDS@100000GB
    Availability Date:   As of Publication

Alibaba Cloud E-MapReduce
TPC-DS: 3.0.0    TPC-Pricing: 2.4.0    Report Date: Apr. 2, 2021

    Total System Cost:     2,604,064.68 USD
    TPC-DS Throughput:     14,861,137 QphDS@100000GB
    Price / Performance:   175.23 USD/kQphDS@100000GB

    Dataset Size (1):          100,000 GB
    Database Manager:          E-MapReduce 3.21.2
    Operating System:          CentOS Linux Release 7.4
    Other Software:            N/A
    Cluster:                   Yes
    System Availability Date:  As of Publication

Benchmarked Configuration (diagram)
Elapsed Time (chart; surviving labels: DM2 1,252.4, 1%; LOAD 11,830.0; PT; 39%)

    Load includes backup:  No
    RAID:                  RAID-1 for metadata; HDFS with 3-way replication for table data

System Configuration: Alibaba Cloud E-MapReduce Cluster
    Servers:                         1 x ecs.hfg5.6xlarge, 40 x ecs.i2g.16xlarge
    Total Processors/Cores/Threads:  41/1,292/2,584
    Total Memory:                    10,336 GB
    Total Storage (2):               290,480 GB
    Storage Ratio (3):               2.91

Server Configuration: Per node (ecs.hfg5.6xlarge)
    Processors:      Intel(R) Xeon(R) Gold 6149 CPU @ 3.10GHz, 22 MB L3
    Memory:          96 GB
    Network:         Bandwidth: 4.5 Gbps, Packet forwarding rate: 2,000,000
    Storage Device:  3 x 100 GB SSD Cloud Disk (data disk)
                     1 x 100 GB SSD Cloud Disk (boot disk)

Server Configuration: Per node (ecs.i2g.16xlarge)
    Processors:      Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz, 33 MB L3
    Memory:          256 GB
    Network:         Bandwidth: 10.0 Gbps, Packet forwarding rate: 4,000,000
    Storage Device:  4 x 1788 GB NVMe SSD Local Disk (data disk)
                     1 x 100 GB Ultra Cloud Disk (boot disk)

1. Dataset Size includes only raw data (i.e., no temp, index, redundant storage space, etc.).
2. Total Storage = (100 + 100 * 3) (Master node) + (100 + 1,788 * 4) * 40 (Worker nodes) = 290,480 GB
3. Storage Ratio = Total Storage / SF = 290,480 GB / 100,000 GB
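The storage footnotes can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are made up for this example, and the disk sizes are the ones listed in the server configuration.

```python
# Disk sizes in GB, as listed in the server configuration.
MASTER_DISKS_GB = [100] + [100] * 3     # 1 boot disk + 3 SSD Cloud data disks
WORKER_DISKS_GB = [100] + [1788] * 4    # 1 boot disk + 4 NVMe local data disks
WORKER_COUNT = 40
SCALE_FACTOR_GB = 100_000


def total_storage_gb(master_disks, worker_disks, worker_count):
    """Sum all disks on the master node plus all disks on every worker node."""
    return sum(master_disks) + sum(worker_disks) * worker_count


total_gb = total_storage_gb(MASTER_DISKS_GB, WORKER_DISKS_GB, WORKER_COUNT)
storage_ratio = total_gb / SCALE_FACTOR_GB

print(total_gb)                 # total storage in GB
print(f"{storage_ratio:.4f}")   # storage ratio before rounding
```

The total reproduces the 290,480 GB stated in the report. The unrounded quotient is about 2.9048, so the reported 2.91 presumably reflects rounding up.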

Columns: Description | Part Number | Src | Unit Price (USD) | Qty | Ext. Price (USD) | 3-Year Maint. (USD)

Licensed Compute Services
    Virtual cloud server
        ECS Instance ecs.hfg5.6xlarge | ecs.hfg5.6xlarge (China North …
        ECS System Disk (SSD Cloud Disk 100GB)
        ECS Data Disk (SSD Cloud Disk 100GB)
    Virtual cloud server
        ECS Instance ecs.i2g.16xlarge | ecs.i2g.16xlarge (China North 2) | 1 | 18,628.46 | 120
        - NVMe SSD Local Disk (4 x 1788 GB) | Included
        ECS System Disk (Ultra Cloud Disk 100GB) | Option | 1 | 78.54 | 120
    Licensed Compute Services Sub-Total

Licensed Software Services
    E-MapReduce for emr.hfg5.6xlarge | emr.hfg5.6xlarge (China North 2) | 1 | 818.4888 | 3
    E-MapReduce for emr.i2g.16xlarge | emr.i2g.16xlarge (China North 2) | 1 | 2,793.9840 | 120
    Licensed Software Services Sub-Total

Other Components
    Lenovo 120S-14IAP Laptop (Includes spares) | 81A5001UUS | 2 | 149.99 | 3
    Other Components Sub-Total

3-Year Cost of Ownership:  2,604,064.68
QphDS@100000GB:            14,861,137
USD/kQphDS@100000GB:       175.23

Src: 1 Alibaba Cloud, 2 Bestbuy.com
All Licensed Services prices are per year and based on 1-year pre-paid subscriptions.
Audited by Francois Raab, InfoSizing

Prices used in TPC benchmarks reflect the actual prices a customer would pay for a one-time purchase of the stated components. Individually negotiated discounts are not permitted. Special prices based on assumptions about past or future purchases are not permitted. All discounts reflect standard pricing policies for the listed components. For complete details, see the pricing sections of the TPC benchmark specifications. If you find that the stated prices are not available according to these terms, please inform the TPC at pricing@tpc.org. Thank you.
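The price/performance figure follows directly from the two headline numbers on this page. A minimal sketch, with a helper name chosen for this example only:

```python
def price_performance(total_cost_usd: float, qphds: float) -> float:
    """Return USD per thousand QphDS (USD/kQphDS)."""
    return total_cost_usd / (qphds / 1000.0)


TOTAL_COST_USD = 2_604_064.68   # 3-Year Cost of Ownership from the report
QPHDS = 14_861_137              # QphDS@100000GB from the report

ppk = price_performance(TOTAL_COST_USD, QPHDS)
print(round(ppk, 2))
```

Rounded to two decimals, this matches the 175.23 USD/kQphDS@100000GB stated in the report.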

Metrics Details:

    Name                Value       Unit
    Scale Factor (SF)   100,000     GB
    Streams             4           Stream
    Queries (Q)         396         Queries
    T load              11,830.0    Second
    T ld                0.1315      Hour
    T power             16,093.5    Second
    T pt                17.8817     Hour
    T tt1               54,184.7    Second
    T tt2               55,689.1    Second
    T dm1               1,276.3     Second
    T dm2               1,252.4     Second
    T tt                30.5205     Hour
    T dm                0.7025      Hour

    Load Step   Start                  End                    (sec.)      (hh:mm:ss)
    Build       08/25/19 12:58:52.38   08/25/19 16:16:02.34   11,829.96   3:17:10
    Audit       08/25/19 16:16:02.34   08/25/19 17:32:28.84   4,586.50    1:16:26
    Finish      08/25/19 17:32:28.84   08/25/19 17:32:28.84   0.00        0:00:00
    Reported    08/25/19 12:58:52.38   08/25/19 …

                Start                  End                    (sec.)      (hh:mm:ss)
    Power       08/25/19 19:57:45.09   08/26/19 00:25:58.52   16,093.44   4:28:13
    Thruput-1   08/26/19 00:25:58.54   08/26/19 15:29:03.18   54,184.64   15:03:05
    Thruput-2   08/26/19 15:50:19.48   08/27/19 07:18:28.56   55,689.08   15:28:09
    DM-1        08/26/19 15:29:03.20   08/26/19 15:50:19.46   1,276.26    0:21:16
    DM-2        08/27/19 07:18:28.58   08/27/19 …

                Start                  End                    (sec.)      (hh:mm:ss)
    Pt - 0      08/25/19 19:57:45.09   08/26/19 00:25:58.52   16,093.44   4:28:13
    Tt1 - 1     08/26/19 00:25:58.54   08/26/19 15:27:16.22   54,077.69   15:01:18
    Tt1 - 2     08/26/19 00:25:58.54   08/26/19 15:29:03.18   54,184.64   15:03:05
    Tt1 - 3     08/26/19 00:25:58.54   08/26/19 15:28:20.11   54,141.57   15:02:22
    Tt1 - 4     08/26/19 00:25:58.54   08/26/19 15:17:44.17   53,505.63   14:51:46
    Tt2 - 5     08/26/19 15:50:19.48   08/27/19 07:18:28.56   55,689.08   15:28:09
    Tt2 - 6     08/26/19 15:50:19.48   08/27/19 07:04:23.97   54,844.49   15:14:04
    Tt2 - 7     08/26/19 15:50:19.48   08/27/19 06:58:46.02   54,506.54   15:08:27
    Tt2 - 8     08/26/19 15:50:19.48   08/27/19 06:47:42.76   53,843.28   14:57:23
    DMt1 - 1    08/26/19 15:29:03.20   08/26/19 15:40:16.00   672.79      0:11:13
    DMt1 - 2    08/26/19 15:40:16.00   08/26/19 15:50:19.46   603.46      0:10:03
    DMt2 - 3    08/27/19 07:18:28.58   08/27/19 07:28:55.60   627.02      0:10:27
    DMt2 - 4    08/27/19 07:28:55.61   08/27/19 07:39:20.93   625.32      0:10:25
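The metric details above can be tied back to the reported throughput. The sketch below assumes the primary-metric formula of the TPC-DS specification as commonly stated, QphDS@SF = floor(SF * Q / (T_pt * T_tt * T_dm * T_ld)^(1/4)), with T_ld = 0.01 * streams * T_load and T_pt = streams * T_power, all in hours. These definitions are an assumption of this example, not quoted from the report, although they are consistent with the hour values listed. The function names are illustrative.

```python
import math


def hours(seconds: float) -> float:
    """Convert seconds to hours."""
    return seconds / 3600.0


def qphds(sf, q, t_pt, t_tt, t_dm, t_ld) -> int:
    """Assumed TPC-DS primary metric: SF * Q over the geometric mean of the four times."""
    return math.floor(sf * q / (t_pt * t_tt * t_dm * t_ld) ** 0.25)


STREAMS = 4
QUERIES = 396  # 4 streams x 99 queries

# Hour values derived from the second values in the table (assumed definitions).
t_ld = 0.01 * STREAMS * hours(11_830.0)
t_pt = STREAMS * hours(16_093.5)
t_tt = hours(54_184.7 + 55_689.1)
t_dm = hours(1_276.3 + 1_252.4)

derived = qphds(100_000, QUERIES, t_pt, t_tt, t_dm, t_ld)

# Same formula fed with the hour values exactly as printed in the table.
from_reported_hours = qphds(100_000, QUERIES, 17.8817, 30.5205, 0.7025, 0.1315)

print(derived, from_reported_hours)
```

Both results land within a small fraction of a percent of the reported 14,861,137 QphDS@100000GB. The residual difference comes from rounding in the printed second and hour values.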

Timing Intervals for Each Query (In Seconds)
(Table with one row per query and one column per stream; the individual values are not recoverable from this transcription.)

Timing Intervals for Each Refresh Function (In Seconds)
(Table with one row per DM function, LF_CR, LF_CS, LF_I, LF_SR, LF_SS, LF_WR, LF_WS, DF_CS, DF_SS, DF_WS and DF_I, and one column per refresh run; the individual values are not recoverable from this transcription.)

Preface

TPC Benchmark™ DS Overview

The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general purpose decision support system.

This benchmark illustrates decision support systems that:
• Examine large volumes of data;
• Give answers to real-world business questions;
• Execute queries of various operational requirements and complexities (e.g., ad-hoc, reporting, iterative OLAP, data mining);
• Are characterized by high CPU and IO load;
• Are periodically synchronized with source OLTP databases through database maintenance functions.
• Run on “Big Data” solutions, such as RDBMS as well as Hadoop/Spark based systems.

A benchmark result measures query response time in single user mode, query throughput in multi user mode and data maintenance performance for a given hardware, operating system, and data processing system configuration under a controlled, complex, multi-user decision support workload.

The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. To achieve that purpose, TPC benchmark specifications require benchmark tests be implemented with systems, products, technologies and pricing that:
a) Are generally available to users;
b) Are relevant to the market segment that the individual TPC benchmark models or represents (e.g., TPC-DS models and represents complex, high data volume, decision support environments);
c) Would plausibly be implemented by a significant number of users in the market segment modeled or represented by the benchmark.

In keeping with these requirements, the TPC-DS database must be implemented using commercially available data processing software, and its queries must be executed via SQL interface.

The use of new systems, products, technologies (hardware or software) and pricing is encouraged so long as they meet the requirements above. Specifically prohibited are benchmark systems, products, technologies or pricing (hereafter referred to as "implementations") whose primary purpose is performance optimization of TPC benchmark results without any corresponding applicability to real-world applications and environments. In other words, all "benchmark special" implementations, which improve benchmark results but not real-world performance or pricing, are prohibited.

TPC benchmark results are expected to be accurate representations of system performance. Therefore, there are specific guidelines that are expected to be followed when measuring those results. The approach or methodology to be used in the measurements are either explicitly described in the specification or left to the discretion of the test sponsor.

When not described in the specification, the methodologies and approaches used must meet the following requirements:
• The approach is an accepted engineering practice or standard;
• The approach does not enhance the result;
• Equipment used in measuring the results is calibrated according to established quality standards;
• Fidelity and candor is maintained in reporting any anomalies in the results, even if not specified in the benchmark requirements.

Further information is available at http://www.tpc.org/

General Items

0.1 Test Sponsor

A statement identifying the benchmark sponsor(s) and other participating companies must be provided.

This benchmark was sponsored by Alibaba Cloud Computing Ltd.

0.2 Parameter Settings

Settings must be provided for all customer-tunable parameters and options which have been changed from the defaults found in actual products, including but not limited to:
• Database Tuning Options
• Optimizer/Query execution options
• Query processing tool/language configuration parameters
• Recovery/commit options
• Consistency/locking options
• Operating system and configuration parameters
• Configuration parameters and options for any other software component incorporated into the pricing structure
• Compiler optimization options

This requirement can be satisfied by providing a full list of all parameters and options, as long as all those which have been modified from their default values have been clearly identified and these parameters and options are only set once.

The Supporting File Archive (Clause 8) contains the Operating System and DBMS parameters used in this benchmark.

0.3 Configuration Diagrams

Diagrams of both measured and priced configurations must be provided, accompanied by a description of the differences. This includes, but is not limited to:
• Number and type of processors
• Size of allocated memory, and any specific mapping/partitioning of memory unique to the test. Number and type of disk units (and controllers, if applicable).
• Number of channels or bus connections to disk units, including their protocol type.
• Number of LAN (e.g. Ethernet) Connections, including routers, workstations, terminals, etc., that were physically used in the test or are incorporated into the pricing structure.
• Type and the run-time execution location of software components (e.g., DBMS, query processing tools/languages, middle-ware components, software drivers, etc.).

Measured Configuration

Figure 0.3: Measured Configuration

The measured configuration consisted of 41 nodes:

Master node details (1 node):
• ECS Instance Type: ecs.hfg5.6xlarge
• Processors/Cores/Threads: 1/12/24
• Processor Model: Intel(R) Xeon(R) Gold 6149 CPU @ 3.10GHz, 22 MB L3
• Memory: 96 GB
• Storage:
    - 3 x 100 GB SSD Cloud Disk (data disk)
    - 1 x 100 GB SSD Cloud Disk (boot disk)
• Network:
    - Bandwidth (Gbit/s): 4.5
    - Packet forwarding rate (Thousand pps): 2,000
    - NIC queues: 6
    - ENIs: 8

Worker nodes details (40 nodes):
• ECS Instance Type: ecs.i2g.16xlarge
• Processors/Cores/Threads: 1/32/64
• Processor Model: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz, 33 MB L3
• Memory: 256 GB
• Storage:
    - 4 x 1788 GB NVMe SSD Local Disk (data disk)
    - 1 x 100 GB Ultra Cloud Disk (boot disk)
• Network:
    - Bandwidth (Gbit/s): 10.0
    - Packet forwarding rate (Thousand pps): 4,000
    - NIC queues: 16
    - ENIs: 8

EMR System Components (placement table covering nodes 1-40; the per-node "x" marks are not recoverable from this transcription): Resource Manager, Spark Thrift Server, Node Manager, Spark Executor.

Priced Configuration

There are no differences between the priced and measured configurations.

Clause 2: Logical Database Design Related Items

2.1 Database Definition Statements

Listings must be provided for the DDL scripts and must include all table definition statements and all other statements used to set up the test and qualification databases.

The Supporting File Archive contains the table definitions and all other statements used to set up the test and qualification databases.

2.2 Physical Organization

The physical organization of tables and indices within the test and qualification databases must be disclosed. If the column ordering of any table is different from that specified in Clause 2.3 or 2.4, it must be noted.

The store_sales, store_returns, catalog_sales, catalog_returns, web_sales, web_returns and inventory tables are partitioned. The partition columns for these tables are, respectively, ss_sold_date_sk, sr_returned_date_sk, cs_sold_date_sk, cr_returned_date_sk, ws_sold_date_sk, wr_returned_date_sk and inv_date_sk.

2.3 Horizontal Partitioning

If any directives to DDLs are used to horizontally partition tables and rows in the test and qualification databases, these directives, DDLs, and other details necessary to replicate the partitioning behavior must be disclosed.

Horizontal partitioning is used on the store_sales, store_returns, catalog_sales, catalog_returns, web_sales, web_returns and inventory tables, and the partitioning columns are ss_sold_date_sk, sr_returned_date_sk, cs_sold_date_sk, cr_returned_date_sk, ws_sold_date_sk, wr_returned_date_sk and inv_date_sk. The partition granularity is by day.

2.4 Replication

Any replication of physical objects must be disclosed and must conform to the requirements of Clause 2.5.3.

All objects are stored in HDFS with 3-way replication.
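The table-to-partition-column pairing described in 2.2 and 2.3 can be written down as a simple mapping. The sketch below is illustrative only: the Hive-style `column=value` directory naming is an assumption about the on-disk layout, and the date key used in the example is a hypothetical value. The actual DDL is in the Supporting File Archive.

```python
# Partitioned tables and their partition columns, as listed in 2.2 and 2.3.
PARTITION_COLUMNS = {
    "store_sales": "ss_sold_date_sk",
    "store_returns": "sr_returned_date_sk",
    "catalog_sales": "cs_sold_date_sk",
    "catalog_returns": "cr_returned_date_sk",
    "web_sales": "ws_sold_date_sk",
    "web_returns": "wr_returned_date_sk",
    "inventory": "inv_date_sk",
}


def partition_dir(table: str, date_sk: int) -> str:
    """Build a Hive-style partition directory name (assumed layout) for one day's rows."""
    return f"{table}/{PARTITION_COLUMNS[table]}={date_sk}"


# With day-level granularity, each distinct date key maps to its own partition.
print(partition_dir("store_sales", 2451545))  # 2451545 is a hypothetical date key
```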

Clause 3: Scaling and Database Population

3.1 Initial Cardinality of Tables

The cardinality (e.g., the number of rows) of each table of the test database, as it existed at the completion of the database load (see Clause 7.1.2) must be disclosed.

Table 3.1 lists the cardinality of each table as it existed upon completion of the build.

Table 3.1 Initial Number of Rows

    Table Name               Row Count
    call_center              60
    catalog_page             50,000
    catalog_returns          14,398,600,958
    catalog_sales            143,996,902,621
    customer                 100,000,000
    customer_address         50,000,000
    customer_demographics    1,920,800
    date_dim                 73,049
    household_demographics   7,200
    income…                  …500
    reason                   75
    ship_mode                20
    store                    1,902
    store_returns            28,794,006,308
    store_sales              288,004,741,709
    time_dim                 86,400
    warehouse                30
    web_page                 5,004
    web_returns              7,199,013,936
    web_sales                71,997,629,096
    web_site                 96

3.2 Distribution of Tables and Logs Across Media

The distribution of tables and logs across all media must be explicitly described using a format similar to that shown in the following example for both the tested and priced systems.

Table 3.2 Distribution of Tables and Logs

    Server Node: emr-header-1
    Disk Type: SSD Cloud Disk
    Disk drive: /dev/vdb (/mnt/disk1)
    Description of Content: logs

    Server Node: emr-header-1
    Disk Type: SSD Cloud Disk
    Disk drive: /dev/vd{c,d} (/mnt/disk2 RAID-1)
    Description of Content: Hive metadata and HDFS metadata

    Server Node: emr-worker-{1 - 40}
    Disk Type: Local SSD Disk
    Disk drive: /dev/vd{b,c,d,e} (/mnt/disk[1-4])
    Description of Content: logs, temp files, cache, replica of table data (See Section 3.4)

    Server Node: emr-header-1
    Disk Type: SSD Cloud Disk
    Disk drive: /dev/vda
    Description of Content: Operating system, root directory, EMR software

    Server Node: emr-worker-{1 - 40}
    Disk Type: Ultra Cloud Disk
    Disk drive: /dev/vda
    Description of Content: Operating system, root directory, EMR software

All table contents were on HDFS. Table size on HDFS:
177.1 K hdfs://emr-header-1:9000
