A Quantitative Analysis of High Performance Computing with Amazon's EC2


Preliminary version. Final version appears in Proceedings of the 10th IEEE/ACM International Conference on Grid Computing (Grid 2009), Oct 13-15 2009, Banff, Alberta, Canada.

A Quantitative Analysis of High Performance Computing with Amazon's EC2 Infrastructure: The Death of the Local Cluster?

Zach Hill* and Marty Humphrey**
Department of Computer Science, University of Virginia, Charlottesville, VA 22904
* zach.hill@email.virginia.edu
** humphrey@cs.virginia.edu

Abstract

The introduction of affordable infrastructure on demand, specifically Amazon's Elastic Compute Cloud (EC2), has had a significant impact in the business IT community and provides reasonable and attractive alternatives to locally-owned infrastructure. For scientific computation, however, the viability of EC2 has come into question due to its use of virtualization and network shaping and the performance impacts of both. Several works have shown that EC2 cannot compete with a dedicated HPC cluster utilizing high-performance interconnects, but how does EC2 compare with the smaller departmental and lab-sized commodity clusters that are often the primary computational resource for scientists? To answer that question, MPI and memory bandwidth benchmarks are executed on EC2 clusters of each of the 64-bit instance types, comparing the performance of a 16-node cluster of each against a dedicated locally-owned commodity cluster based on Gigabit Ethernet. The analysis of the results shows that while EC2 does experience reduced performance, it is still viable for smaller-scale applications.

1. Introduction

The introduction of virtualized remote on-demand computing resources at reasonable prices, Amazon's Elastic Compute Cloud (EC2) in particular, has made many people question how computing should be delivered in the future. Will it be cheaper to simply lease time on remote resources rather than purchasing and maintaining your own? The emerging infrastructure-provider segment has been generally focused on business users and on hosting web applications and services, but some researchers have begun to look at the cloud as a viable solution for scientific computing as well [1][2][7][8][10][11].

Utilizing cloud services for scientific computing opens up a new capability for many scientists: on-demand clusters with no queuing front-end and a nearly-zero cost of entry. This allows scientists to satisfy their high-performance computing needs without entering their jobs in a queue with an hour- or day-long wait and without the debugging headaches associated with queue-based systems. While this is certainly an exciting prospect, it must be recognized that cloud resources are not equivalent in processing power and network capability to custom-built high-performance systems such as the large clusters at the nation's supercomputing centers. For example, Walker has shown that EC2 simply cannot compete in terms of raw network performance with a dedicated high-performance cluster at NCSA called Abe [2].

But, to date, there is an open issue that centers on the thousands of researchers who currently rely on their own small clusters: how does EC2's performance compare against their existing (smaller, less cutting-edge, and much less expensive) clusters? These scientists typically do not use national-scale supercomputers for a number of reasons, most notably potentially idiosyncratic operating environments (e.g., installed libraries), security issues, an unfamiliar queuing system, a generally daunting list of queued jobs, and a general belief that these shared clusters exist for "large jobs only". Even those scientists who do use national-scale supercomputers for "production jobs" tend to do primary development, testing, and debugging on smaller clusters before switching to the shared supercomputers.

EC2 resources are particularly attractive for this purpose because of the pay-as-you-go model and the complete flexibility in the software stack. Thus, the question remains: ignoring the potential benefits of not having to run, upgrade, and generally maintain a local cluster, does EC2 provide good enough performance to compete with local clusters owned by individual researchers? It is arguably "intuitive" to believe that EC2 does not, given its well-known use of virtualization and network shaping, but is this actually true? The answer to this question could have profound impact on the broad research community.

To answer these questions we have quantitatively evaluated EC2's performance for the most common type of scientific application: the MPI program. We have run benchmarks of the memory systems and networks to compare Amazon's performance to that of representative local departmental clusters. The purpose of the benchmarks is to evaluate EC2's specific performance capabilities relevant to a commodity cluster. Our results show that EC2 competes surprisingly well with a commodity, Gigabit Ethernet-based cluster in terms of memory performance and MPI performance. While we agree that the network performance of EC2 is not equivalent to that of a high-performance dedicated cluster with Myrinet or Infiniband interconnects, in terms of latency in particular, it does provide enough performance to be useful to the large class of scientific users who rely on the small to mid-sized departmental and lab commodity clusters often found in academia.

The rest of this paper is organized as follows: the Related Work section discusses other evaluations of the performance of EC2 and of virtualized environments in general; the Evaluation Setup section describes how the EC2 cluster was set up and configured, as well as the configuration of the commodity clusters used as baselines; the Evaluation section presents our results from the various performance benchmarks; the Discussion section addresses some observations made during this work as well as problems encountered and possible future work; finally, we give our conclusions on using EC2 clusters for MPI-based HPC computation.

2. Related Work

Edward Walker's work [2] examines the feasibility of using EC2 for HPC, but compares it to high-end compute clusters at NCSA. This comparison pits EC2 against high-end clusters utilizing Infiniband interconnects. His work focuses on network latency and bandwidth in particular and unsurprisingly finds that the NCSA cluster has network latency more than an order of magnitude lower than that seen at EC2. We focus our evaluation on comparing EC2-based clusters to commodity clusters utilizing Gigabit Ethernet interconnects, as would more likely be found in departmental and research-lab sized systems.

C. Evangelinos and C. Hill performed an evaluation of running coupled ocean-atmosphere simulations on EC2 [1]. Their work focuses on 32-bit applications, and as such they examine only two of the five EC2 instance types. We have focused primarily on 64-bit platforms, as these are used for the majority of scientific computing. We also compare the performance of the various EC2 instance types against each other to get a larger picture of EC2's offerings.

Others have evaluated EC2 and its associated blob storage, the Simple Storage Service (S3) [7], for applicability to data-intensive eScience workloads [8][9]. We do not consider the I/O component in our evaluation and instead focus on CPU- and network-bound workloads based on message passing. We also do not primarily consider the question of the cost-effectiveness of cloud-based clusters over locally-owned clusters. While we do feel that a performance-per-dollar comparison is valuable when deciding how to implement infrastructure, we leave that for future work as it can be very application-specific.

The performance of virtualized systems such as Xen [10] has been studied extensively, and Xen has been shown to impose negligible overheads in both micro- and macro-benchmarks [11][12]. However, these were evaluations of Xen itself in a controlled cluster environment, whereas we are evaluating Amazon's specific implementation and customization of Xen and their overall product offering, including networking. Amazon's multiplexing of physical resources and networks introduces sources of performance limits not found in these previous works.

3. Evaluation Setup

Our experimental setup included clusters composed of each EC2 64-bit instance type as well as our local 32-bit cluster. For creating and managing our EC2 clusters we utilized an existing project which provides Python scripts to handle most operations. We describe that project and the specifics of the EC2 instance types below. We also describe the configuration of the local resources that we compare the EC2 results against.

3.1. EC2 Overview

Amazon's EC2 service has become the standard-bearer for Infrastructure-as-a-Service (IaaS) providers. It is the most popular and provides many different service levels. Machine instances are available in the following configurations:

Instance Type  CPU (ECU*/core)  Mem (GB)  Disk (GB)  I/O Perf.  Cost/Inst-hr.**  Platform
M1.Small       1                1.7       160        Moderate   $0.10            32-bit, 1 core
M1.Large       2                7.5       850        High       $0.40            64-bit, 2 cores
M1.XLarge      2                15        1690       High       $0.80            64-bit, 4 cores
C1.Medium      2.5              1.7       350        Moderate   $0.20            32-bit, 2 cores
C1.XLarge      2.5              7         1690       High       $0.80            64-bit, 8 cores

Table 1. Amazon EC2 Instance Types
* 1 ECU (EC2 Compute Unit) is equivalent to the capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor [6].
** Indicates Linux pricing only; Windows Server prices are higher.

Amazon utilizes a customized version of Xen to host the AMIs. The instance operating systems available are: Windows Server 2003, Red Hat Linux, Fedora Core, openSUSE, Gentoo, Oracle Enterprise Linux, Ubuntu, and Debian Linux. While each instance has storage associated with it, the local instance storage is not persistent across instance termination, so other solutions, such as Amazon's Simple Storage Service (S3) or Elastic Block Store (EBS), are required for persistent data. However, for this work we did not utilize either of those services, since we are only evaluating the network and processing capabilities of EC2 rather than its storage capabilities.

In terms of service agreement, EC2's service agreement states an availability level of 99.95%, and Amazon will credit your account if availability falls below that level during a 365-day period. Amazon also claims an internal network bandwidth of 250 Mbps regardless of instance type, although this is not included in the official instance type specification.

After working to build our own Amazon Machine Images (AMIs), we found a project called ElasticWulf [3], which, in addition to providing a pair of AMIs with multiple MPI libraries (MPICH2, LAM, and OpenMPI) and OpenMP already installed, also includes scripts to start, stop, and monitor a cluster of EC2 machines. ElasticWulf requires only the installation of the Amazon command-line tools and API libraries. It includes the basics to get a cluster up and running with a shared NFS directory, the MPI runtime, various Python MPI libraries and tools, and Ganglia for monitoring the cluster via the web. The specific AMIs it uses are based on Fedora Core 6 and are 64-bit; the AMI numbers are ami-e813f681 and ami-eb13f682 for the master node and the worker nodes respectively. ElasticWulf also works with any AMI, not just the supplied ones. There are also alternative 32-bit images provided, although they do not include the NFS share and Ganglia. Our specific configuration used MPICH2 (mpicc version 1.0.6) on top of Fedora Core 6, with gcc 4.1.2 for the x86_64 architecture.

For our MPI tests we constructed a cluster of each of the 64-bit instance types. We feel that while the 32-bit instances may be more cost-effective, 64-bit is the standard for scientific computation, as well as providing substantially more memory, which is often a large performance enhancer for CPU-bound applications. For the memory bandwidth tests we utilized a single node of each instance type.

3.2. Local Resources

To give context and a baseline to the EC2 performance numbers, we ran the same benchmarks on our locally-owned cluster. The Sunfire cluster is composed of nodes with 32-bit Intel Xeon processors and has the following specifications: 2 physical Intel Xeon 2.80 GHz CPUs with HyperThreading (CPU family 15, model 2, stepping 7); 512 KB L2 cache; 3 GB SDRAM; 400 MHz FSB; and a Gigabit Ethernet NIC. The mpicc version was 1.0.8p1 and gcc was version 4.2.4. The platform was x86 32-bit and the MPI library used was OpenMPI 1.3 [15].

The other locally-owned resource we use for comparison is Camillus. Camillus is a 64-bit dual-CPU Intel Xeon E5345 quad-core machine (8 cores total) with a 1333 MHz FSB, 16 GB DDR2 RAM, and a Gigabit Ethernet NIC. We use this machine for comparison in the memory-bandwidth benchmarks since it represents current CPU designs better than the Sunfire cluster nodes do.
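Table 1's pricing makes it straightforward to estimate what an on-demand cluster costs to run. The sketch below is illustrative only: the per-hour prices are the 2009 Linux figures from Table 1, and the `cluster_cost` helper is ours, not part of any Amazon tool (EC2 billed in whole instance-hours at the time, so partial hours round up).

```python
import math

# 2009 Linux per-instance-hour prices from Table 1 (illustrative of that era only).
PRICE_PER_HOUR = {
    "m1.small":  0.10,
    "m1.large":  0.40,
    "m1.xlarge": 0.80,
    "c1.medium": 0.20,
    "c1.xlarge": 0.80,
}

def cluster_cost(instance_type, nodes=16, hours=1.0):
    """Cost in USD of running `nodes` instances for `hours`.

    Partial hours are rounded up, mirroring per-instance-hour billing.
    """
    billed_hours = math.ceil(hours)
    return nodes * PRICE_PER_HOUR[instance_type] * billed_hours

# A 16-node c1.xlarge cluster (128 cores) for an 8-hour workday:
print(cluster_cost("c1.xlarge", nodes=16, hours=8))  # 102.4
```

At these rates, the 16-node clusters benchmarked in this paper cost between a few dollars and about thirteen dollars per hour, which is the economic backdrop for the performance comparisons that follow.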
4. Evaluation

To evaluate the performance of EC2 clusters against that of a small commodity cluster, we created clusters of each EC2 instance type and benchmarked their performance using the STREAM memory bandwidth benchmark and Intel's MPI Benchmarks version 3.2 [14]. We did not utilize a CPU ALU-operation performance benchmark because Xen's CPU performance has been studied extensively, and in scientific applications which require a cluster the performance limiter is usually the interconnect or the memory bandwidth of each node, since CPU performance has increased much more rapidly than either network or memory performance.

4.1. Memory Bandwidth

Many scientific applications involve operations on large amounts of data stored in memory. Thus, it is important to evaluate the memory bandwidth of the EC2 instance types in order to see how they compare to non-virtualized resources. In this case we did not directly compare an EC2 instance's performance to the performance of the same CPU in a non-virtualized environment. Others have examined the overhead of virtualization on performance [11][12], and part of the abstraction of EC2's "cloud" paradigm is that the underlying resources can change arbitrarily, so no single processor can be designated as the EC2 standard.

The results of running STREAM with an array length of 10 million integers follow:

Machine      Copy    Scale   Add     Triad
M1.Large     2.058   1.777   1.868   1.725
M1.XLarge    2.551   2.394   2.434   2.178
C1.Medium    2.865   2.852   3.114   3.097
C1.XLarge    2.849   2.840   3.126   3.120
Camillus     2.834   2.830   3.171   3.160
Sunfire      1.341   1.325   1.663   1.662

Table 2. Single-Thread CPU Memory Bandwidth in GB/s

The performance of the EC2 nodes is similar to that of Camillus, which has the most modern processor of the locally-owned machines. We see a definite advantage from the newer DDR and DDR2 memory architectures and the higher front-side bus clock rates of the EC2 nodes, as opposed to the older non-DDR memory of the Sunfire cluster and its 400 MHz FSB.

Machine      Copy    Scale   Add     Triad   N
M1.Large     3.244   3.186   3.564   3.508   2
M1.XLarge    3.748   3.936   3.717   3.714   4
C1.Medium    4.241   4.494   4.840   4.796   2
C1.XLarge    4.807   4.788   5.149   5.161   8
Camillus     4.653   4.661   4.895   5.007   8
Sunfire      1.441   1.458   1.626   1.622   2*

Table 3. N-Thread (one per core) CPU Memory Bandwidth in GB/s. * Sunfire is a Xeon processor with HyperThreading, not 2 physical cores; these results are only for comparison.

It is worth noting here that we are running 64-bit programs, not 32-bit as Evangelinos and Hill did; thus, there are some discrepancies in the bandwidth numbers. However, we see that the EC2 nodes perform as well as or better than both the Sunfire and Camillus nodes in most of the tests. It is no surprise that they outperform the Sunfire nodes due to their newer memory (DDR2), but they also hold their own against Camillus, which is a relatively new machine itself. This shows that we can expect reasonable to good memory bandwidth performance even though the machines are virtualized.

4.2. Intel MPI Benchmarks v3.2

The most useful measure of the performance of a cluster is how well it performs on the specific code it will be used for. Since a large proportion of scientific applications utilize MPI for inter-process communication, we have tested the various EC2 instance types using the Intel MPI Benchmarks (IMB) version 3.2 [14]. We present here the results for the IMB-MPI1 suite of benchmarks, which evaluates the MPI v1 specification. We believe that the operations common to most applications are covered in this suite, and that the extensions provided in the MPI-2 specification, while useful, would not paint a significantly different picture of the relative performance of EC2 instances compared to our local cluster. Thus, we have omitted those results for both brevity and clarity.

Each of the benchmarks presented here measures the average latency for messages of a given size. Each data point is an average of multiple runs (1000 for the smaller data points up to 32 KB, down to 10 for the 4 MB messages). These averages are reported by the IMB code itself. We do not present the minimums and maximums, for the sake of clarity.

There are three classes of benchmark: Single Transfer, Parallel Transfer, and Collective. The PingPong and PingPing benchmarks (Figures 1 - 4) compose the Single Transfer class, SendRecv and Exchange (Figures 5 - 8) compose the Parallel Transfer class, and the remaining benchmarks (Figures 9 - 17) compose the Collective class. More information about each benchmark, including the specific communication pattern of each, can be found on the Intel MPI Benchmarks website [14].

Each of the benchmarks was run such that only a single MPI process ran on each node. Thus, even though some EC2 instance types have multiple virtual cores, we did not match the number of MPI processes to the number of cores on the virtual machine: we are focusing on network performance, and co-location of MPI processes on nodes would not measure that accurately. Also, we were interested to see whether requesting larger node types might reduce the possibility of being co-located with another user's instances and thus reduce or eliminate contention for the I/O system, including the network.

In the following figures we show the results of running each benchmark on a cluster of 16 nodes of each EC2 64-bit instance type as well as on our local cluster. The Single Transfer benchmarks (Figures 1 - 4) utilize only 2 nodes in each cluster for their measurements, but all others utilize the full 16 nodes. We also conducted runs of the benchmarks using 4 and 8 nodes of each cluster, but found that the results were not significantly different from those with 16 nodes.

Figure 1. Average Latency of PingPong
Figure 2. Average Latency of PingPing
Figure 3. Bandwidth of PingPong
Figure 4. Bandwidth of PingPing

In the Single Transfer benchmarks (Figures 1 - 4) we see that the EC2 instance type is not a great determiner of performance. The message latency of the EC2 clusters, while not as low as that of the Sunfire cluster, is generally less than 2X higher and well within an order of magnitude.

Figure 5. Average Latency of SendRecv
Figure 6. Average Latency of Exchange
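To make the memory-bandwidth metric of Section 4.1 concrete, the following is a minimal Python sketch of the STREAM "triad" kernel. The real benchmark is McCalpin's C code; an interpreted version like this vastly understates hardware bandwidth and is shown only to illustrate how the GB/s figure is derived (bytes moved divided by elapsed time).

```python
import time

def stream_triad(n=1_000_000, scalar=3.0):
    """Illustrative version of the STREAM 'triad' kernel: a[i] = b[i] + s*c[i].

    Returns (result array, effective GB/s). Not the official benchmark;
    the number reported here reflects interpreter overhead, not raw DRAM speed.
    """
    b = [1.0] * n
    c = [2.0] * n
    t0 = time.perf_counter()
    a = [b[i] + scalar * c[i] for i in range(n)]
    elapsed = time.perf_counter() - t0
    # Triad touches three arrays of 8-byte doubles per element: read b, read c, write a.
    gbytes = 3 * 8 * n / 1e9
    return a, gbytes / elapsed

a, gbps = stream_triad(100_000)
print(a[0])  # 7.0, since 1.0 + 3.0 * 2.0 == 7.0
```

The Copy, Scale, and Add kernels reported in Tables 2 and 3 differ only in the per-element operation and in whether two or three arrays are touched.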

Figure 7. Bandwidth of SendRecv
Figure 8. Bandwidth of Exchange

The Parallel Transfer benchmarks (Figures 5 - 8) show that again all EC2 clusters trail the local cluster, Sunfire, in terms of latency, but not by large margins. The spike seen in Figure 8 at the 2 MB message size for the EC2 High-CPU XLarge instance-type cluster is interesting, in that we do not expect I/O performance-isolation issues with this instance type: it occupies 8 cores, which would fill a dual-CPU quad-core server. This abnormality could therefore be due either to network contention (most likely) or to performance isolation between hosted VMs still being an issue even when utilizing 8 virtual cores. Note that these results are averaged over 6 runs as opposed to the 20 normally used; this indicates that the benchmark had difficulty completing all the repetitions, which most likely points to network problems.

Figure 9. Average Latency of AllReduce
Figure 10. Average Latency of Reduce
Figure 11. Average Latency of Reduce-scatter
Figure 12. Average Latency of AllGather
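IMB reports one averaged latency per message size, so anomalies such as the 2 MB spike stand out as outliers relative to neighbouring sizes. The helpers below are hypothetical (not part of IMB) and only sketch that kind of post-processing of per-repetition timings.

```python
def average_latencies(samples_by_size):
    """samples_by_size: {message_bytes: [latency_us, ...]} -> {bytes: mean latency}.

    Mirrors how IMB reports a single averaged figure per message size."""
    return {size: sum(v) / len(v) for size, v in samples_by_size.items()}

def flag_spikes(avg_by_size, factor=3.0):
    """Flag message sizes whose average latency grows more than `factor`
    times what the size increase alone would explain -- a crude screen
    for anomalies like a sudden spike at one size."""
    sizes = sorted(avg_by_size)
    flagged = []
    for prev, cur in zip(sizes, sizes[1:]):
        if avg_by_size[cur] > factor * avg_by_size[prev] * (cur / prev):
            flagged.append(cur)
    return flagged

avgs = average_latencies({1024: [50.0, 52.0], 2048: [100.0], 4096: [900.0]})
print(flag_spikes(avgs))  # [4096]
```

A screen like this is useful when repetition counts drop (as in the 6-run case above), since a single slow repetition then dominates the reported average.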

Figure 13. Average Latency of AlltoAll
Figure 14. Average Latency of Scatter
Figure 15. Average Latency of Gather
Figure 16. Average Latency of Bcast
Figure 17. Average Latency of Barrier

The results for the Collective class of benchmarks generally show that the local Sunfire cluster still has a distinct advantage in network latency, and that while the EC2 "High-CPU XLarge" instance type usually has the highest performance of the EC2 clusters, it is not a significantly better performer despite its virtual-CPU clock-rate advantage. Also of note across all the results is that the "XLarge" EC2 instance type was almost always the worst performer, despite having the same virtual clock rate as the "Large" instance type and twice as many virtual cores. While we did not expect the core count to directly impact the MPI results, because we did not schedule multiple processes on a single node, we did hope to see that requesting more cores would improve performance by reducing the possibility of a co-hosted instance from another user interfering with the I/O of our instances. This did not appear to be the case, however.

5. Discussion

There are several influences on performance which are of concern in virtualized environments such as EC2. These include cache behavior, buffer-copy costs, and I/O sharing between instances. Cache behavior is critically important to the performance of programs, and good cache management can result in significant performance gains. The difficulty in a virtualized environment is that it is not clear how the caches are shared and whether cache pollution from other VMs is possible. This is particularly the case for multi-core CPUs with shared L2 caches, such as most Intel processors. Since the caches are not explicitly controlled by software (either the application or the operating system), this sharing cannot be managed by the VMM. Thus, in a pathological case where another VM instance is running on the same physical machine and using a large portion of the shared cache, a user may see significantly slower performance than he would on a dedicated machine.

One potential solution to this is simply to pay for a larger instance in EC2, although our results did not reinforce this idea. The larger instances occupy more of the CPUs and cores than the smaller instances do. For example, the C1.XLarge instance type has 8 cores, which presumably corresponds to a single physical machine with two quad-core processors dedicated solely to that instance. Thus, even if you do not need more cores, by reserving them you may be able to stabilize performance, since other VMs will not be hosted on the same physical machine. However, this cannot be guaranteed in the future: as physical machines gain more cores, co-hosting may again take place even for the large instance types.

A similar problem exists with the I/O subsystem of a physical machine. It necessarily must be multiplexed across the instances being hosted; thus, one VM's I/O activity may affect another user's I/O performance. It is not clear whether Amazon has addressed this in their version of the Xen hypervisor, but doing so would require controlling I/O request routing to the hardware in the host operating system. The same solution as for the cache behavior may work in this case as well: by reserving larger instances you limit the amount of external interference that your VM instance can receive.

6. Conclusions

The emergence of EC2 and other cloud resource-hosting platforms has enabled scientists to create clusters of machines on demand and use them for small- to medium-scale computational science problems. We showed that while EC2 clusters are not the highest performers, they do provide reasonable performance which, when coupled with their low cost and ease of use, may provide an attractive alternative to dedicated clusters.

EC2 is not the best platform for tightly-coupled synchronized programs with frequent but small communication between nodes; the high latency kills performance. However, the bandwidth available between nodes suggests that less frequent but quite large data exchanges are acceptable, and thus redundant computation may be a way to extract extra performance.

In all, EC2 is not a high-performance system that will replace specialized clusters any time soon, but it does offer on-demand capabilities which are very useful for debugging and smaller-scale computations. Using preexisting tools we were able to create and set up a cluster within minutes, using only three shell scripts. This is the beauty of EC2: its configurability and ease of use. We believe it would make a suitable small-scale cluster for research groups, labs, and departments.

References

[1] Constantinos Evangelinos and Chris N. Hill, "Cloud Computing for Parallel Scientific HPC Applications: Feasibility of Running Coupled Atmosphere-Ocean Climate Models on Amazon's EC2", CCA-08, 2008.
[2] Edward Walker, "Benchmarking Amazon EC2 for High-Performance Scientific Computing", Usenix ;login:, Vol. 33, No. 5, 2008.
[3] STREAM Project. http://www.cs.virginia.edu/stream/
[4] ElasticWulf Project. http://code.google.com/p/elasticwulf
[5] Amazon EC2. http://aws.amazon.com/ec2
[6] Amazon S3. http://aws.amazon.com/s3
[7] Hoffa, C., Mehta, G., Freeman, T., Deelman, E., Keahey, K., Berriman, B., Good, J. "On the Use of Cloud Computing for Scientific Workflows".
[8] Deelman, E., Singh, G., Livny, M., Berriman, B., Good, J. "The Cost of Doing Science on the Cloud: The Montage Example".
[9] Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warfield, A. "Xen and the Art of Virtualization". SOSP '03, October 2003.
[10] Youseff, L., Wolski, R., Gorda, B., Krintz, C. "Evaluating the Performance Impact of Xen on MPI and Process Execution in HPC Systems". SuperComputing '08.
[11] Youseff, L., Wolski, R., Gorda, B., Krintz, C. "Paravirtualization for HPC Systems".
[12] MPICH2.
[13] LAM MPI. http://www.lam-mpi.org/
[14] Intel MPI Benchmarks.
[15] Open MPI. http://www.open-mpi.org/
