Rocks Clusters And Object Storage - Stanford University

Transcription

Institute for Computational and Mathematical Engineering
Rocks Clusters and Object Storage
Steve Jones, Technology Operations Manager, Institute for Computational and Mathematical Engineering, Stanford University
Larry Jones, Vice President, Product Marketing, Panasas Inc.

Research Groups
- Flow Physics and Computation
- Aeronautics and Astronautics
- Chemical Engineering
- Center for Turbulence Research
- Center for Integrated Turbulence Simulations
- Thermo Sciences Division
Funding: sponsored research (AFOSR/ONR/DARPA/DURIP/ASC)

Active Collaborations with the Labs
- Buoyancy-driven instabilities/mixing: CDP for modeling plumes (Stanford/SNL)
- LES technology: complex vehicle aerodynamics using CDP (Stanford/LLNL)
- Tsunami modeling: CDP for Canary Islands tsunami scenarios (Stanford/LANL)
- Parallel I/O and large-scale data visualization: UDM integrated in CDP (Stanford/LANL)
- Parallel global solvers: HyPre library integrated in CDP (Stanford/LLNL)
- Parallel grid generation: Cubit and related libraries (Stanford/SNL)
- Merrimac, streaming supercomputer prototype (Stanford/LLNL/LBNL/NASA)

Affiliates Program

The Research: Molecules to Planets!

Tsunami Modeling
Preliminary calculations (Ward & Day, Geophysical J., 2001)

Landslide Modeling
- Extends the existing Lagrangian particle-tracking capability in CDP
- Collision model based on the distinct element method*
- Originally developed for the analysis of rock mechanics problems
* Cundall P.A., Strack O.D.L., A discrete numerical model for granular assemblies, Géotechnique 29, No. 1, pp. 47-65.

"Some fear flutter because they don't understand it, and some fear it because they do." – von Karman

9/12/97

Limit Cycle Oscillation

[Plot: torsional damping ratio (%) and torsional frequency (Hz) vs. Mach number; 3D simulation (clean wing) compared with flight test data (clean wing)]

Databases?
- Desert Storm (1991)
- Iraq War (2003)
- 400,000 configurations to be flight tested

[Plot: damping coefficient (%), 1st torsion, vs. Mach number; flight test data compared with the full-order model (FOM, 1,170 s) and TP-ROM (5 s)]

A Brief Introduction to Clustering and Rocks

Brief History of Clustering (very brief)
- NOW (Network of Workstations) pioneered the vision of clusters of commodity processors
  – David Culler (UC Berkeley), started in the early 90s
  – SunOS / SPARC
  – First generation of Myrinet, active messages
  – GLUnix (Global Unix) execution environment
- Beowulf popularized the notion and made it very affordable
  – Thomas Sterling, Donald Becker (NASA)
  – Linux
© 2005 UC Regents

Types of Clusters
- Highly Available (HA)
  – Generally small, less than 8 nodes
  – Redundant components
  – Multiple communication paths
  – This is NOT Rocks
- Visualization clusters
  – Each node drives a display
  – OpenGL machines
  – This is not core Rocks, but there is a Viz Roll
- Computing (HPC clusters)
  – AKA Beowulf
  – This is core Rocks

Definition: HPC Cluster Architecture

The Dark Side of Clusters
- Clusters are phenomenal price/performance computational engines, but:
  – They can be hard to manage without experience
  – High-performance I/O is still unresolved
  – The effort to find out where something has failed grows at least linearly with cluster size
- Not cost-effective if every cluster "burns" a person just for care and feeding
- The programming environment could be vastly improved
- Technology is changing rapidly
  – Scaling up is becoming commonplace (128-256 nodes)

Minimum Components
- Power
- Local hard drive
- Ethernet
- Server: i386 (Athlon/Pentium), x86_64 (Opteron/EM64T), or ia64 (Itanium)

Optional Components
- High-performance network
  – Myrinet
  – InfiniBand (SilverStorm or Voltaire)
- Network-addressable power distribution unit
- Keyboard/video/mouse network is not required
  – Non-commodity
  – How do you manage your network?

The Top 2 Most Critical Problems
- The largest problem in clusters is software skew
  – When the software configuration on some nodes differs from that on others
  – Small differences (minor version numbers on libraries) can cripple a parallel program
- The second most important problem is adequate job control of the parallel process
  – Signal propagation
  – Cleanup
(A quick skew check is sketched below.)
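
One quick way to look for software skew from the frontend is sketched below; it assumes only the cluster-fork command that appears later in this talk, and the exact pipeline is illustrative rather than part of the original slides.

    # Hash each node's installed-package list; a node with skewed software
    # shows up with a different checksum (sketch only).
    cluster-fork 'rpm -qa | sort | md5sum'
    # To count how many nodes share each checksum, filter out the hostname
    # lines cluster-fork prints (adjust the pattern to your node naming):
    cluster-fork 'rpm -qa | sort | md5sum' | grep -v '^compute' | sort | uniq -c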

Rocks (open-source clustering distribution)
- Technology transfer of commodity clustering to application scientists
  – "Make clusters easy"
  – Scientists can build their own supercomputers and migrate up to national centers as needed
- Rocks is a cluster on a CD
  – Red Hat Enterprise Linux (open source and free)
  – Clustering software (PBS, SGE, Ganglia, NMI)
  – Highly programmatic software configuration management
- Core software technology for several campus projects
  – BIRN, Center for Theoretical Biological Physics, EOL, GEON, NBCR, OptIPuter
- First software release Nov. 2000
- Supports x86, Opteron, EM64T, and Itanium

Philosophy
- Caring for and feeding a system is not fun
- System administrators cost more than clusters
  – A 1 TFLOP cluster is less than $200,000 (US)
  – Close to the actual cost of a full-time administrator
- The system administrator is the weakest link in the cluster
  – Bad ones like to tinker
  – Good ones still make mistakes

Philosophy (continued)
- All nodes are 100% automatically configured
  – Zero "hand" configuration
  – This includes site-specific configuration
- Run on heterogeneous, standard, high-volume components
  – Use components that offer the best price/performance
  – Software installation and configuration must support different hardware
  – Homogeneous clusters do not exist
  – Disk imaging requires a homogeneous cluster

Philosophy (continued)
- Optimize for installation
  – Get the system up quickly, in a consistent state
  – Build supercomputers in hours, not months
- Manage through re-installation (see the sketch below)
  – Can re-install 128 nodes in under 20 minutes
  – No support for on-the-fly system patching
- Do not spend time trying to maintain system consistency
  – Just re-install
  – Can be batch driven
- Uptime in HPC is a myth
  – Supercomputing sites have monthly downtime
  – HPC is not HA
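
The re-installation workflow boils down to two commands; both appear verbatim on the Panasas integration slide later in this talk, so this sketch just puts them side by side.

    # Rebuild the Rocks distribution on the frontend, then have every compute
    # node kickstart itself from it (the batch-driven re-install path).
    rocks-dist dist
    cluster-fork '/boot/kickstart/cluster-kickstart'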

Rocks Basic Approach
1. Install a frontend
   – Insert the Rocks Base CD
   – Insert Roll CDs (optional components)
   – Answer 7 screens of configuration data
   – Drink coffee (takes about 30 minutes to install)
2. Install compute nodes
   – Log in to the frontend
   – Execute insert-ethers
   – Boot each compute node with the Rocks Base CD (or PXE)
   – insert-ethers discovers the nodes
   – Go to step 3
3. Add user accounts
4. Start computing
Optional Rolls: Condor, Grid (based on NMI R4), Intel (compilers), Java, SCE (developed in Thailand), Sun Grid Engine, PBS (developed in Norway), Area51 (security monitoring tools), and many others.
(A sample frontend session for step 2 is sketched below.)
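
A hedged sketch of what step 2 looks like at the frontend console; only insert-ethers itself comes from the slide, and the node names are the usual Rocks defaults rather than anything specific to this talk.

    # Start listening for new compute nodes:
    insert-ethers
    # Select "Compute" as the appliance type, then power on each node with the
    # Rocks Base CD (or PXE). As each node's DHCP request arrives, insert-ethers
    # records it and assigns a name such as compute-0-0, compute-0-1, ...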

The Clusters

Iceberg
- 600-processor Intel Xeon 2.8 GHz
- Fast Ethernet
- Install date: 2002
- 1 TB storage
- Physical installation: 1 week
- Rocks installation tuning: 1 week

Iceberg at Clark Center
- One week to move and rebuild the cluster
- Then running jobs again

Top 500 Supercomputer

Nivation
- 164-processor Intel Xeon 3.0 GHz
- 4 GB RAM per node
- Myrinet
- Gigabit Ethernet
- Two 1 TB NAS appliances
- 4 tools nodes

[Diagram: cluster network layout showing the campus backbone, the frontend server (400 MBytes/sec), tools nodes Tools-1 through Tools-4, and an NFS appliance on the GigE network; the goals were eliminating bottlenecks and adding redundancy, with the NFS appliance on GigE marked as a bottleneck and single point of failure]

Panasas Integration in Less Than 2 Hours
- Installation and configuration of the Panasas shelf: 1 hour
- Switch configuration changes for link aggregation: 10 minutes
- Copy RPM to S: 1 minute
- Create/edit extend-compute.xml: 5 minutes

# Add panfs to fstab
REALM=10.10.10.10
mount_flags="rw,noauto,panauto"
/bin/rm -f /etc/fstab.bak.panfs
/bin/rm -f /etc/fstab.panfs
/bin/cp /etc/fstab /etc/fstab.bak.panfs
/bin/grep -v "panfs://" /etc/fstab > /etc/fstab.panfs
/bin/echo "panfs://$REALM:global /panfs panfs $mount_flags 0 0" >> /etc/fstab.panfs
/bin/mv -f /etc/fstab.panfs /etc/fstab
/bin/sync
/sbin/chkconfig --add panfs
/usr/local/sbin/check_panfs

# Keep slocate from indexing the panfs mount
LOCATECRON=/etc/cron.daily/slocate.cron
LOCATE=/etc/sysconfig/locate
LOCTEMP=/tmp/slocate.new
/bin/cat $LOCATECRON | sed "s/,proc,/,proc,panfs,/g" > $LOCTEMP
/bin/mv -f $LOCTEMP $LOCATECRON
/bin/cat $LOCATECRON | sed "s/\/afs,/\/afs,\/panfs,/g" > $LOCTEMP
/bin/mv -f $LOCTEMP $LOCATECRON

- Rebuild the distribution and re-install the nodes: 30 minutes
  [root@rockscluster]# rocks-dist dist ; cluster-fork '/boot/kickstart/cluster-kickstart'
- Add one automount entry per user (script it to save time):
  /etc/auto.home:  userX -fstype=panfs panfs://10.x.x.x/home/userX
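
The last bullet above suggests scripting the per-user automount entries; a minimal sketch of that, keeping the slide's placeholder realm address 10.x.x.x and assuming the map file is /etc/auto.home:

    # Emit one panfs automount entry per existing home directory (sketch only).
    for user in $(ls /home); do
        echo "$user -fstype=panfs panfs://10.x.x.x/home/$user"
    done >> /etc/auto.home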

Benchmarking Panasas Using bonnie

#!/bin/bash
#PBS -N BONNIE
#PBS -e Log.d/BONNIE.panfs.err
#PBS -o Log.d/BONNIE.panfs.out
#PBS -m aeb
#PBS -M hpcclusters@gmail.com
#PBS -l nodes=1:ppn=2
#PBS -l walltime=30:00:00

PBS_O_WORKDIR='/home/sjones/benchmarks'
export PBS_O_WORKDIR

### --------------------------------------
### BEGINNING OF EXECUTION
### --------------------------------------
echo The master node of this job is `hostname`
echo The job started at `date`
echo The working directory is $PBS_O_WORKDIR
echo This job runs on the following nodes:
echo `cat $PBS_NODEFILE`
### end of information preamble

cd $PBS_O_WORKDIR
cmd="/home/tools/bonnie/sbin/bonnie -s 8000 -n 0 -f -d /home/sjones/bonnie"
echo "running bonnie with: $cmd in directory " `pwd`
$cmd >& $PBS_O_WORKDIR/Log.d/run9/log.bonnie.panfs.$PBS_JOBID
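
Submitting the benchmark is a single qsub call; the script file name below is hypothetical, since the slide does not name it.

    qsub bonnie-panfs.pbs    # output and errors land in Log.d/ as set by the #PBS directives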

NFS - 8 Nodes (bonnie v1.03, 8000 MB files)

Machine        Seq. Output Block    Rewrite          Seq. Input Block    Random Seeks
               K/sec    %CP         K/sec    %CP     K/sec    %CP        /sec    %CP
compute-3-82   2323     0           348      0       5119     1          51.3    0
compute-3-81   2333     0           348      0       5063     1          51.3    0
compute-3-80   2339     0           349      0       4514     1          52.0    0
compute-3-79   2204     0           349      0       4740     1          99.8    0
compute-3-78   2285     0           354      0       3974     0          67.9    0
compute-3-77   2192     0           350      0       5282     0          46.8    0
compute-3-74   2292     0           349      0       5112     1          45.4    0
compute-3-73   2309     0           358      0       4053     0          64.6    0

17.80 MB/sec concurrent write using NFS with 8 dual-processor jobs
36.97 MB/sec during the read process
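
The aggregate figures follow directly from the per-node block rates; the arithmetic below is only a sanity check using values copied from the table above.

    # Sum the per-node sequential block-write rates (K/sec) and convert to MB/sec.
    echo 2323 2333 2339 2204 2285 2192 2292 2309 | tr ' ' '\n' \
      | awk '{ s += $1 } END { printf "%.1f MB/sec aggregate write\n", s / 1024 }'
    # Prints about 17.8 MB/sec, in line with the 17.80 MB/sec figure above.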

PanFS - 8 Nodes (bonnie v1.03, 8000 MB files)

Machine        Seq. Output Block    Rewrite          Seq. Input Block    Random Seeks
               K/sec    %CP         K/sec    %CP     K/sec    %CP        /sec    %CP
compute-1-18   20767    8           4154     3       24460    7          72.8    0
compute-1-17   19755    7           4009     3       24588    7          116.5   0
compute-1-16   19774    7           4100     3       23597    7          96.4    0
compute-1-15   19716    7           3878     3       25384    8          213.6   1
compute-1-14   19674    7           4216     3       24495    7          72.8    0
compute-1-13   19496    7           4236     3       24238    7          71.0    0
compute-1-12   19579    7           4117     3       23731    7          97.1    0
compute-1-11   19688    7           4038     3       24195    8          117.7   0

154 MB/sec concurrent write using PanFS with 8 dual-processor jobs
190 MB/sec during the read process

NFS - 16 Nodes (bonnie, 8000 MB files)

Seq. Output Block    Rewrite          Seq. Input Block    Random Seeks
K/sec    %CP         K/sec    %CP     K/sec    %CP        /sec    %CP
1403     0           127      0       2210     0          274.0   2
1395     0           132      0       1484     0          72.1    0
1436     0           135      0       1342     0          49.3    0
1461     0           135      0       1330     0          53.7    0
1358     0           135      0       1291     0          54.7    0
1388     0           127      0       2417     0          45.5    0
1284     0           133      0       1608     0          71.9    0
1368     0           128      0       2055     0          54.2    0
1295     0           131      0       1650     0          47.4    0
1031     0           176      0       737      0          18.3    0
1292     0           128      0       2124     0          104.1   0
1307     0           129      0       2115     0          48.1    0
1281     0           130      0       1988     0          92.2    1
1240     0           135      0       1488     0          54.3    0
1273     0           128      0       2446     0          52.7    0
1282     0           131      0       1787     0          52.9    0

20.59 MB/sec concurrent write using NFS with 16 dual-processor jobs
27.41 MB/sec during the read process

PanFS - 16 Nodes (bonnie, 8000 MB files)

Seq. Output Block    Rewrite          Seq. Input Block    Random Seeks
K/sec    %CP         K/sec    %CP     K/sec    %CP        /sec    %CP
14330    5           3392     2       28129    9          54.1    0
14603    5           3294     2       30990    9          60.3    0
14414    5           3367     2       28834    9          55.1    0
9488     3           2864     2       17373    5          121.4   0
8991     3           2814     2       21843    7          116.5   0
9152     3           2881     2       20882    6          80.6    0
9199     3           2865     2       20783    6          85.2    0
14593    5           3330     2       29275    9          61.0    0
9973     3           2797     2       18153    5          121.6   0
9439     3           2879     2       22270    7          64.9    0
9307     3           2834     2       21150    6          99.1    0
9774     3           2835     2       20726    6          77.1    0
15097    5           3259     2       32705    10         60.6    0
14453    5           2907     2       36321    11         126.0   0
14512    5           3301     2       32841    10         60.4    0
14558    5           3256     2       33096    10         62.2    0

187 MB/sec concurrent write using PanFS with 16 dual-processor jobs
405 MB/sec during the read process
Capacity imbalances on the jobs: a 33 MB/sec write increase from the 8-job to the 16-job run

Panasas Statistics During the Write Process

[pancli] sysstat storage
IP             CPU Util  Disk Util  Ops/s   KB/s In  KB/s Out  Total(GB)  Avail(GB)  Resv(GB)
10.10.10.250   55%       22%        127     22847    272       485        367        48
10.10.10.253   60%       24%        140     25672    324       485        365        48
10.10.10.245   53%       21%        126     22319    261       485        365        48
10.10.10.246   55%       22%        124     22303    239       485        366        48
10.10.10.248   57%       22%        134     24175    250       485        369        48
10.10.10.247   52%       21%        124     22711    233       485        366        48
10.10.10.249   57%       23%        135     24092    297       485        367        48
10.10.10.251   52%       21%        119     21435    214       485        366        48
10.10.10.254   53%       21%        119     21904    231       485        367        48
10.10.10.252   58%       24%        137     24753    300       485        366        48
Total "Set 1"  55%       22%        1285    232211   2621      4850       3664       480

Sustained BW: 226 MBytes/sec during 16 concurrent 1 GB writes

Panasas Statistics During the Read Process

[pancli] sysstat storage
IP             CPU Util  Disk Util  Ops/s   KB/s In  KB/s Out  Total(GB)  Avail(GB)  Resv(GB)
10.10.10.250   58%       95%        279     734      21325     485        355        48
10.10.10.253   60%       95%        290     727      22417     485        353        48
10.10.10.245   54%       92%        269     779      19281     485        353        48
10.10.10.246   59%       95%        290     779      21686     485        354        48
10.10.10.248   60%       95%        287     729      22301     485        357        48
10.10.10.247   52%       91%        256     695      19241     485        356        48
10.10.10.249   57%       93%        276     708      21177     485        356        48
10.10.10.251   49%       83%        238     650      18043     485        355        48
10.10.10.254   45%       82%        230     815      15225     485        355        48
10.10.10.252   57%       94%        268     604      21535     485        354        48
Total "Set 1"  55%       91%        2683    7220     202231    4850       3548       480

Sustained BW: 197 MBytes/sec during 16 concurrent 1 GB sequential reads

This is our typical storage utilization with the cluster at 76%

[pancli] sysstat storage
IP             CPU Util  Disk Util  Total(GB)  Avail(GB)  Resv(GB)
10.10.10.250   6%        5%         485        370        48
10.10.10.253   5%        4%         485        368        48
10.10.10.245   4%        3%         485        368        48
10.10.10.246   6%        4%         485        369        48
10.10.10.248   5%        3%         485        372        48
10.10.10.247   3%        3%         485        370        48
10.10.10.249   5%        3%         485        371        48
10.10.10.251   4%        3%         485        369        48
10.10.10.254   4%        3%         485        370        48
10.10.10.252   4%        3%         485        370        48
Total "Set 1"  4%        3%         4850       3697       480
(Total I/O across the set: 315 Ops/s, 2482 KB/s in, 3425 KB/s out)

Sustained BW: 2.42 MBytes/sec in, 3.34 MBytes/sec out

[root@frontend-0 root]# showq
7 active jobs, all running; 125 of 164 processors active (76.22%); 65 of 82 nodes active (79.27%)

Panasas Object Storage

Requirements for Rocks
- Performance
  – High read concurrency for parallel applications and data sets
  – High write bandwidth for memory checkpointing and for interim and final output
- Scalability
  – More difficult problems typically mean larger data sets
  – Scaling cluster nodes requires scalable I/O performance
- Management
  – A single system image maximizes utility for the user community
  – Minimize operations and capital costs

Shared Storage: The Promise
- Shared storage for cluster computing: the compute-anywhere model
  – Partitions are available globally to all cluster compute nodes; no replicas required (shared data sets)
  – No data staging required
  – No distributed data-consistency issues
  – Reliable checkpoints; application reconfiguration
  – Results gateway
- Enhanced reliability via RAID
- Enhanced manageability
  – Policy-based management (QoS)

Shared Storage Challenges
- Performance, scalability, and management
  – Single file system performance is limited
  – Multiple volumes and mount points
  – Manual capacity and load balancing
  – Large quantum upgrade costs

Motivation for a New Architecture
A highly scalable, interoperable, shared storage system with:
- Improved storage management: self-managing, policy-driven storage (e.g. backup and recovery)
- Improved storage performance: quality of service, differentiated services
- Improved scalability: of performance and of metadata (e.g. free-block allocation)
- Improved device and data sharing: shared devices and data across OS platforms

Next-Generation Storage Cluster
- Scalable performance
  – An offloaded data path enables direct disk-to-client access
  – Scale clients, network, and capacity
  – As capacity grows, performance grows
- Simplified and dynamic management
  – Robust, shared file access by many clients
  – Seamless growth within a single namespace eliminates time-consuming admin tasks
- Integrated HW/SW solution
  – Optimizes performance and manageability
  – Ease of integration and support
[Diagram: a Linux compute cluster performs jobs directly ("single step") against the high-I/O Panasas Storage Cluster, with parallel data paths to the object storage devices and a separate control path to the metadata managers]
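
A hedged illustration of what the parallel data path means in practice on a cluster like this one: every node writes its own file under the shared panfs mount at the same time. cluster-fork and the /panfs mount point come from earlier slides; the scratch directory name and file size are made up for the sketch.

    # Each node streams 1 GB to its own object-backed file concurrently; the data
    # moves node-to-OSD in parallel, while only metadata traffic touches the
    # metadata managers.
    cluster-fork 'dd if=/dev/zero of=/panfs/scratch/$(hostname).dat bs=1M count=1024'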

Object Storage Fundamentals
- An object is a logical unit of storage
  – Lives in a flat namespace with an ID
- Contains application data and attributes
  – Metadata: block allocation, length
  – QoS requirements, capacity quota, etc.
- Has file-like methods
  – create, delete, read, write
- Three types of objects
  – Root object: one per device
  – Group object: a "directory" of objects
  – User object: for user data
- Objects enforce access rights
  – Strong capability-based access control

