The Condor View Of Computing

Transcription

The Condor View of Computing
Miron Livny
Computer Sciences Department
University of Wisconsin-Madison
miron@cs.wisc.edu

Computing power is everywhere, how can we make it usable by anyone?

The Condor Project (Established '85)
Distributed Computing research performed by a team of 35 faculty, full time staff and students who
face software/middleware engineering challenges in a UNIX/Linux/Windows environment,
are involved in national and international collaborations,
interact with users in academia and industry,
maintain and support a distributed production environment (more than 2000 CPUs at UW),
and educate and train students.
Funding – DoD, DoE, NASA, NIH, NSF, AT&T, INTEL, Micron, Microsoft and the UW Graduate School

Claims for "benefits" provided by Distributed Processing Systems
P.H. Enslow, "What is a Distributed Data Processing System?", Computer, January 1978
High Availability and Reliability
High System Performance
Ease of Modular and Incremental Growth
Automatic Load and Resource Sharing
Good Response to Temporary Overloads
Easy Expansion in Capacity and/or Function

HW is a Commodity
Raw computing power and storage capacity is everywhere - on desk-tops, shelves, and racks. It is
cheap,
dynamic,
distributively owned,
heterogeneous and
evolving.

"Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing sub-systems can be integrated into multi-computer 'communities'."
Miron Livny, "Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems", Ph.D. thesis, July 1983.

Every community needs a Matchmaker*!
* or a Classified section in the newspaper or an eBay.

Why? Because ... someone has to bring together community members who have requests for goods and services with members who offer them.
Both sides are looking for each other
Both sides have constraints
Both sides have preferences
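The constraint-and-preference matching can be illustrated with a small sketch (Python, for illustration only - this is not Condor's ClassAd matchmaking engine, and the attribute names, rank function and matching loop are invented for the example):

    # Illustrative matchmaking sketch: each side states constraints on the other,
    # and the requester ranks the acceptable offers.
    requests = [  # community members asking for a service
        {"name": "job1", "needs_memory": 512, "wants_arch": "INTEL",
         "rank": lambda offer: offer["memory"]},          # preference: more memory is better
    ]
    offers = [    # community members offering a service
        {"name": "desktop7", "memory": 1024, "arch": "INTEL", "idle": True},
        {"name": "server3",  "memory": 256,  "arch": "INTEL", "idle": True},
    ]

    def acceptable(req, off):
        """Both sides' constraints must hold for a match to be possible."""
        requester_ok = off["memory"] >= req["needs_memory"] and off["arch"] == req["wants_arch"]
        owner_ok = off["idle"]                            # the owner only serves when idle
        return requester_ok and owner_ok

    def matchmake(requests, offers):
        """Pair each request with the acceptable offer it ranks highest."""
        matches, free = [], list(offers)
        for req in requests:
            candidates = [off for off in free if acceptable(req, off)]
            if candidates:
                best = max(candidates, key=req["rank"])   # apply the requester's preference
                matches.append((req["name"], best["name"]))
                free.remove(best)
        return matches

    print(matchmake(requests, offers))   # -> [('job1', 'desktop7')]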

We use Matchmakers to build Computing Communities out of Commodity Components

High Throughput Computing
For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. In other words, they are less concerned about instantaneous computing power. Instead, what matters to them is the amount of computing they can harness over a month or a year --- they measure computing power in units of scenarios per day, wind patterns per week, instruction sets per month, or crystal configurations per year.

High Throughput Computing is a 24-7-365 activity.
FLOPY ≠ (60*60*24*7*52)*FLOPS

Master-Worker Paradigm
Many scientific, engineering and commercial applications (software builds and testing, sensitivity analysis, parameter space exploration, image and movie rendering, High Energy Physics event reconstruction, processing of optical DNA sequencing, training of neural networks, stochastic optimization, Monte Carlo, ...) follow the Master-Worker (MW) paradigm where ...

Master-Worker Paradigm
... a heap or a Directed Acyclic Graph (DAG) of tasks is assigned to a master. The master looks for workers who can perform tasks that are "ready to go" and passes them a description (input) of the task. Upon the completion of a task, the worker passes the result (output) of the task back to the master.
Master may execute some of the tasks.
Master may be a worker of another master.
Worker may require initialization data.

Master-Worker computing is Naturally Parallel.
It is by no means Embarrassingly Parallel.
Doing it right is by no means trivial.
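A rough sketch of this flow in Python (threads stand in for remote workers and the squaring task is a placeholder - this is an illustration of the paradigm, not the Condor MW library API):

    # Minimal master-worker sketch: the master keeps a heap of ready tasks,
    # hands each worker a task description, and collects the results.
    import queue
    import threading

    def master(tasks, n_workers=3):
        todo, done = queue.Queue(), queue.Queue()
        for t in tasks:                        # the heap of "ready to go" tasks
            todo.put(t)

        def worker():
            while True:
                try:
                    task = todo.get_nowait()   # receive the task description (input)
                except queue.Empty:
                    return
                done.put((task, task * task))  # pass the result (output) back to the master

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        return dict(done.get() for _ in tasks)  # the master gathers all results

    print(master(range(10)))   # {0: 0, 1: 1, 2: 4, ...}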

Condor: our answer to High Throughput MW Computing on commodity resources

The World of Condors
› Available for most Unix and Windows platforms at www.cs.wisc.edu/Condor
› More than 350 Condor pools at commercial and academic sites world wide
› More than 12,000 CPUs world wide
› "Best effort" and "for fee" support available

The Layers of Condor
Submit (client) side: Application, Application Agent, Customer Agent
Matchmaker
Execute (service) side: Owner Agent, Remote Execution Agent, Local Resource Manager, Resource

The Grid: Blueprint for a New Computing Infrastructure
Edited by Ian Foster and Carl Kesselman
July 1998, 701 pages, $62.95
The grid promises to fundamentally change the way we think about and use computing. This infrastructure will connect multiple regional and national computational grids, creating a universal source of pervasive and dependable computing power that supports dramatically new classes of applications. The Grid provides a clear vision of what computational grids are, why we need them, who will use them, and how they will be programmed.

"We have provided in this article a concise statement of the 'Grid problem,' which we define as controlled resource sharing and coordinated resource use in dynamic, scalable virtual organizations. We have also presented both requirements and a framework for a Grid architecture, identifying the principal functions required to enable sharing within VOs and defining key relationships among these different functions."
"The Anatomy of the Grid - Enabling Scalable Virtual Organizations", Ian Foster, Carl Kesselman and Steven Tuecke

User - Condor
Grid - Globus Toolkit
Fabric (processing, storage, communication) - Condor

US Particle Physics Data Grid Project today
10 sites: LBL (STAR, STACS), U of Wisconsin (Condor), SLAC (BaBar), Fermilab (CMS, D0), Caltech (CMS), BNL (STAR, ATLAS), ANL (Globus, ATLAS), UCSD (CMS), SDSC (SRB), TJNAF

PACI TeraGrid
[Diagram of the TeraGrid architecture: IA-32 and IA-64 clusters at SDSC, NCSA, Argonne (Chiba City) and Caltech, Myrinet Clos spine interconnects, and a 40 Gb/s backbone]

Customer orders: Run Job F!
Server delivers.
[Diagram: the customer submits Job F to the server, which handles scheduling and notification]

Key Challenges
› Trustworthy and Robust services
› Effective Communication between consumer and provider of services
  Tell me what to do
  Tell me what you did and what happened
  Tell me what is going on
› Reasonable recovery policies on client and server sides

[Diagram: the Customer AG issues Condor submit and reads Condor status; a GridManager tracks the job's G-ID and drives a Globus Resource - Gate Keeper, Job Manager and Local Job Scheduler (Condor)]

Customer orders: Run Job F on the best CE!
Server delivers.

Condor Glide-in:
Expanding your Condor pool "on the fly" and executing your jobs on the remote resources in a "friendly" environment.

[Diagram: glide-in mechanics - the Customer AG does a Condor Submit X 2; the GridManager drives the Globus Resource (Gate Keeper, Job Manager, Local Job Scheduler), which starts a Glide-in that reports to the Match Maker and runs the App AG]

[Diagram: a Local (Personal) Condor of an SE or User reaches remote applications via Flocking to other Condor pools and via Condor-G with the Globus Toolkit, gliding in to resources managed by PBS, LSF and Condor]

It works!
[Diagram: the Master with the MW Library dispatching work through a GateKeeper to MW workers]

Master-Worker (MW) library
› Manages workers – locates resources, launches workers, monitors health of workers, ...
› Manages work – moves work and results between master and worker via files, PVM or TCP/IP sockets

The NUGn Quadratic Assignment Problem (QAP)

    $\min_{p} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} \, b_{p(i)p(j)}$

Despite its simple statement - minimize the assignment cost of n facilities to n locations - it is extremely difficult to solve even modest instances of this problem. Problems with n = 20 are difficult; problems with n = 30 have not even been attempted yet. We currently hold the world record, solving NUG25 in 6.7 hours (previous record: 56 days!!!). Our goal now is to solve NUG30, an unsolved problem formulated 30 years ago.
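To make the statement concrete, here is a hedged brute-force sketch of the objective above (toy 3x3 matrices invented for illustration - the real NUG solver is an MW-based branch-and-bound code, not this):

    # QAP: assign n facilities to n locations, minimizing
    #   cost(p) = sum_ij a[i][j] * b[p(i)][p(j)]
    # over all permutations p. Brute force is only feasible for tiny n.
    from itertools import permutations

    a = [[0, 5, 2],      # "flow" between facilities i and j
         [5, 0, 3],
         [2, 3, 0]]
    b = [[0, 8, 15],     # "distance" between locations p(i) and p(j)
         [8, 0, 13],
         [15, 13, 0]]

    def qap_cost(p, a, b):
        n = len(a)
        return sum(a[i][j] * b[p[i]][p[j]] for i in range(n) for j in range(n))

    best = min(permutations(range(len(a))), key=lambda p: qap_cost(p, a, b))
    print(best, qap_cost(best, a, b))   # n! permutations: hopeless long before n = 20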

NUG30 Personal Grid
Flocking:
-- the main Condor pool at Wisconsin (500 processors)
-- the Condor pool at Georgia Tech (284 Linux boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12 processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN Italy (54 processors)
Glide-in:
-- Origin 2000 (through LSF) at NCSA (512 processors)
-- Origin 2000 (through LSF) at Argonne (96 processors)
Hobble-in:
-- Chiba City Linux cluster (through PBS) at Argonne (414 processors)

NUG30 - Solved!!!
Date: Thu, 15 Jun 2000 21:26:19 -0500
Sender: goux@dantec.ece.nwu.edu
Subject: Re: Let the festivities begin.

Hi dear Condor Team,
you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days!
More stats tomorrow!!! We are off celebrating!
condor rules!
cheers, JP.

Solution Characteristics
Wall Clock Time       6:22:04:31
Avg. # Machines       653
Max. # Machines       1007
CPU Time              Approx. 11 years
Nodes                 11,892,208,412
LAPs                  ...
Parallel Efficiency   ...

The Workforce
[Plot of the number of workers over the course of the run]

11 CPU years in less than a week. How did they do it?
Effective management of their …

It Works!!!
Condor-XW / XtremWeb-C: Global Computing on Condor Pools
Miron Livny, Computer Sciences Department, University of Wisconsin-Madison, miron@cs.wisc.edu
Franck Cappello, Oleg Lodygensky, Vincent Neri, LRI - Université Paris sud

XtremWeb-C (XW in Condor)
Deploying XW Workers with Condor
Merge Condor flexibility and XtremWeb connectivity. Use Condor to:
› manage a pool of machines
› dispatch XtremWeb workers as Condor tasks
Enable Pull-mode task dispatching in a Condor pool.
[Diagram: a Condor Pool running XtremWeb workers]

Exploration of conformational transitions in proteins
› Molecular Dynamics is great for simulating random thermal deformations of a protein (e.g., normal prion protein), but unlikely to reach a particular "interesting" conformation (amyloid?), even if you really want to
› Vibrational Modes is great for identifying preferred deformations towards "interesting" conformations, but strictly applicable to small deformations only
› Combined approach: we force molecular dynamics to explore "interesting" deformations identified by vibrational modes
David Perahia and Charles Robert, UMR8619 CNRS, University of Paris-Sud Orsay, France

Obtain free-energy profiles
Explore low-energy (favorable) transition pathways
Extend to multiple dimensions (energy surfaces)
1) Generate n starting conformations along the coordinate of interest
2) Perform m constrained molecular dynamics simulations for each (n x m workers)
3) Gather statistics, etc.
4) Calculate the free energy profile
[Plot: free energy along the deformation coordinate, showing the energy barrier; the simulations run on XtremWeb Workers in Condor Pools]
David Perahia and Charles Robert, UMR8619 CNRS, University of Paris-Sud Orsay, France

"The Grid" is not just a Grid of resources, it is a Grid of technologies.

Customer orders: Place y = F(x) at L!
Grid delivers.

Logical Request
Planning, scheduling, execution, error recovery, monitoring ...
Physical Resources

A simple plan for y = F(x) → L
1. Allocate size(x) + size(y) at SE(i)
2. Move x from SE(j) to SE(i)
3. Place F on CE(k)
4. Compute F(x) at CE(k)
5. Move y to L
6. Release allocated space
Storage Element (SE); Compute Element (CE)

What we have here is a simple six-node Directed Acyclic Graph (DAG).
Execution of the DAG must be controlled by the client.
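A hedged sketch of that six-node DAG in Python (the step names, the assumed dependencies and the trivial executor are placeholders for real Grid operations, not an actual planner):

    # Client-controlled execution of the six-step plan as a DAG:
    # a node runs only once all of the nodes it depends on are done.
    plan = {                  # node -> nodes it depends on
        "allocate_space": [],
        "move_x":         ["allocate_space"],
        "place_F":        [],
        "compute_F":      ["move_x", "place_F"],
        "move_y_to_L":    ["compute_F"],
        "release_space":  ["move_y_to_L"],
    }

    def run_dag(dag, run_node):
        """Run every node after its parents, under the client's control."""
        done = set()
        while len(done) < len(dag):
            ready = [n for n, parents in dag.items()
                     if n not in done and all(p in done for p in parents)]
            for node in ready:
                run_node(node)     # e.g. hand the step to the Grid and wait for it
                done.add(node)

    run_dag(plan, lambda node: print("running", node))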

Data Placement* (DaP) is an integral part of end-to-end functionality
* Space management and Data transfer

DAGMan (Directed Acyclic Graph Manager)
DAGMan allows you to specify the dependencies between your jobs (processing and DaP), so it can manage them automatically for you.

Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D

› each node will run the job specified by its accompanying Condor submit file
[Diagram: the diamond-shaped DAG - Job A, then Jobs B and C, then Job D]

Running a DAG
› DAGMan acts as a "meta-scheduler", managing the submission of your jobs to Condor-G based on the DAG dependencies.
[Diagram: DAGMan reads the .dag file and feeds nodes A, B, C, D into the Condor job queue]
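A hedged sketch of what a .dag file like diamond.dag above encodes (illustrative Python only - DAGMan's real parser handles many more directives):

    # Parse the diamond.dag text into a node -> submit-file map and a
    # node -> parents map.
    def parse_dag(text):
        jobs, parents = {}, {}
        for line in text.splitlines():
            words = line.split()
            if not words or words[0].startswith("#"):
                continue
            if words[0].lower() == "job":
                jobs[words[1]] = words[2]          # node name -> Condor submit file
                parents.setdefault(words[1], [])
            elif words[0].lower() == "parent":
                child_at = words.index("Child")
                for child in words[child_at + 1:]:
                    parents.setdefault(child, []).extend(words[1:child_at])
        return jobs, parents

    diamond = """
    # diamond.dag
    Job A a.sub
    Job B b.sub
    Job C c.sub
    Job D d.sub
    Parent A Child B C
    Parent B C Child D"""
    print(parse_dag(diamond))
    # -> ({'A': 'a.sub', ...}, {'A': [], 'B': ['A'], 'C': ['A'], 'D': ['B', 'C']})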

Running a DAG (cont'd)
› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a "rescue" file with the current state of the DAG.
[Diagram: node X has failed; DAGMan records the state of nodes A, B, C, D in a Rescue File alongside the Condor job queue]

It Works!!!!
High Energy Physics

PPDG-MOP US-CMS Test bed
[Map of the US-CMS test bed sites]

How Does MOP Work?
› From the perspective of the CMS production system (IMPALA), MOP is almost like a local batch system. Instead of submitting jobs to PBS or Condor, the system can submit them to MOP.
› For each physics job that IMPALA submits to MOP, MOP creates a DAG containing the sub-jobs necessary to run that job on the Grid.

MOP Job Stages
› Stage-in – get the program and its data to a remote site
› Run – run the job at the remote site
› Stage-back – get the program logs back from the remote site
› Publish – advertise the results so they will be sent to sites that want them
› Cleanup – clean up the remote site

MOP Job Stages
[Diagram: the five stages - stage-in, run, stage-back, publish, cleanup - as a DAG for one job]

Combined DAG
› MOP combines the five-stage DAG for each IMPALA job into one giant DAG, and submits it to DAGMan.
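A hedged sketch of that combination step (illustrative Python, not MOP's implementation - the job names and the dict-of-parents DAG representation are invented):

    # Build a linear five-stage DAG per physics job and merge them into one
    # giant DAG that a meta-scheduler can execute.
    STAGES = ["stage-in", "run", "stage-back", "publish", "cleanup"]

    def job_dag(job):
        """Each stage of a job depends on the previous stage of the same job."""
        nodes = [f"{job}/{stage}" for stage in STAGES]
        return {node: ([nodes[i - 1]] if i else []) for i, node in enumerate(nodes)}

    def combine(jobs):
        """One giant DAG containing the five-stage DAG of every job."""
        giant = {}
        for job in jobs:
            giant.update(job_dag(job))
        return giant

    giant_dag = combine(["cms_job_001", "cms_job_002"])
    for node, parents in giant_dag.items():
        print(node, "<-", parents)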

It Works!!!!
Sloan Digital Sky Survey

Chimera Virtual Data System
› Virtual data catalog – transformations, derivations, data
› Virtual data language – catalog definitions
› Query Tool
› Applications include browsers and data analysis applications
[Diagram: Virtual Data Applications use the Virtual Data Language to drive Chimera - a VDL Interpreter (manipulating derivations and transformations) over a Virtual Data Catalog (implementing the Chimera Virtual Data Schema, in XML) - which produces Task Graphs (compute and data movement tasks, with dependencies) executed on Data Grid Resources (distributed execution and data management) via the GriPhyN VDT: Replica Catalog, DAGMan, Globus Toolkit, etc.]
Argonne National Laboratory

Cluster-finding Data Pipeline
[Diagram of the SDSS cluster-finding data pipeline]
Argonne National Laboratory

Small SDSS Cluster-Finding DAG
[Diagram of a small cluster-finding DAG]
Argonne National Laboratory

And Even Bigger: 744 Files, 387 Nodes
[Diagram of the full DAG, with stages of 50, 60, 168 and 108 nodes]
Argonne National Laboratory

Cluster-finding Grid
[Diagram of the cluster-finding grid]
Work of: Yong Zhao, James Annis, & others

The customer depends on you, be logical and in control.
