Evaluation Guide


Tidal Software
Tidal Enterprise Scheduler

Purpose

The purpose of this document is to discuss the requirements for job scheduling software in the modern distributed enterprise. This guide provides a comprehensive list of key job scheduling features and a description of each feature; a companion checklist is also available. The checklist is designed to assist those actively evaluating job scheduling and event automation software by providing a tool for quickly and efficiently gathering information on products under consideration.

To set the proper context for a detailed discussion of key job scheduling features, this document includes some background and history on the evolution of the job scheduling marketplace. This background will help the reader understand how changing business and industry requirements impact the job scheduling arena. Among the handful of disciplines that routinely take place in the data center, job scheduling may be the most important of all. This is a bold statement, given that job scheduling competes with other important systems management functions like file backup, network management and security.

While these are important disciplines in their own right, there is no arguing that, depending on the size of the enterprise, a job scheduler routinely manages thousands (or, in many cases, tens of thousands) of individual business processes every day. In fact, the number of processes involved is so large that a manual approach is completely infeasible. Custom job scheduling solutions that rely on native operating system utilities such as CRON, NT Scheduler, PERL and VB scripts quickly become unworkable and collapse under the weight of their own unmanageable 'spaghetti code.' Given this backdrop, it is easy to see how job schedulers are an indispensable part of your IT infrastructure.

Background

The discipline of job scheduling was first established in the 1970s, when it became a key strategic infrastructure component for large mainframe-oriented data centers.
A variety of products were created and extensively marketed until it became widely accepted that a robust job scheduling tool was required to effectively manage applications on the mainframe. During this period of mainframe-centric computing, a common understanding of the key features of job scheduling began to emerge. When Unix began to make inroads into mainstream data centers in the mid-1990s, IT managers widened their search to job scheduling solutions for managing the distributed environment.

As the shift to UNIX began, few of the existing mainframe vendors created new job scheduling offerings to fill the void. Instead, the mainframe vendors and many mainframe-oriented data centers experimented with attempts to manage distributed workload from the mainframe. As UNIX continued to make inroads into the data center, a new group of competitors entered the market with products created expressly for managing distributed job scheduling.

As the two competing approaches were deployed, it quickly became apparent that the products created expressly for the distributed environment were a far better approach, and the mainframe approach to managing distributed workload was relegated to a small market segment. Even so, many mainframe data centers still cling to the elusive dream of managing all workload, regardless of platform, from a single console. Unfortunately, few if any have been able to achieve this goal, and few vendors appear focused on the issue.

The early distributed offerings proved to be reasonably robust and did a passable job of mimicking mainframe scheduling features and functions, but they suffered from being first-generation products.
Eventually, all of the prominent vendors of these products 'hit the wall' in terms of scalability, flexibility and ease of use, and were ultimately acquired in the late 1990s by mainframe companies who were still grappling with the unique issues and challenges of the distributed marketplace.

During the period that the first-generation products were being acquired, newer competitors began crafting more advanced solutions for job scheduling in the distributed environment. These newer vendors had several distinct advantages over their first-generation counterparts, including:

1. Improved development technologies. The early products were generally challenged because of the limitations of the base technology used to develop them. By the early to mid-90s, development tools for the distributed environment had not really matured sufficiently to be of significant value, so products of that genre were generally built on older, harder-to-maintain code bases and used the UNIX Motif standard for their graphical user interface. In fact, a number of the first-generation products were actually ported from other proprietary platforms in the early 1990s, with the usual complications caused by porting aging software products. The combination of the older, inflexible code bases, the lack of robust development tools and the lack of any GUI standards created problems as the market began to mature and the requirements for distributed scheduling began to change. These first-generation products could not keep pace with these changing requirements and began to falter in terms of scalability, ease of use, ease of installation and configuration, and maintainability.

2. An understanding of cross-platform development. Like their mainframe predecessors, the early UNIX products essentially had a single-platform orientation; the only difference being that the platform was UNIX instead of MVS or OS/390. This problem came into focus when Windows 95/98 took over the desktop and Windows NT/2000 began to invade the data center. While the early UNIX products had the initial lead on Windows NT/2000, their shortcomings for true cross-platform, distributed scheduling became evident as more and more companies attempted to incorporate Windows NT/2000 into their data center operations. The newer generation of job scheduling products is capable of exploiting the ease of use of the Windows desktop and newer development technologies. These newer products also reflect a deeper understanding of the rigorous requirements for building a true cross-platform solution that incorporates UNIX, Windows and other platforms.

3. A more mature marketplace. The latest generation of products has been developed with large distributed workloads, in the form of applications like SAP R/3, PeopleSoft and Oracle, already running in production.
The previous generation of products was built on older technology and without the benefit of being battle-tested with significant workloads. Consequently, these early scheduling tools did not have the power and scalability to manage large distributed workloads when that inevitably became a requirement. The shortcomings of these products may not have been readily apparent for several years, at which point their users were forced to look for a replacement. Many companies are in this position today: looking to replace their first-generation job schedulers because of scalability and maintenance issues. The latest generation of products is designed to manage tens of thousands of jobs daily. Unfortunately for the early vendors, and their customers, it is much easier to design a product to be scalable from inception than to try to retrofit scalability into an older design.

4. An understanding of true 24x7x365 operations. When mainframe job schedulers were first created, there were two distinct periods of operation in the data center: online and batch. Typically during the day, the databases were 'online' and only available to end users for true "real-time" tasks that involved entering transactions at their terminals. At night, the databases were brought 'offline' and batch activity, typically reporting and other batch-intensive work, was allowed to process. This batch window was often as long as 12 hours (for example, 6:00 pm to 6:00 am). After the first generation of distributed scheduling products was built in the early 90's, people began to speak of the 'shrinking batch window;' however, for the new UNIX platform, the batch window was still very much intact.

By the late 90's, when the latest generation of job scheduling products was being created, two significant issues had started to change the landscape for corporations and software providers alike, namely globalization and e-business. These phenomena began to overwhelm data center operations with the need for true 24x7x365 operations.
The net effect was the virtual elimination of the batch window, which essentially created a new set of requirements for scheduling tools. The latest generation of job schedulers is able to respond to these evolving requirements with the addition of new features for improved scalability and better, faster workload management. Ultimately, these new requirements require the job scheduler to be a true "workload manager," responding to a wide variety of events for launching and managing jobs, and not merely relying on the traditional date-and-time model of scheduling jobs. A number of these features, including event management, are discussed later in this document.

Over the years, job scheduling has matured as a discipline and become accepted as a requirement for any comprehensive infrastructure management strategy. During this maturation phase, the industry has developed an understanding of the common, or core, functions that a product needs in order to be considered a 'serious' or 'industrial-strength' product. This section of the evaluation guide discusses this core set of features commonly acknowledged as requirements for any serious scheduling tool.

Business Calendars

At the heart of any scheduler is the concept of starting jobs at a particular date and time. Internally, businesses actually run on a variety of calendars, including fiscal calendars, manufacturing calendars, payroll calendars, holiday calendars and others that all drive specific aspects of a modern business. Job schedulers are expected to provide comprehensive calendaring facilities that are easy to use, graphical in nature and able to combine calendars with one another to achieve a desired result (for instance, what day should payroll be run if it happens to fall on a holiday?). Surprisingly complex calendars can be required to manage a modern corporation, with some companies managing several hundred, depending on their size and number of geographic locations.
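The payroll-on-a-holiday question can be illustrated with a minimal sketch. The holiday set, the roll-back rule and the function names below are all hypothetical; a production scheduling calendar is far richer than this:

```python
from datetime import date, timedelta

# Hypothetical holiday calendar; real schedulers maintain many such
# calendars (fiscal, payroll, manufacturing) and combine them.
HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}

def is_business_day(d: date) -> bool:
    """A weekday that is not a holiday counts as a business day."""
    return d.weekday() < 5 and d not in HOLIDAYS

def adjusted_run_date(scheduled: date) -> date:
    """Roll a scheduled run date back to the nearest prior business day."""
    while not is_business_day(scheduled):
        scheduled -= timedelta(days=1)
    return scheduled

# Payroll scheduled for Christmas Day 2024 (a Wednesday) rolls back
# to Tuesday, December 24.
print(adjusted_run_date(date(2024, 12, 25)))  # 2024-12-24
```

The point of the sketch is that the combination rule (roll back, roll forward, skip entirely) is itself a business decision, which is why evaluators should test how flexibly a product lets calendars be composed.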
Because of the vital nature of this feature, buyers are advised to scrutinize the calendar capability of any product to see that it meets their needs in terms of flexibility, ease of use and complexity.

Dependencies

Right after the business calendar in terms of importance comes the idea of the 'dependency.' As the word implies, this is the ability of the scheduler to control the execution of the various tasks by having them execute not just on a certain date, but also in a particular order and in conjunction with other tasks. The simplest expression of a dependency is Job B follows Job A. In other words, Job B cannot execute unless Job A has completed. However, this simple concept can get very complex very quickly. Job B might have to wait for a particular file to arrive, or it might have to wait for some data to be input by a user, or it might be restricted from running after a certain time in the evening. When looking at the dependency capabilities of a product, it is important to have a clear idea of the types of dependencies that exist and to be able to easily map those to the product. If the product cannot create the calendar that is needed, or run jobs in the correct order with the right dependencies satisfied, then it is not likely to ever meet a site's core business needs and will be very cumbersome to use.
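The 'Job B follows Job A' model, extended with a file-arrival condition, can be sketched as follows. The job names, the dependency structure and the file name are invented for illustration and are not the product's actual model:

```python
# Each job may depend on predecessor jobs and, optionally, on the
# arrival of a file. A job is released only when both are satisfied.
jobs = {
    "extract": {"after": [],          "needs_file": None},
    "load":    {"after": ["extract"], "needs_file": "feed.dat"},
    "report":  {"after": ["load"],    "needs_file": None},
}

def run_schedule(arrived_files):
    """Release jobs in dependency order; return the execution sequence."""
    completed, order = set(), []
    progress = True
    while progress:
        progress = False
        for name, job in jobs.items():
            if name in completed:
                continue
            preds_ok = all(p in completed for p in job["after"])
            file_ok = job["needs_file"] is None or job["needs_file"] in arrived_files
            if preds_ok and file_ok:
                completed.add(name)
                order.append(name)
                progress = True
    return order

print(run_schedule({"feed.dat"}))  # ['extract', 'load', 'report']
print(run_schedule(set()))         # ['extract']  ('load' waits on the file)
```

Even this toy version shows why dependency evaluation matters: a single unsatisfied condition holds back every downstream job, which is exactly the behavior a real scheduler must make visible and manageable.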

Auto Recovery

In a perfect world, jobs would run correctly every time; however, such is not the case. From corrupt data files to users entering erroneous data to programmers making mistakes, all of these anomalies can lead to late or incorrect processing. To deal with such unpredictable issues, sophisticated job schedulers are expected to accommodate automated recovery actions. This feature needs to be flexible enough to allow for a variety of responses to a 'failure.' For instance, some failures are not meaningful enough to stop subsequent processing, while others will corrupt downstream processing and result in further failures, inaccurate reports and other problems. The product should be able to take distinctly different actions based on the relative severity of the problem encountered.

Additionally, the product should allow the user to create multiple types of recovery scenarios. For instance, in some cases it might be sufficient for the scheduler to simply stop processing altogether if the error is deemed severe enough. In other cases, it might be decided that when a specific type of error occurs, the schedule should back up a couple of steps and rerun those jobs. If that is not sufficient, the user may run a series of related recovery actions, like restoring a database, prior to attempting to run the jobs again. Minimally, the product selected should allow the user to stop processing, continue processing and/or run a series of recovery actions before moving on to some subsequent step.

Alert Management

Because job schedulers perform such a vital role in the infrastructure, they must be able to generate an alert when something unusual happens with the processing. This alert management needs to be flexible enough to handle a wide array of potential events and also extensible, so that it can identify and manage events of the users' choosing.
Like auto recovery, this feature needs to be able to respond differently to different types of events; in fact, alert management and auto recovery often need to work in conjunction with one another.

In a typical scenario, a job might fail, which in turn initiates some type of recovery action. At the same time, there may be a desire to send notification of the failure to a designated person. This notification might be in the form of an email to a specific user, a page to a technician, or a message sent to a central management console. Additionally, this alert management should allow for some type of acknowledgement of the alert, so that the scheduler itself is informed when the targeted person has received it. As with other features, ease of use and flexibility are key. Some products claim to have this alert capability, but closer examination reveals that it is only possible if the user writes a variety of scripts to alert someone to a particular problem.
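A rough sketch of how severity-based recovery and alerting might work together, as described above. The severity levels, action names and notification callback are invented for illustration; real products expose these as configurable policies rather than code:

```python
# Illustrative only: map a failure's severity to a recovery action
# and notify an interested party at the same time.
def handle_failure(job, severity, notify):
    """Pick a recovery action by severity and send an alert message."""
    if severity == "warning":
        action = "continue"              # not meaningful enough to stop
    elif severity == "error":
        action = "rerun-previous-steps"  # back up and rerun dependent jobs
    else:  # "fatal"
        action = "halt-schedule"         # stop all downstream processing
    notify(f"{job} failed ({severity}); recovery action: {action}")
    return action

alerts = []
handle_failure("payroll-load", "error", alerts.append)
print(alerts[0])  # payroll-load failed (error); recovery action: rerun-previous-steps
```

The `notify` callback stands in for whatever channel is configured (email, pager, management console); the acknowledgement loop the text describes would sit on top of it.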

Enterprise Application Support

The primary reason for implementing a job scheduling solution is to be able to support a company's mission-critical applications. This requirement has existed since the early days of the mainframe and is still true today. What has become more complicated is the sophistication required to successfully integrate with today's leading applications. Many of them (SAP, PeopleSoft, Oracle E-Business, Siebel, etc.) have some type of built-in scheduler. At the same time, this scheduling capability is insufficient to manage all of the various applications in the enterprise, so integration of these core applications becomes a central issue. Selected job scheduling vendors have extended their products to more readily encompass the needs of these critical applications and created interfaces that talk directly to the core scheduling functionality in these applications. If you have one of the applications listed above, commercially available interfaces exist and it would be worthwhile for you to evaluate them. At the same time, not all interfaces are created equal, so take the time to understand how the vendor actually supports the interface, how current the interface is, what specific features are available to support the application, and whether or not the application support is certified and at what level the certification was achieved.

Framework/Network Management Integration

Many companies today have implemented some type of network or systems management console. These products provide a variety of features and functions, but in many cases are implemented to provide a single-console view of the enterprise.
This single console typically deals with the notion of "management by exception," which is simply the idea that, given the incredible number of events that occur within a moderate to large IT shop, the operations personnel only want to deal with the exceptional conditions (typically the most serious errors).

Because a huge number of IT processes run in batch mode, it is critical that the scheduler integrate with the data center's chosen framework or network management console. Typically this integration will be twofold. First, and more obvious, the integration should allow the scheduler to inform the network console when there is some type of job failure (this is, in effect, another type of alert management, as discussed above). Second, the integration should allow the network management tool to monitor the scheduler and its associated infrastructure. Although not as obvious, this gives the network tool the ability to monitor the health of the scheduler itself. In this way, if the scheduler or one of its distributed components should experience an error, the network management console is able to report on the failure.

Security

It should be apparent that security is a vital requirement of a job scheduler. If your job scheduling product is in charge of running the mission-critical processes in your data center, then clearly you must control access to it. What may not be so obvious is that, in addition to controlling access to the scheduler itself, an administrator may also want to control access to individual features within the product, by user or by group.

This requirement exists because of the diversity of users, not all of whom need access to all scheduling functions. For example, operations personnel are typically given broad access to the tool, but you might want to restrict their access to certain jobs or certain features within the scheduler. In some corporations, end users are given limited access to the product so that they can monitor the progress of jobs of particular interest to them. Other personnel may have the authorization to create certain business calendars or jobs, but not the ability to run those jobs in production.

The key when looking at a scheduler's security features is to look at ease of use and granularity. Ease of use is necessary for quick authorization changes for a given user or group of users in response to changing business needs. It is dangerously shortsighted to compromise an operation's security policies simply because it is deemed too difficult to implement a particular policy. Granularity, or refinement, is important because of the need to make only certain features of the product available to certain users. Granularity makes it possible to grant precisely the types of user rights to just those users who need them.

Audit Trails

Many people relate audit trails to security, and while there is a strong connection, this is not the only benefit of audit trails. With audit trails in place, operations personnel can monitor and, when necessary, undo changes to the scheduling environment. For instance, even authorized and well-intentioned users can make mistakes when modifying the production schedule. Audit trails provide the means to understand precisely what changes were made to the production environment, and who made them. Given the rapidly changing requirements in data centers, it is important that all changes to the production environment are recorded.

A related topic is the concept of system or scheduling logs. While true audit trails are typically focused on what changes were made, by whom, and when, logs are simply the documentation of the results of a particular execution of a job or schedule. Well-organized logs give the user a clear indication of exactly what workload ran and when it ran.

Ease of Use

Although th
