Implementing Better Task Scheduling for Servers?

trifakir asks: "We are running some quite expensive SunFire servers with Solaris 8. In the 'crontabs' of these hosts we have scheduled maybe a hundred-odd jobs, which are constrained by multiple factors: dependencies on other jobs, time constraints, CPU and memory usage, network bandwidth, and so on. Obviously this imposes a constraint-satisfaction problem (CSP). On the other hand, the number of these jobs, each of which can take from minutes to hours, is growing, and we are now experiencing performance problems given the limited resources we have. Of course we have opened the bag of tricks with our best *ad-hoc* solutions, using mostly Open Source software, to make our system event-based and less dependent on the scheduling expertise of the admins. At one point we were considering AutoSys, and I was looking for a grid-like scheduler like OpenPBS; both were discarded for various reasons. I am curious, how you guys, would solve this problem, which seems very trivial for many environments. Advice about both theory (scheduling) and practice will help us and any other readers who may be tackling this difficult problem."
  • OMFG!? (Score:3, Funny)

    by Anonymous Coward on Wednesday July 14, 2004 @05:14PM (#9701197)
    I think this is the first Ask Slashdot question ever asked that can't be answered by a search on Google. Nice!

    Now let's see if anyone even has an answer...
    • Re:OMFG!? (Score:4, Funny)

      by jon787 ( 512497 ) on Wednesday July 14, 2004 @05:29PM (#9701362) Homepage Journal

      Now let's see if anyone even has an answer...

      No, no, no, that's not how an Ask Slashdot works!

      One group berates the person for not Googling first.
      Another points out this has been asked before.
      A third group goes and argues about a minor detail in the question instead of the real issue.
      A fourth will make jokes completely irrelevant to the issue.
      The next group will troll the debate by saying Windows already does it.
      One person will eventually answer the actual question but they will get modded to -1 because their answer isn't nearly as interesting as the rest of the comments.
  • by c0d3h4x0r ( 604141 ) on Wednesday July 14, 2004 @05:52PM (#9701548) Homepage Journal
    I am curious, how you guys, would solve this problem, which seems very trivial for many environments.

    Oh my god! Christopher Walken posts on Slashdot!

  • by crstophr ( 529410 ) on Wednesday July 14, 2004 @05:52PM (#9701552) Homepage
    First off, let me warn you. Do whatever you can to make someone else support this. The tangled web you weave becomes a nightmare after a few years.

    We use a software package called Control-M. It works on mainframes, Unix, and Windows. Jobs can be scheduled to run within windows of time, and can depend on the "out conditions" of one or more other jobs. You can specify things like the second Tuesday of the month. Jobs on Unix can depend on the successful completion of jobs on the mainframe, and so on. It really does it all.

    A few problems from the sysadmin perspective:

    The user interface, job naming conventions, and client/server/GUI model appear to have been designed by drunken, crack-smoking mainframe geezers. (ALL CAPS ABRVTDJBNMS.) It is one of those interfaces where you'll find yourself randomly clicking on buttons and menu items while praying and cursing.

    Developers see it as a crutch (why check dependencies when the scheduler is supposed to do it?) and a nice place to point fingers when things don't work. They can get away with writing little scripts and then inserting them into this tangled mess of jobs and dependencies. That mess is your problem.

    You'll get hammered with requests to add, delete, and change jobs for whatever reason. It will become the central time clock from which most major business processing is done. Once everything is migrated from cron (*heh*), you're very vulnerable to problems. Oh yes: have a problem late at night and the execs' precious TPS reports don't arrive on their desks in the morning. Heads roll....

    It's hell to manage and support. It's really a half-time position in a large organization. Send some poor sucker to the class and then make them responsible for the whole thing.

    Support is decent. As good as most vendors.

    So: Control-M does everything you could need, but is probably the most miserable application I've ever had the displeasure of managing.

    Oh yeah, and it is expensive $$$$$$$

    Best of luck. If you're really THAT bound by resources, buy more servers and spread the load.

  • MAKE, or PMAKE (Score:4, Interesting)

    by fdragon ( 138768 ) on Wednesday July 14, 2004 @05:54PM (#9701565)
    I am not sure on the details as I have not done this myself.

    But in your situation I would create a makefile to schedule the jobs. Make can handle concurrency, and with available patches it can distribute jobs to multiple nodes: Parallel Make Patches for GNU Make [llnl.gov].

    In a method like this I would recommend a small shared file system so that as each job completes you can touch a file. This would let you resume from where you left off or, if you wish, clear it out and start over.
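
    A minimal sketch of that idea (the job scripts and paths are made up, not from the poster): each target touches a stamp file on success, so an interrupted run resumes where it left off, and make -j runs independent jobs in parallel. Recipe lines must be indented with tabs.

    # Makefile -- hypothetical job graph with stamp files.
    # Run with `make -j4` to execute independent branches in parallel.
    STAMPS = /shared/stamps

    all: $(STAMPS)/report.done $(STAMPS)/archive.done

    $(STAMPS)/extract.done:
            ./extract_data.sh && touch $@

    $(STAMPS)/transform.done: $(STAMPS)/extract.done
            ./transform_data.sh && touch $@

    # archive and transform both depend only on extract,
    # so -j can run them at the same time
    $(STAMPS)/archive.done: $(STAMPS)/extract.done
            ./archive_data.sh && touch $@

    $(STAMPS)/report.done: $(STAMPS)/transform.done
            ./build_report.sh && touch $@

    Clearing out the stamp directory starts the whole cycle over.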
  • autosys rules (Score:2, Informative)

    by emphatic ( 671123 )
    you didn't mention why you discarded pretty common solutions for this problem, namely why you didn't like autosys. it's not free, nor cheap, but its ability to group tasks together and control dependencies between jobs, groups of jobs, and resources is pretty nice. just get used to using return codes in your scheduled jobs and you're good to go; no other changes to your existing jobs are required.
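
    For instance, a small sketch of that return-code convention (the paths are hypothetical; the point is simply that the scheduler treats a nonzero exit status as failure):

    #!/bin/sh
    # wrapper: propagate the real job's exit status to the scheduler
    /opt/jobs/nightly_load.sh
    status=$?
    if [ $status -ne 0 ]; then
        echo "nightly_load failed with status $status" >&2
    fi
    exit $status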
  • openpbs (Score:5, Informative)

    by hackstraw ( 262471 ) * on Wednesday July 14, 2004 @06:17PM (#9701797)
    If you're interested, post some kind of reply so that I can get in direct contact with you.

    I use openpbs (patched out the wazoo) with maui [supercluster.org] as the scheduler. The scheduler that comes with pbs sucks. Bad.

    PBS can do dependencies, and you can set up node properties for heterogeneous environments.
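
    A minimal sketch of a job dependency from the command line (the job script names are made up; the -W depend syntax is the standard OpenPBS/Torque mechanism):

    #!/bin/sh
    # submit the first job; qsub prints its job id on stdout
    FIRST=`qsub extract.pbs`
    # run the second job only if the first exits successfully
    qsub -W depend=afterok:$FIRST transform.pbs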

    I hate to say this, but buying faster and cheaper machines may help too. Sun/Solaris is slow. No flames intended, but it's a fact. Fortunately you're not using Solaris 9, with its 30% decrease in TCP/IP performance vs. 7. I'm not sure how robust 8 is, but 9 has too many "features" (read: bugs) for my taste.

    Maui also works with other resource managers. The Maui people have also forked OpenPBS into something that is "better"; YMMV. Maui also has a text-based interface called Wiki that you can use to build your own resource manager.

    The info in your problem description is kinda lacking, but there should be a reasonable solution to your problem.
  • by Anonymous Coward on Wednesday July 14, 2004 @06:37PM (#9701954)
    First of all, I don't know if you actually need to run certain tasks at specific times. If so, you will need to use cron after all. But here are some ideas:

    I had an app server that ran a number of critical tasks, and they had a somewhat arbitrary and complex dependency graph. A good analogy would be eBay's indexing cycle: a bunch of stuff has to happen as often as possible, but it's not really important whether each cycle takes 30 minutes or 35 minutes or an hour.

    Also it needed to be easy to extend: a programmer should be able to write the code and "stick it somewhere" to extend the system.

    The previous admin had set some tasks to run in cron every 5 minutes, not realizing that over time some of the jobs grew to take 6-7 minutes or more (you can imagine what the process table looked like after a while, and some of these were not locking resources properly...)

    I came up with a system using Makefiles (make takes care of interdependencies, and the -j flag runs independent processes in parallel) and djb's daemontools [cr.yp.to] package.

    If you're not familiar with daemontools, it is an incredibly tight little set of tools that lets you *atomically* and *reliably* start, stop, and configure daemons, and it lets you turn ANY script into a daemon. It just runs a "run" script you supply, and when the script dies for any reason, it restarts it. So you can create a script like this:

    #!/bin/sh
    # daemontools restarts this whenever it exits, so the make
    # cycle repeats forever with a 60-second pause between runs
    make -C /foo/bar/baz update
    sleep 60

    and it will run it over and over again. Combine this with resource limits and multilog logging and you have a bulletproof way to keep things going in the background.
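
    A hedged sketch of that combination (the 64 MB cap and directory layout are illustrative; softlimit and multilog are the standard daemontools programs):

    #!/bin/sh
    # ./run: merge stderr into stdout so multilog captures everything,
    # and cap the job's memory before running the cycle
    exec 2>&1
    softlimit -m 64000000 make -C /foo/bar/baz update
    sleep 60

    #!/bin/sh
    # ./log/run: timestamped, automatically rotated logs in ./main
    exec multilog t ./main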

    So I set up the dependencies in the Makefile, threw in a couple of scripts to run everything in a couple of "drop box" directories that programmers can use, documented everything, and made a web interface for checking the results. Now it doesn't really matter if the cycle takes 5 minutes or 10 minutes or 30 minutes; the makefiles are run over and over in a loop, keeping things up to date.

    Again, I don't know the specifics of your needs but this is definitely something to consider. Especially if your crontab has grown into a huge confusing mess, and you don't actually care what exact time things are running.
  • How about at (Score:3, Insightful)

    by np_bernstein ( 453840 ) on Wednesday July 14, 2004 @07:45PM (#9702479) Homepage
    Move all of your cron jobs to at scripts.
    At the beginning of each script, check whether the conditions the job needs are in place. If they are, do what you have to do; if they aren't, reschedule the job for 5 minutes from now and write it to a log. If the next run doesn't work either, reschedule for 10 minutes, then 15, until a maximum is reached, at which point the job dies and sends an email to the admin. Pretty basic, but with a little work it would do fine.
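
    A minimal sketch of that backoff pattern (the check script, paths, and address are all hypothetical; ksh is used for the arithmetic, since it ships with Solaris):

    #!/bin/ksh
    # nightly_job.sh: run when preconditions hold, otherwise
    # reschedule ourselves via at(1) with a growing delay
    DELAY=${1:-5}    # minutes until the next retry
    MAX=30           # give up beyond this delay

    if /opt/checks/preconditions_ok.sh; then
        # conditions met: do the real work
        /opt/jobs/do_real_work.sh
    else
        if [ "$DELAY" -gt "$MAX" ]; then
            echo "nightly_job gave up after repeated retries" | \
                mailx -s "nightly_job failed" admin@example.com
            exit 1
        fi
        echo "`date`: not ready, retrying in $DELAY minutes" >> /var/log/nightly_job.log
        echo "/opt/jobs/nightly_job.sh `expr $DELAY + 5`" | at now + $DELAY minutes
    fi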
  • It's very trivial in our environment. We just send AppWorx [appworx.com] a large check every year, then keep two or three contractors busy tending it. It handles all our needs, except those met by MQSeries [ibm.com]. But hey, we're a twenty billion dollar business. Your mileage may vary.
  • Topological sorting (Score:3, Interesting)

    by Sesse ( 5616 ) * <sgunderson@big[ ]t.com ['foo' in gap]> on Wednesday July 14, 2004 @08:14PM (#9702662) Homepage

    Almost sounds like a candidate for simply topologically sorting the entire dependency graph and then processing the tasks in order. You'll eventually end up with several disconnected subgraphs which can be parallelized; the big remaining problem seems to be prioritizing the tasks if you don't want them all run equally often, but I'm not really sure I've understood your problem correctly. :-)
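
    The standard Unix tsort(1) does exactly this. A small sketch with made-up job names, where each input line reads "prerequisite dependent":

    #!/bin/sh
    # tsort prints one valid linear ordering, one job per line;
    # jobs with no path between them could instead run in parallel
    tsort <<EOF
    extract transform
    transform report
    extract cleanup
    EOF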

    /* Steinar */

  • by T-Ranger ( 10520 )
    The Job Control Language.
  • by chriskenrick ( 89693 ) on Thursday July 15, 2004 @01:32AM (#9704576)
    Have you come across the ARMTech [aurema.com] resource management product from Aurema?

    Instead of scheduling when jobs run to take care of CPU/memory constraints, you set a policy for resource usage, which ARMTech then enforces (e.g., application A can have at most X amount of memory). The policy can be adjusted whenever the need arises. It might help in solving your problem.

    Disclaimer: I am an employee of the above company.
  • by jmorey ( 38458 )
    A long time ago I used a piece of free software called NQS to do something along these lines. I found the following reference (this is where I originally downloaded the software):

    http://hpux.cs.utah.edu/hppd/hpux/Networking/Admin/nqs-2.5/
  • by wizzy403 ( 303479 ) * on Thursday July 15, 2004 @10:19AM (#9706989)
    As much as I love to support Open Source, there are times when it's just not worth the pain. If you're already spending tens (if not hundreds) of thousands of dollars on Sun hardware (nothing wrong with that; we do it here), then pony up the cash for a REAL enterprise scheduler. Look at Tivoli. Look at Tidal. Look at Cybermation. They all have different strengths, so depending on what your organization needs, YMMV.

    We standardized on Tidal. Yep, it's expensive. Our IT budget for this year is in the $1M range, so it's time to step up into the big leagues.
  • by "Zow" ( 6449 ) on Thursday July 15, 2004 @10:53AM (#9707355) Homepage

    This is a common problem on supercomputers: you have lots of users who want to run lots of jobs with conflicting resource requirements, and typically some dependencies between jobs and the like. Take a look at some of the scheduling and resource-management tools available for supercomputers; maybe one of those will scratch your itch.

    A couple pointers to get you started:

    • SLURM [llnl.gov], which, while designed for Linux clusters, is a good system and should at least seed a Google search (disclaimer: I work for LLNL and am on the user end of SLURM, and I'm only speaking for myself here).
    • Condor [wisc.edu] is a lot more than scheduling, but it does that as well.

    Those are the ones I think it would be useful to look at for now. Most of the other systems are vendor specific.

    -"Zow"

    • by wik ( 10258 ) on Friday July 16, 2004 @06:23PM (#9722303) Homepage Journal
      I have mod points, but I'd rather add to the discussion. Condor does an excellent job of scheduling tasks with the resource constraints you mentioned, and it works across machines. I'd highly suggest looking into it. While it doesn't have a periodic-submission feature (AFAIK; I wouldn't use that even if it did), I don't see why you couldn't use cron to submit an initial Condor job that starts everything else off every day.

      The DAG scripts handle dependencies (setting these up the first time might be a bit hairy). Setting resource requirements is easy. Scheduling comes for free.
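
      A minimal sketch of a DAGMan input file (the submit-file names are hypothetical), run with condor_submit_dag jobs.dag:

      # jobs.dag: A first, then B and C in parallel, then D
      JOB A a.submit
      JOB B b.submit
      JOB C c.submit
      JOB D d.submit
      PARENT A CHILD B C
      PARENT B C CHILD D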

  • As long as we're tossing out commercial solutions, one that I haven't seen mentioned is Unicenter, from Computer Associates. Among all its other myriad features, it has a first-class workload-management component. The place I'm working at now uses it, and it supports job dependencies, time constraints, resource constraints, and about a zillion other conditionals. It's even somewhat reasonable to configure, once you get into it and learn its terminology. It does everything Control-M does, and then some.
