Condor

Introduction

Condor is a high-throughput computing system for automatically distributing computing tasks across a pool of computers.

This generally means that a job will be sent to the first unclaimed node that meets the job requirements.

Condor at CSE

The Condor Pool

Currently, most CSE Linux lab machines are part of the Condor pool, but the general CSE login servers are not.

To access the CSE Condor pool, you need to be a member of the condor group; contact SS if you need to join the group.

It's possible to start jobs from any computer in the pool, but we would prefer that you log into condor.cse.unsw.edu.au to start them.

Important binaries

condor_status (Pool status)

Use condor_status to check the status of nodes in the Condor pool.

It can be run on any machine that has Condor installed, but it only lists the status of machines that can run jobs.

That is, it lists whether or not each node is available. Common states are:
Unclaimed
    A Condor job can be set to run on that machine.
Owner
    Resources are currently being utilised by a non-Condor process.
Claimed
    A Condor job is currently running on that machine.
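
The states above can be tallied to see how many nodes are free for new jobs. This is a minimal Python sketch only: the function name tally_states and the machine names in the snapshot are made up for illustration, and it does not parse real condor_status output.

```python
from collections import Counter

# Meaning of each state, as described on this page.
STATE_MEANING = {
    "Unclaimed": "A Condor job can be set to run on that machine.",
    "Owner": "Resources are currently being utilised by a non-Condor process.",
    "Claimed": "A Condor job is currently running on that machine.",
}


def tally_states(node_states):
    """Count how many nodes are in each state; reject unknown state names."""
    for state in node_states.values():
        if state not in STATE_MEANING:
            raise ValueError(f"unknown state: {state!r}")
    return Counter(node_states.values())


# Hypothetical pool snapshot: machine names and states are invented.
snapshot = {
    "node1": "Unclaimed",
    "node2": "Owner",
    "node3": "Claimed",
    "node4": "Unclaimed",
}
counts = tally_states(snapshot)
print(counts["Unclaimed"], "node(s) available for new jobs")
```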

condor_submit (Job submission)

To submit a job to the Condor pool, first create a submit file: a simple text file that tells Condor the details of the job as well as its machine requirements. An example submit file could look like:

    Executable = hello
    Universe = vanilla
    Output = hello.out
    Log = hello.log
    Queue

This would run the hello executable in the vanilla universe, write output to hello.out and keep a log of its progress within the Condor pool in hello.log. It's a good idea to be explicit with the paths of any files specified in the submit file.

Once the file is created (let's say we created hello.submit in this case, though it can be called anything), use condor_submit to send the job to the queue. In this example, the command would be:

    condor_submit hello.submit

This tells the server the details of the job, and the server then tries to match it with an appropriate machine. To check the status of a job, check the log file specified in the submit file, or use condor_q. See the official project manual for more examples of job submission files.
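
Because a submit file is just a list of "Key = value" lines followed by Queue, it is easy to generate from a script. The following Python sketch is not part of Condor; the helper name render_submit is hypothetical, and it only builds the text of the hello example described above.

```python
def render_submit(settings, queue_count=None):
    """Return submit-file text: one 'Key = value' line per setting, then Queue."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    # A bare 'Queue' line queues one job; 'Queue N' queues N copies.
    lines.append("Queue" if queue_count is None else f"Queue {queue_count}")
    return "\n".join(lines) + "\n"


# Settings taken from the hello example on this page.
hello_submit = render_submit({
    "Executable": "hello",
    "Universe": "vanilla",
    "Output": "hello.out",
    "Log": "hello.log",
})
print(hello_submit, end="")
```

The resulting text could be written to hello.submit and passed to condor_submit as described above.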

Important note for large jobs

The default behaviour is for a notification to be sent via email upon the completion of a job. For jobs with many iterations, this may cause spam-like behaviour from our mail systems. To prevent this, you should include the following line in your submit file:

    notification = never

This tells the system not to send any email notifications upon job completion.
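
If submit files are generated or edited by a script, the notification setting can be enforced automatically. The following is a rough Python sketch under two assumptions: the helper name ensure_no_notification is hypothetical, and keywords are matched case-insensitively.

```python
def ensure_no_notification(submit_text):
    """Return submit_text with 'notification = never' set (added if missing)."""
    out = []
    found = False
    for line in submit_text.splitlines():
        key = line.split("=", 1)[0].strip().lower()
        if "=" in line and key == "notification":
            # Overwrite any existing notification setting.
            line = "notification = never"
            found = True
        out.append(line)
    if not found:
        # Put the setting first so that it precedes every Queue line.
        out.insert(0, "notification = never")
    return "\n".join(out) + "\n"


print(ensure_no_notification("Executable = hello\nQueue\n"), end="")
```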

condor_q and condor_rm (Job management)

To check the status of queued jobs sent from the current machine, use condor_q. The -long and -analyze switches can be used for more detail. To remove a job from the queue, use condor_rm followed by the job number, which can be determined from the log file or by using condor_q.
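
For scripted job management, the commands described above can be built as argument lists and run only where Condor is installed. This is a sketch rather than a supported tool: the function names are hypothetical, the job number 42 is made up, and only the switches mentioned on this page are accepted.

```python
import shutil
import subprocess


def condor_q_command(detail=None):
    """Build a condor_q command line; detail may be 'long' or 'analyze'."""
    if detail not in (None, "long", "analyze"):
        raise ValueError(f"unsupported detail switch: {detail!r}")
    return ["condor_q"] + ([f"-{detail}"] if detail else [])


def condor_rm_command(job_number):
    """Build a condor_rm command line for one job number."""
    return ["condor_rm", str(job_number)]


def run_if_available(command):
    """Run the command only if its binary is on PATH; return stdout or None."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return result.stdout


print(condor_q_command("long"))
print(condor_rm_command(42))  # 42 is a made-up job number
```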

condor_compile (Checkpointing)

Some jobs can be relinked with condor_compile to take advantage of Condor's checkpointing system in the standard universe. As an example, consider the submission of hello. Start with a C++ source file hello.c, then create the object file:

    g++ -c hello.c -o hello.o

The next step is to relink with condor_compile:

    condor_compile g++ hello.o -o hello

The job can then be submitted using the standard universe, and it will be checkpointed and migrated to another machine whenever a machine enters the Owner state for an extended period of time.

Universes

Condor offers several different universes in which to run jobs. The universe in which a user wants a job run must be specified in the submit file. The following universes are available:
  • Standard: for jobs that can be relinked with condor_compile
  • Vanilla: for jobs that can't be relinked
  • PVM
  • MPI
  • Globus
  • Java: for running Java jobs on a JVM
  • Scheduler
A detailed description of each universe can be found in the official Condor manual.
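
The choice between the universes described above can be sketched as a small helper. This is an illustration only: the function names are hypothetical, and the rule covers just the three universes this page gives guidance for (Java, Standard and Vanilla).

```python
# Universe names as listed on this page, in lower case.
UNIVERSES = {"standard", "vanilla", "pvm", "mpi", "globus", "java", "scheduler"}


def choose_universe(is_java=False, relinked=False):
    """Pick a universe: java for Java jobs, standard if relinked, else vanilla."""
    if is_java:
        return "java"
    return "standard" if relinked else "vanilla"


def universe_line(name):
    """Return the submit-file line for a universe, rejecting unlisted names."""
    if name.lower() not in UNIVERSES:
        raise ValueError(f"universe not listed on this page: {name!r}")
    return f"Universe = {name.lower()}"


print(universe_line(choose_universe(relinked=True)))
```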

Detailed documentation

Further Condor documentation can be found at the official project homepage, and in particular in the manual.

Check Condor is working and jobs are being scheduled on multiple computers

ssh to a Condor node, e.g. robles, and create a test directory

    % ssh robles
    % mkdir condorTest
    % cd condorTest

create a C program file to sleep for a specified number of seconds

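Use a text editor to create 'simple.c' with the following code:

    #include <stdio.h>
    #include <stdlib.h>   /* atoi */
    #include <unistd.h>   /* sleep */

    /* Sleep for the number of seconds given as the only argument. */
    int main(int argc, char **argv)
    {
        int sleep_time;
        int failure;

        if (argc != 2) {
            failure = 1;   /* wrong number of arguments */
        } else {
            sleep_time = atoi(argv[1]);
            sleep(sleep_time);
            failure = 0;
        }
        return failure;
    }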

compile the program using condor_compile

    % condor_compile gcc -o simple.std simple.c

There are a lot of warnings, which can safely be ignored. The executable is fairly big, partly because of Condor's checkpointing and partly because it is now statically linked. Make it slightly smaller by getting rid of debugging symbols:

    % strip simple.std
    % ls -l
    total 1160
    -rw-r----- 1 simong simong     271 Mar 30 15:07 simple.c
    -rwxr-x--- 1 simong simong 1176456 Mar 30 15:08 simple.std

create a submit file to instruct condor to start 60 jobs lasting between 10 and 20 seconds

Use a text editor to create a file 'submit' with the following contents:

    notification = never
    Universe = standard
    Executable = simple.std
    Log = simple.log
    Output = simple.out
    Error = simple.error
    Arguments = 10
    Queue 20
    Arguments = 15
    Queue 20
    Arguments = 20
    Queue 20
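
The same submit file can also be produced by a short script, which makes it easy to change the sleep times or the number of copies. This is a hedged Python sketch: the helper name render_batch_submit is hypothetical, and the settings are copied from the submit file described above.

```python
def render_batch_submit(header, sleep_times, copies):
    """Return submit-file text with one 'Arguments'/'Queue' pair per sleep time."""
    lines = [f"{key} = {value}" for key, value in header.items()]
    for seconds in sleep_times:
        lines.append(f"Arguments = {seconds}")
        lines.append(f"Queue {copies}")
    return "\n".join(lines) + "\n"


header = {
    "notification": "never",
    "Universe": "standard",
    "Executable": "simple.std",
    "Log": "simple.log",
    "Output": "simple.out",
    "Error": "simple.error",
}
sleep_times = [10, 15, 20]
text = render_batch_submit(header, sleep_times, copies=20)
total_jobs = len(sleep_times) * 20  # three batches of 20 jobs each
print(text, end="")
print("total jobs:", total_jobs)
```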

start the condor jobs and check they are running

To start the jobs:

    % condor_submit submit

To check that they are running (all or nearly all should be running; few if any should be in an idle state):

    % condor_q

To check which machines they are running on:

    % condor_status -claimed

Last edited by simong 08/08/2017
