Introduction
Condor is a high-throughput computing system that can distribute its tasks across dedicated nodes as well as idle non-dedicated nodes. Condor schedules tasks via a main scheduling server (or servers), so users don't have to pick out by hand which computer each job should run on. In general, a job is sent to the first unclaimed node that meets the job's requirements.
Accessing the Condor pool
Currently, a user needs to be part of the condor group in order to access the CSE Condor pool. Please contact SS if you require access. Once a member of the group, the user can log on to any machine in the pool and start using Condor.
The Condor pool
Currently, all lab machines are part of the Condor pool, with the exception of the computers in chi lab.
Important binaries (Or: How I learned to stop worrying and use the Condor)
Condor has many executables. This section gives a brief rundown of the most important ones for normal users.
condor_status (Pool status)
To check the status of nodes in the Condor pool, use condor_status. It can be run on any machine that has Condor installed, but it only lists machines that can run jobs. For each of those nodes, it shows whether or not the node is available:
- Unclaimed means a Condor job can be set to run on that machine.
- Owner means that resources are currently being utilised by a non-Condor process.
- Claimed means that a Condor job is currently running on that machine.
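As a rough sketch of how the states above can be used to narrow the listing: condor_status accepts a -constraint option that takes a ClassAd expression. The example assumes the machine state is exposed through the State attribute with the values listed above. The state and constraint variable names are illustrative only, and the command is skipped on machines where Condor is not installed.

```shell
# Pick the state to filter on: Unclaimed, Owner, or Claimed.
state="Unclaimed"

# ClassAd expression matching machines in that state
# (assumes the attribute is called State).
constraint="State == \"$state\""

if command -v condor_status >/dev/null 2>&1; then
    # Show only the matching nodes.
    condor_status -constraint "$constraint"
else
    echo "condor_status not found; would run: condor_status -constraint '$constraint'"
fi
```

Changing state to Claimed or Owner gives the corresponding view of busy machines.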
condor_submit (Job submission)
To submit a job to the Condor pool, one must first create a submit file, which is a simple text file that tells Condor the details of the job as well as its machine requirements. An example submit file could look like:
Executable = hello
Universe = vanilla
Output = hello.out
Log = hello.log
Queue
This would run the hello executable in the vanilla universe, write its output to hello.out, and keep a log of the job's progress within the Condor pool in hello.log. It's a good idea to be explicit with the paths of any files specified in the submit file.
Once the file is created (let's say we created hello.submit in this case, though it can be called anything), one should use condor_submit to send the job to the queue. In this example, the command would be:
condor_submit hello.submit
This tells the server the details of the job, and the server then tries to match it up with an appropriate machine. To check the status of a job, one can check the log file specified in the submit file, or use condor_q.
The official project homepage has more examples of job submission files.
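The submission steps above can be sketched as a small script. It is only an illustration, under a few assumptions: the job files live in the current directory, absolute paths are built from $PWD to follow the advice about explicit paths, and the Error line (hello.err) is an addition that is not in the original example. The job_dir variable is a hypothetical name.

```shell
# Directory holding the executable and where output should go
# (assumed here to be the current directory).
job_dir="$PWD"

# Write the submit file with explicit, absolute paths.
cat > hello.submit <<EOF
Executable = $job_dir/hello
Universe   = vanilla
Output     = $job_dir/hello.out
Error      = $job_dir/hello.err
Log        = $job_dir/hello.log
Queue
EOF

# Submit only where Condor is installed.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit hello.submit
else
    echo "condor_submit not found; wrote hello.submit without submitting"
fi
```

Generating the file from a script keeps the paths correct when the job directory moves, at the cost of one extra step before submission.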
condor_q and condor_rm (Job management)
To check the status of queued jobs sent from the current machine, one can use condor_q. The -long and -analyze switches can be used for more detail.
To remove a job from the queue, use condor_rm followed by the job number, which can be determined from the log file or by using condor_q.
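A typical inspection session might look like the sketch below. The job number 42.0 is a made-up placeholder in cluster.process form; substitute one reported by condor_q. The condor_rm line is left commented out so the script does not remove anything by accident.

```shell
# Hypothetical job number; replace with a real one from condor_q.
job_id="42.0"

if command -v condor_q >/dev/null 2>&1; then
    condor_q                      # jobs queued from this machine
    condor_q -long "$job_id"      # full details of one job
    condor_q -analyze "$job_id"   # matchmaking analysis for that job
    # condor_rm "$job_id"         # uncomment to remove the job
else
    echo "condor_q not found; skipping queue inspection for job $job_id"
fi
```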
condor_compile (Checkpointing)
Some jobs can be relinked with condor_compile to take advantage of Condor's checkpointing system in the standard universe. Take the submission of hello as an example. One would start with a C++ source file, hello.c, and create the object file:
g++ -c hello.c -o hello.o
The next step would be to relink with condor_compile:
condor_compile g++ hello.o -o hello
The job can then be submitted using the standard universe, and it will be checkpointed and migrated to another machine whenever the machine it is running on enters the Owner state for an extended period of time.
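For completeness, a submit file for the relinked hello binary could be generated as sketched below. It mirrors the earlier vanilla example with only the Universe line changed, and assumes the pool's Condor installation supports the standard universe. The file name hello_standard.submit is a hypothetical choice.

```shell
# Submit file for the relinked binary; only the universe differs
# from the vanilla example.
cat > hello_standard.submit <<'EOF'
Executable = hello
Universe = standard
Output = hello.out
Log = hello.log
Queue
EOF

echo "Submit with: condor_submit hello_standard.submit"
```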
Universes
Condor offers several different universes in which to run jobs. The universe in which a user wants a job run must be specified in the submit file. The following universes are available:
- Standard for jobs that can be relinked with condor_compile
- Vanilla for jobs that can't be relinked
- PVM
- MPI
- Globus
- Java for running Java jobs on a JVM
- Scheduler
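As an illustration of selecting a universe other than vanilla or standard, the sketch below writes a submit file for the Java universe. It follows the usual pattern for Java jobs, where the executable is the compiled class file and the first argument names the class containing main. Hello.class and hello_java.submit are hypothetical names, and the nodes are assumed to have a JVM available.

```shell
# Submit file for a Java job (hypothetical class Hello).
cat > hello_java.submit <<'EOF'
Universe = java
Executable = Hello.class
Arguments = Hello
Output = Hello.out
Log = Hello.log
Queue
EOF

echo "Submit with: condor_submit hello_java.submit"
```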