Slurm Batch System
The Slurm system manages the department batch queue. Slurm runs jobs on the departmental and research compute nodes.
For a video lecture discussing the CS Cluster and Slurm, visit the Cluster Class link. You can also access the lecture repository directly.
The CS Slurm setup organizes compute resources into two main queues.
- The compsci queue contains all the nodes with access to the department NFS filesystem where most user home and project directories live. If unsure, use the compsci queue.
- The GPU hosts are in the compsci-gpu queue. Please do not submit general batch jobs here as they should be reserved for GPU computing.
Additional queues exist to hold computers owned by specific research groups.
- Donald Lab users can use the grisman queue for priority access to the grisman cluster.
- Sam Wiseman's group can use the nlplab or wiseman partitions for priority access.
- Michael Reiter's group can use the skynet partition for priority access to the skynet cluster.
- Lisa Will's group can use the wills partition for priority access to the wills cluster.
- Bhuwan Dhingra's group can use the nlplab or bhuwan partitions for priority access.
- Cynthia Rudin's group can use the rudin partition for priority access.
To submit to the research partitions, users will need to either specify the account to use:
sbatch -A PARTITIONNAME ...
or switch their default account:
sacctmgr modify user NetID set DefaultAccount=PARTITIONNAME
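For example, a member of the Wiseman group could submit a script (job.sh is a placeholder name) against that group's account; depending on the configuration you may also want to name the partition explicitly with -p:
sbatch -A wiseman -p wiseman job.sh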
All interaction with the queuing system must be done from one of the cluster head nodes. To access the head nodes, ssh to login.cs.duke.edu using your NetID and NetID password.
For the basics of Slurm operation, please see the official Slurm documentation.
Job scripts
All jobs submitted to Slurm must be shell scripts, and must be submitted from one of the cluster head nodes. Slurm scans the script text for option flags: the same flags that can be given on the sbatch or srun command line can be embedded in the script, and lines beginning with #SBATCH will be interpreted as containing Slurm flags.
The following job runs the program hostname with a one-minute time limit. The script passes Slurm the -D flag so the job runs in the working directory where sbatch was executed; embedding the flag this way is the equivalent of running sbatch -D . job.sh.
#!/bin/sh
#SBATCH -D .
#SBATCH --time=1
hostname
Defaults
By default, each job gets a time limit of 4 days and 30G of memory per node. If you need more, you will need to specify that in the parameters of your batch script.
scontrol show partition compsci
PartitionName=compsci
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=4-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=90-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=linux[1-50]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=1600 TotalNodes=50 SelectTypeParameters=NONE
   DefMemPerNode=30000 MaxMemPerNode=UNLIMITED
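If a job needs more than those defaults, raise the limits (up to the partition maximums shown above) either on the sbatch command line or with #SBATCH lines in the script. A minimal sketch; the values are only examples:
#!/bin/sh
#SBATCH --time=8-00:00:00
#SBATCH --mem=64G
hostname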
Examples
List queued and running jobs
squeue
List jobs belonging to a user
squeue -u user
List a user's running jobs
squeue -u user -t RUNNING
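squeue also accepts a format string through its standard -o/--format option when you want different columns; the format string here is only an illustration:
squeue -u user -o "%.10i %.9P %.20j %.8T %.10M"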
List compute partitions
sinfo
List compute nodes
sinfo -N
Show a node's resource attributes
sinfo -Nel
Submit a job
sbatch script.sh
Interactive session on a GPU host
srun -p compsci-gpu --gres=gpu:1 --pty bash -i
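Once the interactive shell starts, you can confirm which GPU was allocated to the session; nvidia-smi is the standard NVIDIA tool and should be present on the GPU nodes:
nvidia-smi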
Detailed job information
scontrol show jobid -dd jobid
Using an OR constraint to select between multiple GPUs
sbatch -p compsci-gpu --constraint="2080rtx|k80"
Request a specific type and number of GPU(s)
sbatch -p compsci-gpu --gres=gpu:2080rtx:1
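Like any other flag, the GPU request can also be embedded in the job script rather than given on the command line; a minimal sketch, where the GPU type and time limit are only examples:
#!/bin/sh
#SBATCH -p compsci-gpu
#SBATCH --gres=gpu:2080rtx:1
#SBATCH --time=60
nvidia-smi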
Direct a job to node linux41, requesting 10G of memory per GPU, with the environment variable GPU_SET=2 exported to the job
export GPU_SET=2; sbatch -w linux41 --mem-per-gpu=10g -p compsci-gpu job.sh
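Inside the job, GPU_SET is available as an ordinary environment variable because sbatch propagates the submitting shell's environment by default. A sketch of a script that uses it; my_program and its --gpus option are placeholders:
#!/bin/sh
#SBATCH -p compsci-gpu
#SBATCH --mem-per-gpu=10g
# GPU_SET was exported by the shell that ran sbatch
echo "Running with GPU_SET=$GPU_SET"
./my_program --gpus "$GPU_SET"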
Delete a job
scancel jobid
Here is a sample script; it will run in the compsci partition (the default).
#!/bin/csh -f
#SBATCH --mem=1G
#SBATCH --output=matlab.out
#SBATCH --error=slurm.err
# Run the MATLAB script non-interactively by feeding it on standard input
matlab -nodisplay < myfile.m
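Submit the script from a head node with sbatch (here assuming it was saved as matlab_job.sh, an illustrative name); the program output and errors will be written to matlab.out and slurm.err in the submission directory:
sbatch matlab_job.sh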
Please be aware that compute cluster machines are not backed up. Copy any important data to filesystems that are backed up to avoid losing it. In addition, be mindful that this is a shared resource: please minimize traffic to shared resources such as the network filesystems. If you need to read and write a lot of data, copy it to the node's local disk, compute the results there, and then store the results on longer-term storage.
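As a sketch of that copy-in/compute/copy-out pattern, a job script might stage its input to node-local scratch space, run there, and copy the results back; the /tmp location, file names, and program are all placeholders, and the right local scratch path may differ per node:
#!/bin/sh
#SBATCH -D .
#SBATCH --time=1-00:00:00
# Stage the input to node-local scratch (placeholder path)
SCRATCH=/tmp/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH"
cp input.dat "$SCRATCH/"
cd "$SCRATCH"
# Compute against the local copy (placeholder program)
~/bin/my_analysis input.dat > results.out
# Copy the results back to longer-term storage and clean up
cp results.out ~/project/results/
cd /
rm -rf "$SCRATCH"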