Job Monitoring¶
Show jobs queue¶
To see which jobs exist on the system, use:
:$ squeue --all
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
- JOBID: job id
- PARTITION: partition (use sinfo to list all available partitions)
- NAME: job name
- USER: username
- ST: job state, one of:
  - R: Running
  - PD: PenDing
  - TO: TimedOut
  - S: Suspended
  - CD: Completed
  - CA: CAncelled
  - F: Failed
  - NF: Node Failure
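The two-letter codes listed above can be expanded in scripts with a small lookup. The following is a minimal sketch: the function name slurm_state_name is hypothetical, it covers only the codes listed here, and the long names are assumed to match what squeue prints for its extended state field (%T).

```shell
# Hypothetical helper: expand a squeue ST code into its long state name.
# Only the codes from the list above are handled; anything else is UNKNOWN.
slurm_state_name() {
    case "$1" in
        R)  echo "RUNNING" ;;
        PD) echo "PENDING" ;;
        TO) echo "TIMEOUT" ;;
        S)  echo "SUSPENDED" ;;
        CD) echo "COMPLETED" ;;
        CA) echo "CANCELLED" ;;
        F)  echo "FAILED" ;;
        NF) echo "NODE_FAIL" ;;
        *)  echo "UNKNOWN" ;;
    esac
}

slurm_state_name PD   # prints PENDING
```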
To list only your own jobs, use:
squeue -u username
To check when pending jobs are expected to start, use:
squeue --start
To customize the output columns, pass a format string with -o:
squeue -o "%.8i %.9P %.10j %.10u %.8T %.5C %.4D %.6m %.10l %.10M %.10L %.16R"
See the squeue man page for more information:
man squeue
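The squeue options shown in this section combine well with standard text tools. As a rough sketch, the hypothetical helper count_states below summarizes how many jobs are in each state. It reads one state name per line, which is what squeue -h -o "%T" is expected to produce (-h suppresses the header, %T prints the long state name). The sample input is made up for illustration.

```shell
# Hypothetical helper: read one state name per line on stdin and print
# "STATE COUNT" pairs, sorted by state name.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# On a cluster (assumes squeue is on PATH):
#   squeue -h -u "$USER" -o "%T" | count_states

# Offline demonstration with made-up sample data:
printf 'RUNNING\nPENDING\nRUNNING\n' | count_states
# prints:
#   PENDING 1
#   RUNNING 2
```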
Job information¶
To view detailed information about a job, use:
:$ scontrol show job 11841
JobId=11841 JobName=rungmx.sh
UserId=ntell(1000) GroupId=ntell(1000)
Priority=4294900666 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=14:01:51 TimeLimit=2-00:00:00 TimeMin=N/A
SubmitTime=2015-07-28T00:51:21 EligibleTime=2015-07-28T00:51:21
StartTime=2015-07-28T00:51:22 EndTime=2015-07-30T00:51:22
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=all AllocNode:Sid=login01:5379
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node[001-072]
BatchHost=node001
NumNodes=72 NumCPUs=1440 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=20:0:*:* CoreSpec=*
MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/users/staff/ntell/Runs/Oxides/1273/rungmx.sh
WorkDir=/users/staff/ntell/Runs/Oxides/1273
StdErr=/users/staff/ntell/Runs/Oxides/1273/slurm-11841.out
StdIn=/dev/null
StdOut=/users/staff/ntell/Runs/Oxides/1273/slurm-11841.out
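Because scontrol prints Key=Value pairs separated by spaces, a single field can be pulled out with a short filter. The sketch below uses a hypothetical helper named job_field. It assumes that values contain no spaces, so it would not work for a Command or WorkDir path that includes a space.

```shell
# Hypothetical helper: print the value of one Key=Value field from
# "scontrol show job" output read on stdin. Assumes values have no spaces.
job_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '
        !found && $1 == key { sub(/^[^=]*=/, ""); print; found = 1 }'
}

# On a cluster (assumes scontrol is on PATH):
#   scontrol show job 11841 | job_field JobState

# Offline demonstration using a few lines from the output above:
sample='JobId=11841 JobName=rungmx.sh
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=14:01:51 TimeLimit=2-00:00:00 TimeMin=N/A'
printf '%s\n' "$sample" | job_field JobState   # prints RUNNING
```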
Pending Jobs¶
Common reasons why a job remains pending:
Dependency | This job is waiting for a dependent job to complete. |
NodeDown | A node required by the job is down. |
PartitionDown | The partition (queue) required by this job is in a DOWN state and temporarily accepting no jobs, for instance because of maintenance. Note that this message may be displayed for a time even after the system is back up. |
Priority | One or more jobs with higher priority than yours are waiting for this partition or advanced reservation. |
ReqNodeNotAvail | No nodes can be found that satisfy your limits, for instance because maintenance is scheduled and the job cannot finish before it starts. |
Reservation | The job is waiting for its advanced reservation to become available. |
Resources | The job is waiting for resources (nodes) to become available and will run when Slurm finds enough free nodes. |
SystemFailure | Failure of the Slurm system, a file system, the network, etc. |