Job Monitoring

Show job queue

To list all jobs on the system, use:

:$ squeue --all
JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)

JOBID: job ID
PARTITION: partition (use sinfo to list all available partitions)
NAME: job name
USER: username
ST: job state, one of:
- R: Running
- PD: PenDing
- TO: TimedOut
- S: Suspended
- CD: Completed
- CA: CAncelled
- F: Failed
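- NF: Node Failure
TIME: time used by the job
NODES: number of nodes allocated to (or requested by) the job
NODELIST(REASON): allocated nodes or, for a pending job, the reason it is waiting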
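
As a quick way to summarize the queue by state, the compact codes above can be tallied with a small shell function. This is only a minimal sketch: the name count_states is hypothetical, and it assumes one state code per input line, as squeue prints when given --noheader and the %t format field.

```shell
# count_states (hypothetical helper): tally state codes read from stdin,
# one code per line, and print "CODE COUNT" pairs sorted by code.
count_states() {
    awk 'NF { count[$1]++ } END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}

# Intended use on a Slurm login node (not executed here):
#   squeue --all --noheader -o "%t" | count_states
```

Reading from stdin keeps the function independent of squeue itself, so it can be tried on saved output as well.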

To list only your own jobs, use:

squeue -u username

To check the expected start time of pending jobs, use:

squeue --start

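To choose the output columns yourself, pass a format string with -o:

squeue -o "%.8i %.9P %.10j %.10u %.8T %.5C %.4D %.6m %.10l %.10M %.10L %.16R"

The fields are, in order: job ID (%i), partition (%P), job name (%j), user (%u), state (%T), CPUs (%C), nodes (%D), minimum memory (%m), time limit (%l), time used (%M), time left (%L), and node list or pending reason (%R). The number after the dot sets the column width, and the dot right-justifies the column.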

See the squeue man page for more information:

man squeue

Job information

To view detailed information about a job, pass its job ID to scontrol:

:$ scontrol show job 11841

JobId=11841 JobName=rungmx.sh
   UserId=ntell(1000) GroupId=ntell(1000)
   Priority=4294900666 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=14:01:51 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2015-07-28T00:51:21 EligibleTime=2015-07-28T00:51:21
   StartTime=2015-07-28T00:51:22 EndTime=2015-07-30T00:51:22
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=all AllocNode:Sid=login01:5379
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[001-072]
   BatchHost=node001
   NumNodes=72 NumCPUs=1440 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=20:0:*:* CoreSpec=*
   MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/users/staff/ntell/Runs/Oxides/1273/rungmx.sh
   WorkDir=/users/staff/ntell/Runs/Oxides/1273
   StdErr=/users/staff/ntell/Runs/Oxides/1273/slurm-11841.out
   StdIn=/dev/null
   StdOut=/users/staff/ntell/Runs/Oxides/1273/slurm-11841.out
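
Because the output above is a series of Key=Value pairs, a single field can be pulled out with standard tools. The sketch below is illustrative only: job_field is a hypothetical helper name, and it assumes the value contains no spaces (a Command with arguments, for example, would be cut at the first space).

```shell
# job_field (hypothetical helper): print the value of one Key=Value field
# from `scontrol show job` output supplied on stdin.
# Returns a non-zero status if the field is not present.
job_field() {
    # Put each Key=Value pair on its own line, then print the first match.
    tr -s ' \t' '\n' |
        awk -v key="$1" '
            !found && index($0, key "=") == 1 {
                print substr($0, length(key) + 2)
                found = 1
            }
            END { exit !found }'
}

# Intended use on a Slurm login node (not executed here):
#   scontrol show job 11841 | job_field JobState
```

For the job shown above, the JobState field would yield RUNNING and TimeLimit would yield 2-00:00:00.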

Pending Jobs

Common reasons why a job remains pending:

Dependency	This job is waiting for a dependent job to complete.
NodeDown	A node required by the job is down.
PartitionDown	The partition (queue) required by this job is in a DOWN state and temporarily accepting no jobs, for instance because of maintenance. Note that this message may be displayed for a time even after the system is back up.
Priority	One or more higher-priority jobs are queued ahead of yours for this partition or advanced reservation.
ReqNodeNotAvail	No nodes can be found that satisfy your limits, for instance because maintenance is scheduled and the job cannot finish before it starts.
Reservation	The job is waiting for its advanced reservation to become available.
Resources	The job is waiting for resources (nodes) to become available and will run when Slurm finds enough free nodes.
SystemFailure	Failure of the Slurm system, a file system, the network, etc.
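
To see which of these reasons applies to each waiting job, the pending jobs can be grouped by reason. The following is a rough sketch under stated assumptions: pending_by_reason is a hypothetical helper, and it expects lines of the form "JOBID REASON", as squeue prints with the %i and %r format fields. Only the first word of the reason is used.

```shell
# pending_by_reason (hypothetical helper): read "JOBID REASON" lines from
# stdin and print one line per reason listing the job IDs that share it.
pending_by_reason() {
    awk 'NF >= 2 {
             if ($2 in ids) ids[$2] = ids[$2] " " $1
             else ids[$2] = $1
         }
         END { for (r in ids) print r ": " ids[r] }' | LC_ALL=C sort
}

# Intended use on a Slurm login node (not executed here):
#   squeue --noheader --states=PENDING -o "%i %r" | pending_by_reason
```

Adding -u with your username to the squeue call would restrict the summary to your own jobs.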