Job Monitoring

Show jobs queue

To see which jobs exist on the system, use:

:$ squeue --all
JOBID PARTITION  NAME  USER ST  TIME NODES NODELIST(REASON)
  • JOBID: job id
  • PARTITION: partition (use sinfo to list all available partitions)
  • NAME: job name
  • USER: username
  • ST: job STate, one of:
    • R: Running
    • PD: PenDing
    • TO: TimedOut
    • S: Suspended
    • CD: Completed
    • CA: CAncelled
    • F: Failed
    • NF: Node Failure
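
The state codes above can be used to summarize the queue. The following is a minimal sketch, not a site-provided tool: the function names state_name and summarize_states are hypothetical, and the usage line assumes squeue is on your PATH and that the %t format field prints the compact state code.

```shell
#!/bin/sh
# Expand a compact squeue state code into the name used in the list above.
state_name() {
    case "$1" in
        R)  echo "Running" ;;
        PD) echo "Pending" ;;
        TO) echo "TimedOut" ;;
        S)  echo "Suspended" ;;
        CD) echo "Completed" ;;
        CA) echo "Cancelled" ;;
        F)  echo "Failed" ;;
        NF) echo "Node Failure" ;;
        *)  echo "Unknown ($1)" ;;
    esac
}

# Read one state code per line on stdin and print "<count> <state name>".
summarize_states() {
    sort | uniq -c | while read -r count code; do
        printf '%s %s\n' "$count" "$(state_name "$code")"
    done
}

# Usage on a cluster (prints one line per state for your own jobs):
#   squeue -h -u "$USER" -o "%t" | summarize_states
```

Reading the codes from stdin keeps the helper independent of squeue itself, so it can also be run against saved output.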

To list only your own jobs, use:

squeue -u username

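To check when pending jobs are expected to start, use:

squeue --start

To choose which columns are shown, pass a format string with -o. The example below prints job ID, partition, job name, user, state, CPUs, nodes, minimum memory, time limit, time used, time left, and node list (or pending reason):

squeue -o "%.8i %.9P %.10j %.10u %.8T %.5C %.4D %.6m %.10l %.10M %.10L %.16R"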

See the squeue man page for more information.

man squeue

Job information

To view detailed information about a job, use:

:$ scontrol show job 11841

JobId=11841 JobName=rungmx.sh
   UserId=ntell(1000) GroupId=ntell(1000)
   Priority=4294900666 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=14:01:51 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2015-07-28T00:51:21 EligibleTime=2015-07-28T00:51:21
   StartTime=2015-07-28T00:51:22 EndTime=2015-07-30T00:51:22
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=all AllocNode:Sid=login01:5379
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node[001-072]
   BatchHost=node001
   NumNodes=72 NumCPUs=1440 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=20:0:*:* CoreSpec=*
   MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/users/staff/ntell/Runs/Oxides/1273/rungmx.sh
   WorkDir=/users/staff/ntell/Runs/Oxides/1273
   StdErr=/users/staff/ntell/Runs/Oxides/1273/slurm-11841.out
   StdIn=/dev/null
   StdOut=/users/staff/ntell/Runs/Oxides/1273/slurm-11841.out
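
When only one or two of these fields are needed, for example in a script, the Key=Value layout can be picked apart with standard tools. The sketch below is one possible approach; job_field is a hypothetical helper name, and it assumes that values contain no spaces, which holds for the output shown above but may not hold for every job (for example a Command path with spaces).

```shell
#!/bin/sh
# Print the value of one Key=Value field from `scontrol show job` output
# read on stdin. Assumption: field values do not contain whitespace.
job_field() {
    key=$1
    # Put every Key=Value pair on its own line, then print the first
    # value whose key matches exactly.
    tr -s ' \t' '\n\n' |
        awk -F= -v k="$key" '$1 == k && !found { sub(/^[^=]*=/, ""); print; found = 1 }'
}

# Usage on a cluster:
#   scontrol show job 11841 | job_field JobState
#   scontrol show job 11841 | job_field TimeLimit
```

Matching on the whole key avoids confusing fields that share a prefix, such as StartTime and SubmitTime.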

Pending Jobs

Common reasons why a job remains pending:

  • Dependency: The job is waiting for a job it depends on to complete.
  • NodeDown: A node required by the job is down.
  • PartitionDown: The partition (queue) required by the job is in a DOWN state and is temporarily not accepting jobs, for instance because of maintenance. This reason may still be displayed for a while after the system is back up.
  • Priority: One or more higher-priority jobs exist for this partition or advanced reservation.
  • ReqNodeNotAvail: No nodes can be found that satisfy the job's limits, for instance because maintenance is scheduled and the job cannot finish before it begins.
  • Reservation: The job is waiting for its advanced reservation to become available.
  • Resources: The job is waiting for resources (nodes) to become available and will run when Slurm finds enough free nodes.
  • SystemFailure: Failure of the Slurm system, a file system, the network, etc.
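
To see which of these reasons applies to your own pending jobs, the reason can be printed next to each job ID and tallied. This is a rough sketch under stated assumptions: count_reasons is a hypothetical helper name, the usage line assumes squeue is available and that the %r format field prints the pending reason, and any extra detail after the first word of the reason is ignored.

```shell
#!/bin/sh
# Read "<jobid> <reason>" lines on stdin and print "<reason> <count>",
# sorted by reason name. Only the first word of the reason is used, and
# a trailing comma on that word is dropped.
count_reasons() {
    awk '{ r = $2; sub(/,$/, "", r); n[r]++ }
         END { for (r in n) print r, n[r] }' | sort
}

# Usage on a cluster (pending jobs of the current user only):
#   squeue -h -u "$USER" -t PENDING -o "%i %r" | count_reasons
```

A queue dominated by Priority or Resources usually just means waiting, whereas the other reasons in the list above point at a dependency, a reservation, or the state of the system.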