Development Environment

This section presents an overview of the installed development tools and libraries.

Category | Tools | Description
Editors | vi, vim, nano | Source code development
Compilers | intel, pgi, gnu, cuda | Create executables
Parallel | intelmpi, openmpi | Build parallel executables
Library Archiver | ar | Library archiver
Make | make, cmake | Build automation
Debuggers | gdb, gdb-ia, pgdbg, ddd, cuda-gdb | Debugging
Profilers | VTune, Scalasca, mpiP, gprof, pgprof | Performance analysis
Modules | module (Tcl) | Environment management

Compilers

The available compilers are accessed by loading the appropriate module.

To list all available compilers, use the following module command and check the “compilers” and “parallel” sections:

module avail

--------------------------------- /apps/modulefiles/compilers ----------------------------------
binutils/2.25         gdb/7.9.1(default)    gnu/5.4.0             intel/16.0.2
binutils/2.26         gnu/4.9.2(default)    gnu/6.1.0             intel/16.0.3
cuda/6.5.14           gnu/4.9.3             intel/15.0.3(default) java/1.8.0(default)
cuda/7.0.28           gnu/5.1.0             intel/15.0.6          pgi/15.5
cuda/7.5.18(default)  gnu/5.2.0             intel/16.0.0          pgi/16.4
gdb/7.11.1            gnu/5.3.0             intel/16.0.1          pgi/16.5(default)

---------------------------------- /apps/modulefiles/parallel ----------------------------------
intelmpi/5.0.3(default) openmpi/1.10.0/gnu      openmpi/1.8.5/intel
intelmpi/5.1.1          openmpi/1.10.0/intel    openmpi/1.8.7/gnu
intelmpi/5.1.2          openmpi/1.10.1/gnu      openmpi/1.8.7/intel
intelmpi/5.1.3          openmpi/1.10.1/intel    openmpi/1.8.8
mpiP/3.4.1(default)     openmpi/1.10.2/gnu      padb/3.3
mvapich2/gnu/2.2.2a     openmpi/1.10.2/intel    scalasca/2.2.2
mvapich2/intel/2.2.2a   openmpi/1.8.5/gnu       scalasca/2.3.1(default)

Compilers Overview

Overview of available compilers and supported languages.

Language | GNU | Intel | PGI | File Extensions
C | gcc | icc | pgcc | .c
C++ | g++ | icpc | pgc++ | .cpp, .cc, .C, .cxx
Fortran | gfortran | ifort | pgfortran | .f, .for, .ftn, .f90, .f95, .fpp

INTEL compiler suite

Intel® Compilers help create C, C++ and Fortran applications that can take full advantage of the advanced hardware capabilities available in Intel® processors and co-processors. They also simplify that development by providing high level parallel models and built-in features like explicit vectorization and optimization reports.

To use Intel’s compiler suite, load the intel module:

module load intel/15.0.3

icc --version
icc (ICC) 15.0.3 20150407

Optimization flags

Option | Description
-help advanced | Show options that control optimizations
-O[0-3] | Optimizer level
-fast | Maximize speed
-Os | Optimize for size
-opt-report[n] | Generate an optimization report
-x[target] | Generate specialized code for any Intel® processor that supports the instruction set specified by target (AVX, ...)
-m[target] | Generate specialized code for any Intel or compatible non-Intel processor that supports the instruction set specified by target (AVX, ...)
-xHost | Generate instruction sets up to the highest supported by the compilation host
-parallel | Auto-parallelizer: detects simply structured loops that may be safely executed in parallel
-ip, -ipo | Permit inlining and other interprocedural optimizations
-finline-functions | Enable function inlining
-unroll, -unroll-aggressive | Unroll loops
-[no-]prec-div | Improve [reduce] precision of floating-point divides; may slightly degrade [improve] performance
-fno-alias | Assume no aliasing in the program (off by default)
-[no]restrict | Enable [disable] pointer disambiguation with the restrict keyword

Suggested optimization flags

icc -O3 -xCORE-AVX-I

Check the full list of optimization options

GNU Compiler Collection

The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran, Java, Ada, and Go, as well as libraries for these languages (libstdc++, libgcj,…).

GCC was originally written as the compiler for the GNU operating system. The GNU system was developed to be 100% free software, free in the sense that it respects the user’s freedom.

To use the GNU compiler collection, load the gnu module:

module avail gnu

------------------- /apps/modulefiles/compilers ------------------
gnu/4.9.2(default) gnu/4.9.3          gnu/5.1.0          gnu/5.2.0
module load gnu

gcc --version
gcc (GCC) 4.9.2

Optimization flags

Option | Description
--help=optimizers | Show options that control optimizations
-Q -O[number] --help=optimizers | Show which optimizers are enabled at each level (O0-O3)
-O[0-3] | Optimizer level
-Ofast | Enables all -O3 optimizations plus -ffast-math, -fno-protect-parens and -fstack-arrays
-Os | Optimize for size: enables all -O2 optimizations that do not typically increase code size, plus further size-reduction optimizations
-ffast-math | Can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions; may, however, yield faster code for programs that do not require those guarantees
-march=[cputype] | Generate code for a specific processor: native, ivybridge, core-avx-i, ...
-mtune=[cputype] | Optimize code for a specific processor: native, ivybridge, core-avx-i, ... (-march=native implies -mtune=native)
-Q -march=native --help=target | Show target details
-m[target] | Enable use of instruction sets: -mavx, ...
-fomit-frame-pointer | Do not keep the frame pointer in a register for functions that do not need one
-fp-model [name] | May enhance the consistency of floating-point results by restricting certain optimizations
-fno-alias/-fno-fnalias | Assume no aliasing (within functions) in the program
-finline-functions | Consider all functions for inlining
-funroll-loops | Unroll loops whose number of iterations can be determined at compile time

Suggested optimization flags

gcc -O3 -mavx -march=ivybridge

Check the full list of optimization options

PGI Compilers & Tools

The Portland Group, Inc. or PGI is a company that produces a set of commercially available Fortran, C and C++ compilers for high-performance computing systems.

To use PGI’s compilers, load the pgi module:

module load pgi/15.5

pgcc -V
pgc++ -V
pgfortran -V

Optimization flags

Option | Description
-help=opt | Show options that control optimizations
-O[0-4] | Optimizer level
-fast | Overall maximum optimization
-Minfo | Display compile-time optimization listings
-Munroll | Unroll loops
-Minline | Inline functions
-Mvect | Vectorization
-Mconcur | Auto-parallelization
-Mipa=fast,inline | Interprocedural analysis (IPA)

Suggested optimization flags

pgcc -O4 -fast -Mvect

Check the full list of optimization options

Compiler Options

Option | Description
-c | Compile or assemble the source files, but do not link
-o [filename] | Name the output file [filename]
-g | Produce symbolic debug information
-pg | Generate extra code to write profile information suitable for the analysis program gprof
-D[name] | Predefine [name] as a macro for the preprocessor, with definition 1
-I[dir] | Add directory [dir] to the include file search path
-l[library] | Search for [library] when linking
-static | Force static linking
-L[dir] | Add directory [dir] to the library search path
-fpic | Generate position-independent code
--version, -v | Show version number
-help, -h | Show help information and list flags
-std=[standard] | Conform to a specific language [standard]

Optimization Flags x86_64 processors

To achieve optimal performance of your application, please consider using appropriate compiler flags. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for inter-procedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.

Here is an overview of the available optimization options for each compiler suite.

Optimization Level | Description
-O0 | No optimization (default); generates unoptimized code but has the fastest compilation time. Debugging support when used with -g
-O1 | Moderate optimization; optimize for size
-O2 | Optimize even more; maximize speed
-O3 | Full optimization; more aggressive loop and memory-access optimizations
-O4 (PGI only) | Performs all -O3 optimizations and enables hoisting of guarded invariant floating-point expressions
-Os (Intel, GNU) | Optimize space usage (code and data) of the resulting program
-Ofast | Maximize speed

Here is a list of some important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.

Please note that optimization flags do not always guarantee faster execution.

GNU | Intel | PGI | Description
-O[0-3] | -O[0-3] | -O[0-4] | Optimizer level
-Os | -Os | - | Optimize for space
-Ofast | -fast | -fast | Maximize speed across the entire program
-mtune/-march=native | -xHost | - | Generate instructions for the highest instruction set available on the host processor (AVX)
-funroll-loops | -unroll/-unroll-aggressive | -Munroll | Unroll loops
- | -opt-streaming-stores | -Mnontemporal | Specify whether streaming stores are generated
-finline-functions | -ip | -Minline | The compiler heuristically decides which functions are worth inlining
- | -ipo | -Mextract/-Minline | Permit inlining and other interprocedural optimizations among multiple source files

Vectorization

The compiler will automatically check for vectorization opportunities when higher optimization levels are used. ARIS supports AVX (Advanced Vector Extensions), recommended for its Intel Ivy Bridge processors.

GNU | Intel | PGI | Description
-O[2-3], -Ofast | -O[2-3], -fast | -O[2-4], -fast | Enabled at these optimization levels
-ftree-vectorize | -vec, -simd | -Mvect=simd | Enable explicitly
-fno-tree-vectorize | -no-vec | -Mnovect | Disable
-march=native | -xHost | -fast | Target the host’s AVX support
-mavx | -xAVX | - | Select the type of SIMD instructions

Full optimization lists for each compiler.

OpenMP

OpenMP (Open Multi-Processing) is an API that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on most processor architectures and operating systems. To enable OpenMP directives, the appropriate compiler option must be used.

OpenMP Flags

GNU | Intel | PGI | Description
-fopenmp | -openmp | -mp | Enable OpenMP directives
-floop-parallelize-all | -parallel | -Mconcur | Perform shared-memory auto-parallelization

OpenMP Environment Variables

Variable | Default | Description
OMP_NUM_THREADS | Number of processors (20) | Maximum number of threads
OMP_SCHEDULE | Intel: STATIC, no chunk size; GNU: DYNAMIC, chunk size 1 | Run-time schedule
OMP_DYNAMIC | FALSE | Dynamic adjustment of the number of threads
OMP_NESTED | FALSE | Nested parallelism
OMP_MAX_ACTIVE_LEVELS | unlimited | Maximum number of nested parallel regions
OMP_STACKSIZE | Intel: 4M; GNU: system dependent | Number of bytes to allocate for each OpenMP thread
OMP_THREAD_LIMIT | No enforced limit | Limits the number of simultaneously executing threads in an OpenMP program
GNU
GOMP_CPU_AFFINITY | system dependent | Bind threads to specific CPUs
OMP_WAIT_POLICY | threads wait actively for a short time before waiting passively | How waiting threads are handled
GOMP_DEBUG | - | Enable debugging output
GOMP_STACKSIZE | system dependent | Set default thread stack size
OMP_PROC_BIND | True | Whether threads may be moved between CPUs
INTEL
KMP_ALL_THREADS | No enforced limit | Limits the number of simultaneously executing threads in an OpenMP program
KMP_BLOCKTIME | 200 milliseconds | Time, in milliseconds, that a thread waits after completing a parallel region before sleeping
KMP_LIBRARY | throughput | Selects the OpenMP run-time library execution mode: throughput, turnaround, or serial
KMP_STACKSIZE | 4m | Number of bytes to allocate for each OpenMP thread’s private stack
KMP_AFFINITY | noverbose,respect,granularity=core | Enables the run-time library to bind threads to physical processing units

GNU libgomp

INTEL openmp

MPI

Message Passing Interface (MPI) is a standardized and portable message-passing parallel programming model designed for distributed memory systems.

The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in different computer programming languages such as Fortran, C, C++ and Java. There are several well-tested and efficient implementations of MPI.

ARIS supported MPI implementations

MPI implementations

Hardware interface | MPI flavour | Module | Versions | Execute
infiniband / shared memory | Intel MPI | intelmpi | 5.0.3 / 5.1.1 | srun
infiniband / shared memory | OpenMPI | openmpi | 1.8.5 / 1.8.7 / 1.8.8 / 1.10.0 / 1.10.1 | srun

Intel MPI library


Intel MPI library website

Reference manual

Available versions

module avail intelmpi

--------------- /apps/modulefiles/parallel ---------------
intelmpi/5.0.3(default) intelmpi/5.1.1

Language | GNU | Intel | PGI
C | mpicc | mpiicc | mpicc -cc=pgcc
C++ | mpicxx | mpiicpc | mpicxx -cxx=pgc++
Fortran | mpif90 | mpiifort | mpif90 -fc=pgfortran

To select the underlying C compiler, use the flag -cc=[compiler] (similarly -cxx= for C++ and -fc= for Fortran).

For example, to use Intel MPI with gcc/4.9.2 as the underlying compiler:

module load gnu/4.9.2
module load intelmpi/5.0.3

Now you can check the underlying compiler options, link flags and libraries:

mpicc -show

gcc -I/apps/compilers/intel/impi/5.0.3.048/intel64/include
-L/apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt
-L/apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker --enable-new-dtags
-Xlinker -rpath -Xlinker
/apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt -Xlinker -rpath
-Xlinker /apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker -rpath
-Xlinker /opt/intel/mpi-rt/5.0/intel64/lib/release_mt -Xlinker -rpath -Xlinker
/opt/intel/mpi-rt/5.0/intel64/lib -lmpifort -lmpi -lmpigi -ldl -lrt -lpthread

Similarly, for Intel MPI with icc/15.0.3:

module load intel/15.0.3
module load intelmpi/5.0.3
mpiicc -show

icc -I/apps/compilers/intel/impi/5.0.3.048/intel64/include
-L/apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt
-L/apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker --enable-new-dtags
-Xlinker -rpath -Xlinker
/apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt -Xlinker -rpath
-Xlinker /apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker -rpath
-Xlinker /opt/intel/mpi-rt/5.0/intel64/lib/release_mt -Xlinker -rpath -Xlinker
/opt/intel/mpi-rt/5.0/intel64/lib -lmpifort -lmpi -lmpigi -ldl -lrt -lpthread

Launch programs

The srun command launches MPI programs.

DON’T USE mpirun AND mpiexec
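A sketch of a Slurm job script that launches an Intel MPI program with srun; the job name, node counts and program name are illustrative, and should be adjusted to your allocation.

```shell
#!/bin/bash
# Illustrative Slurm job script for an Intel MPI program on ARIS.
#SBATCH --job-name=mympi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=20

module load intel/15.0.3
module load intelmpi/5.0.3

srun ./my_mpi_program    # always srun, never mpirun/mpiexec
```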

Intel MPI Runtime Environment Variables

Control MPI behavior.

Variable | Value | Description
I_MPI_DEBUG | 0-5 | Print debugging information when an MPI program starts running
I_MPI_PLATFORM | ivb | Optimize for the Intel® Xeon® processors formerly code-named Ivy Bridge
I_MPI_PERHOST | N/allcores | Define process layout: N processes per node, or all cores on a node
I_MPI_PIN | on/off | Turn process pinning on or off
I_MPI_PIN_PROCESSOR_LIST | Get Help | Define a processor subset and the mapping rules for MPI processes within this subset
I_MPI_PIN_DOMAIN | Get Help | Control process pinning for hybrid MPI/OpenMP applications
I_MPI_FABRICS | shm:dapl | Network fabrics to be used
I_MPI_EAGER_THRESHOLD | [nbytes] | Change the eager/rendezvous message size threshold for all devices (default 262144 bytes)
  • If the I_MPI_PIN_DOMAIN environment variable is defined, then the I_MPI_PIN_PROCESSOR_LIST environment variable setting is ignored.
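A sketch of typical exports before an srun launch; the variable values and program name are illustrative, not a recommended production configuration.

```shell
# Illustrative Intel MPI tuning for a hybrid MPI/OpenMP run.
export I_MPI_DEBUG=5          # print rank-to-node pinning at startup
export I_MPI_PIN=on           # enable process pinning
export I_MPI_PIN_DOMAIN=omp   # one pinning domain per OpenMP team
srun ./my_hybrid_program
```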

OpenMPI

Open Source High Performance Computing

OpenMPI website

Available versions:

module avail openmpi

------------ /apps/modulefiles/parallel---------------
openmpi/1.10.1/gnu(default) openmpi/1.10.1/intel
openmpi/1.10.0/gnu          openmpi/1.10.0/intel
openmpi/1.8.8
openmpi/1.8.7/gnu           openmpi/1.8.7/intel
openmpi/1.8.5/gnu           openmpi/1.8.5/intel

For each version there are two compiled flavors of openmpi: gnu and intel.

To select the underlying compiler, load the corresponding module flavor.

Language | Wrapper | GNU module | Intel module
C | mpicc | openmpi/[version]/gnu | openmpi/[version]/intel
C++ | mpicxx | openmpi/[version]/gnu | openmpi/[version]/intel
Fortran | mpif90 | openmpi/[version]/gnu | openmpi/[version]/intel

For example, to use OpenMPI with gcc/4.9.2 as the underlying compiler, load the gnu openmpi flavor:

module load gnu/4.9.2
module load openmpi/1.8.5/gnu

mpifort -show
gfortran -I/apps/parallel/openmpi/1.8.5/gnu/include -pthread
-I/apps/parallel/openmpi/1.8.5/gnu/lib -Wl,-rpath
-Wl,/apps/parallel/openmpi/1.8.5/gnu/lib -Wl,--enable-new-dtags
-L/apps/parallel/openmpi/1.8.5/gnu/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr
-lmpi_mpifh -lmpi

Similarly, for OpenMPI with icc/15.0.3:

module load intel/15.0.3
module load openmpi/1.8.5/intel

mpicc -show
icc -I/apps/parallel/openmpi/1.8.5/intel/include -pthread -Wl,-rpath
-Wl,/apps/parallel/openmpi/1.8.5/intel/lib -Wl,--enable-new-dtags
-L/apps/parallel/openmpi/1.8.5/intel/lib -lmpi

Launch programs

The srun command launches MPI programs.

DON’T USE mpirun AND mpiexec

General run-time tuning

Intel Xeon Phi

To use the Intel Xeon Phi coprocessor, load the intel compiler module.

module load intel

Offload programming model

Currently, only the offload programming model is supported on the ARIS supercomputer.

Control number of OMP threads

export MIC_ENV_PREFIX=MIC

## 60 physical cores, 4 hardware threads each
export MIC_OMP_NUM_THREADS=240

Technical Information (Intel Xeon Phi 7120p)

Output of the micinfo command on a Phi node with 2 coprocessors.

MicInfo Utility Log


    System Info
        HOST OS         : Linux
        OS Version      : 2.6.32-573.18.1.el6.x86_64
        Driver Version      : 3.7.1-1
        MPSS Version        : 3.7.1
        Host Physical Memory    : 64317 MB

Device No: 0, Device Name: mic0

    Version
        Flash Version        : 2.1.02.0391
        SMC Firmware Version     : 1.17.6900
        SMC Boot Loader Version  : 1.8.4326
        Coprocessor OS Version   : 2.6.38.8+mpss3.7.1
        Device Serial Number     : ADKC60900153

    Board
        Vendor ID        : 0x8086
        Device ID        : 0x225c
        Subsystem ID         : 0x7d95
        Coprocessor Stepping ID  : 2
        PCIe Width       : x16
        PCIe Speed       : 5 GT/s
        PCIe Max payload size    : 256 bytes
        PCIe Max read req size   : 4096 bytes
        Coprocessor Model    : 0x01
        Coprocessor Model Ext    : 0x00
        Coprocessor Type     : 0x00
        Coprocessor Family   : 0x0b
        Coprocessor Family Ext   : 0x00
        Coprocessor Stepping     : C0
        Board SKU        : C0PRQ-7120 P/A/X/D
        ECC Mode         : Enabled
        SMC HW Revision      : Product 300W Passive CS

    Cores
        Total No of Active Cores : 61
        Voltage          : 0 uV
        Frequency        : 1238095 kHz

    Thermal
        Fan Speed Control    : N/A
        Fan RPM          : N/A
        Fan PWM          : N/A
        Die Temp         : 34 C

    GDDR
        GDDR Vendor      : Samsung
        GDDR Version         : 0x6
        GDDR Density         : 4096 Mb
        GDDR Size        : 15872 MB
        GDDR Technology      : GDDR5 
        GDDR Speed       : 5.500000 GT/s 
        GDDR Frequency       : 2750000 kHz
        GDDR Voltage         : 1501000 uV

Device No: 1, Device Name: mic1

    Version
        Flash Version        : 2.1.02.0391
        SMC Firmware Version     : 1.17.6900
        SMC Boot Loader Version  : 1.8.4326
        Coprocessor OS Version   : 2.6.38.8+mpss3.7.1
        Device Serial Number     : ADKC60900052

    Board
        Vendor ID        : 0x8086
        Device ID        : 0x225c
        Subsystem ID         : 0x7d95
        Coprocessor Stepping ID  : 2
        PCIe Width       : x16
        PCIe Speed       : 5 GT/s
        PCIe Max payload size    : 256 bytes
        PCIe Max read req size   : 4096 bytes
        Coprocessor Model    : 0x01
        Coprocessor Model Ext    : 0x00
        Coprocessor Type     : 0x00
        Coprocessor Family   : 0x0b
        Coprocessor Family Ext   : 0x00
        Coprocessor Stepping     : C0
        Board SKU        : C0PRQ-7120 P/A/X/D
        ECC Mode         : Enabled
        SMC HW Revision      : Product 300W Passive CS

    Cores
        Total No of Active Cores : 61
        Voltage          : 0 uV
        Frequency        : 1238095 kHz

    Thermal
        Fan Speed Control    : N/A
        Fan RPM          : N/A
        Fan PWM          : N/A
        Die Temp         : 36 C

    GDDR
        GDDR Vendor      : Samsung
        GDDR Version         : 0x6
        GDDR Density         : 4096 Mb
        GDDR Size        : 15872 MB
        GDDR Technology      : GDDR5 
        GDDR Speed       : 5.500000 GT/s 
        GDDR Frequency       : 2750000 kHz
        GDDR Voltage         : 1501000 uV

NVIDIA CUDA

CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing – an approach known as GPGPU. The CUDA platform is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements, for the execution of compute kernels.

To use NVIDIA’s compiler suite, load the cuda module:

module avail cuda

------------------- /apps/modulefiles/compilers ------------------
cuda/6.5.14          cuda/7.0.28          cuda/7.5.18(default)
module load cuda

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

Example output of the deviceQuery sample on a GPU node with 2 Tesla K40 cards.

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K40m"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11520 MBytes (12079136768 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            876 MHz (0.88 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K40m"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11520 MBytes (12079136768 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            876 MHz (0.88 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : No
> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K40m
Result = PASS

Debuggers

INTEL | gdb-ia
PGI | pgdbg
GNU | gdb, ddd
CUDA | cuda-gdb

GDB

GDB, the GNU Project debugger, allows you to see what is going on ‘inside’ another program while it executes, or what another program was doing at the moment it crashed.

Compile your code with debugging information

gcc [flags] -g [source file] -o [output file]

Start session

gdb ./a.out

GDB commands

Command | Description
help | Display a list of named classes of commands
run | Start the program
attach | Attach to a running process outside GDB
step | Go to the next source line, stepping into functions/subroutines
next | Go to the next source line; function/subroutine calls are executed without stepping into them
continue | Continue executing
break | Set a breakpoint
watch | Set a watchpoint to stop execution when the value of a variable or an expression changes
list | Display (by default 10) lines of source surrounding the current line
print | Print the value of a variable
backtrace | Display a stack frame for each active subroutine
detach | Detach from a process
quit | Exit GDB

To execute shell commands during a debugging session, prefix the command with shell, e.g.

(gdb) shell ls -l

GDB-IA

The GDB variant provided by Intel.

module load intel

PGDBG

PGDBG® is a graphical debugger for Linux, OS X and Windows capable of debugging serial and parallel programs including MPI process-parallel, OpenMP thread-parallel and hybrid MPI/OpenMP applications. PGDBG can debug programs on SMP workstations, servers, distributed-memory clusters and hybrid clusters where each node contains multiple 64-bit or 32-bit multicore processors.

module load pgi

DDD

GNU DDD is a graphical front-end for command-line debuggers such as GDB, DBX, WDB, Ladebug, JDB, XDB, the Perl debugger, the bash debugger bashdb, the GNU Make debugger remake, or the Python debugger pydb. Besides “usual” front-end features such as viewing source texts, DDD has become famous for its interactive graphical data display, where data structures are displayed as graphs. For more information (and more screenshots), see the DDD Manual.

CUDA-GDB

Performance Analysis

Performance Analysis Tool | Version
Intel VTune | 2015
PGI pgprof | 2015
GNU gprof | 2.25
Scalasca | 2.2.2
mpiP | 3.4.1
nvprof | -

GPROF

GNU profiler gprof

module load binutils

GNU gprof is a widely used profiling tool for Unix systems which produces an execution profile of C and Fortran programs. It can show the application call graph, which represents the calling relationships between functions in the program, and the percentage of total execution time spent in each function.

Compile and link your code with the -pg flag:

gcc [flags] -g -pg [source_file] -o [output_file]

Invoke gprof to analyze and display the profiling results:

gprof options [executable-file] gmon.out bb-data [yet-more-profile-data-files...] [> outfile]

Output Options

  • --flat-profile: prints the total amount of time spent in, and the number of calls to, each function
  • --graph: prints the call-graph analysis of the application execution
  • --annotated-source: prints profiling information next to the original source code

GPROF manual

VTUNE Amplifier XE

Whether you are tuning for the first time or doing advanced performance optimization, Intel® VTune™ Amplifier XE provides the data needed to meet a wide variety of tuning needs. Collect a rich set of performance data for hotspots, threading, OpenCL, locks and waits, DirectX*, bandwidth, and more. But good data is not enough: you need tools to mine the data and make it easy to interpret. Powerful analysis lets you sort, filter, and visualize results on the timeline and on your source, identify serial time and load imbalance, and select slow OpenMP instances to discover why they are slow.

module load intel

GUI

amplxe-gui

Please use the GUI only on login nodes, to analyze your reports.

Command Line

You can use the command line tool amplxe-cl to analyze your program on compute nodes.

Check help information

amplxe-cl -help
amplxe-cl -help collect

Perform hotspot analysis

amplxe-cl -collect hotspots -result-dir mydir /home/test/myprogram

Check result summary

amplxe-cl -R summary -r mydir

Vtune web

SCALASCA

Scalasca is a software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks – in particular those concerning communication and synchronization – and offers guidance in exploring their causes.

module load scalasca/2.2.2

Scalasca Documentation

mpiP

mpiP is a lightweight profiling library for MPI applications. Because it only collects statistical information about MPI functions, mpiP generates considerably less overhead and much less data than tracing tools. All the information captured by mpiP is task-local. It only uses communication during report generation, typically at the end of the experiment, to merge results from all of the tasks into one output file.

module load mpip

mpiP Documentation

nvprof

You can use nvprof to collect and view profiling data from the command line, or import the data into the visual profiler nvvp.

Command line nvprof

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview

nvprof <GPU_EXECUTABLE>

Remote profiling with nvprof

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#unique_307789860

nvprof --export-profile timeline.nvprof <GPU_EXECUTABLE>

To view the collected timeline data, import the timeline.nvprof file into nvvp as described in Import Single-Process nvprof Session: http://docs.nvidia.com/cuda/profiler-users-guide/index.html#import-session

MPI Profiling

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#mpi-profiling

The nvprof profiler can be used to profile individual MPI processes.

srun nvprof -o output.%h.%p.%q{SLURM_PROCID} <GPU_EXECUTABLE>