Development Environment

This section presents an overview of the installed development tools and libraries.

Category | Tools | Description
Editors | vi, vim, nano | Source code development
Compilers | intel, pgi, gnu, cuda | Create executables
Parallel | intelmpi, openmpi | Build parallel executables
Library Archiver | ar | Library archiver
Make | make, cmake | Build automation
Debuggers | gdb, gdb-ia, pgdbg, ddd, cuda-gdb | Debugging
Profilers | VTune, Scalasca, mpiP, gprof, pgprof | Performance analysis
Modules | module (Tcl) | Environment management

Compilers

The available compilers are accessed by loading the appropriate module.

To list all available compilers, use the following module command and check the “compilers” and “parallel” sections:

module avail

--------------------------------- /apps/modulefiles/compilers ----------------------------------
binutils/2.25         gdb/7.9.1(default)    gnu/5.4.0             intel/16.0.2
binutils/2.26         gnu/4.9.2(default)    gnu/6.1.0             intel/16.0.3
cuda/6.5.14           gnu/4.9.3             intel/15.0.3(default) java/1.8.0(default)
cuda/7.0.28           gnu/5.1.0             intel/15.0.6          pgi/15.5
cuda/7.5.18(default)  gnu/5.2.0             intel/16.0.0          pgi/16.4
gdb/7.11.1            gnu/5.3.0             intel/16.0.1          pgi/16.5(default)

---------------------------------- /apps/modulefiles/parallel ----------------------------------
intelmpi/5.0.3(default) openmpi/1.10.0/gnu      openmpi/1.8.5/intel
intelmpi/5.1.1          openmpi/1.10.0/intel    openmpi/1.8.7/gnu
intelmpi/5.1.2          openmpi/1.10.1/gnu      openmpi/1.8.7/intel
intelmpi/5.1.3          openmpi/1.10.1/intel    openmpi/1.8.8
mpiP/3.4.1(default)     openmpi/1.10.2/gnu      padb/3.3
mvapich2/gnu/2.2.2a     openmpi/1.10.2/intel    scalasca/2.2.2
mvapich2/intel/2.2.2a   openmpi/1.8.5/gnu       scalasca/2.3.1(default)

Compilers Overview

Overview of available compilers and supported languages.

Language | GNU | Intel | PGI | File Extensions
C | gcc | icc | pgcc | .c
C++ | g++ | icpc | pgc++ | .cpp, .cc, .C, .cxx
Fortran | gfortran | ifort | pgfortran | .f, .for, .ftn, .f90, .f95, .fpp

INTEL compiler suite

Intel® Compilers help create C, C++ and Fortran applications that can take full advantage of the advanced hardware capabilities available in Intel® processors and co-processors. They also simplify that development by providing high level parallel models and built-in features like explicit vectorization and optimization reports.

To use Intel’s compiler suite, load the intel module:

module load intel/15.0.3

icc --version
icc (ICC) 15.0.3 20150407

Optimization flags

Option | Description
-help advanced | Show options that control optimizations
-O[0-3] | Optimizer level
-fast | Maximize speed
-Os | Optimize for size
-opt-report[n] | Generate an optimization report
-x[target] | Generate specialized code for any Intel® processor that supports the instruction set specified by target (AVX, ...)
-m[target] | Generate specialized code for any Intel or compatible non-Intel processor that supports the instruction set specified by target (AVX, ...)
-xHost | Generate instruction sets up to the highest supported by the compilation host
-parallel | Auto-parallelizer: detects simply structured loops that may be safely executed in parallel
-ip, -ipo | Permit inlining and other interprocedural optimizations
-finline-functions | Enable function inlining
-unroll, -unroll-aggressive | Unroll loops
-[no-]prec-div | Improve [reduce] precision of floating-point divides; may slightly degrade [improve] performance
-fno-alias | Assume no aliasing in the program (off by default)
-[no]restrict | Enable [disable] pointer disambiguation with the restrict keyword

Suggested optimization flags

icc -O3 -xCORE-AVX-I

Check the full list of optimization options

GNU Compiler Collection

The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran, Java, Ada, and Go, as well as libraries for these languages (libstdc++, libgcj,…).

GCC was originally written as the compiler for the GNU operating system. The GNU system was developed to be 100% free software, free in the sense that it respects the user’s freedom.

To use the GNU compiler collection, load the gnu module:

module avail gnu

------------------- /apps/modulefiles/compilers ------------------
gnu/4.9.2(default) gnu/4.9.3          gnu/5.1.0          gnu/5.2.0
module load gnu

gcc --version
gcc (GCC) 4.9.2

Optimization flags

Option | Description
--help=optimizers | Show options that control optimizations
-Q -O[number] --help=optimizers | Show which optimizers are enabled at each level (O0-O3)
-O[0-3] | Optimizer level
-Ofast | Enables all -O3 optimizations plus -ffast-math, -fno-protect-parens and -fstack-arrays
-Os | Optimize for size: enables all -O2 optimizations that do not typically increase code size, plus further size-reduction optimizations
-ffast-math | Can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions; may, however, yield faster code for programs that do not require those guarantees
-march=[cputype] | Generate code for a specific processor: native, ivybridge, core-avx-i, ...
-mtune=[cputype] | Optimize code for a specific processor: native, ivybridge, core-avx-i, ... (-march=native implies -mtune=native)
-Q -march=native --help=target | Show target details
-m[target] | Enable use of instruction sets: -mavx, ...
-fomit-frame-pointer | Do not keep the frame pointer in a register for functions that do not need one
-fp-model [name] | May enhance the consistency of floating-point results by restricting certain optimizations
-fno-alias/-fno-fnalias | Assume no aliasing (within functions) in the program
-finline-functions | Consider all functions for inlining
-funroll-loops | Unroll loops whose number of iterations can be determined at compile time

Suggested optimization flags

gcc -O3 -mavx -march=ivybridge

Check the full list of optimization options

PGI Compilers & Tools

The Portland Group, Inc. or PGI is a company that produces a set of commercially available Fortran, C and C++ compilers for high-performance computing systems.

To use PGI’s compilers, load the pgi module:

module load pgi/15.5

pgcc -V
pgc++ -V
pgfortran -V

Optimization flags

Option | Description
-help=opt | Show options that control optimizations
-O[0-4] | Optimizer level
-fast | Overall maximum optimization
-Minfo | Display compile-time optimization listings
-Munroll | Unroll loops
-Minline | Inline functions
-Mvect | Vectorization
-Mconcur | Auto-parallelization
-Mipa=fast,inline | Interprocedural analysis (IPA)

Suggested optimization flags

pgcc -O4 -fast -Mvect

Check the full list of optimization options

Compiler Options

Option | Description
-c | Compile or assemble the source files, but do not link
-o [filename] | Name the output file [filename]
-g | Produce symbolic debug information
-pg | Generate extra code to write profile information suitable for the analysis program gprof
-D[name] | Predefine [name] as a macro for the preprocessor, with definition 1
-I[dir] | Add directory [dir] to the include file search path
-l[library] | Search for [library] when linking
-static | Force static linking
-L[dir] | Add directory [dir] to the library search path
-fpic | Generate position-independent code
--version, -v | Show version number
-help, -h | Show help information and list flags
-std=[standard] | Conform to a specific language [standard]

Optimization Flags x86_64 processors

To achieve optimal performance of your application, please consider using appropriate compiler flags. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for inter-procedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.

Here is an overview of the available optimization options for each compiler suite.

Optimization Level | Description
-O0 | No optimization (default); generates unoptimized code but has the fastest compilation time. Debugging support when used with -g
-O1 | Moderate optimization; optimize for size
-O2 | Optimize even more; maximize speed
-O3 | Full optimization; more aggressive loop and memory-access optimizations
-O4 (PGI only) | Performs all -O3 optimizations and enables hoisting of guarded invariant floating-point expressions
-Os (Intel, GNU) | Optimize space usage (code and data) of the resulting program
-Ofast | Maximize speed

Here is a list of some important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.

Please note that optimization flags do not always guarantee faster execution.

GNU | Intel | PGI | Description
-O[0-3] | -O[0-3] | -O[0-4] | Optimizer level
-Os | -Os | - | Optimize for space
-Ofast | -fast | -fast | Maximize speed across the entire program
-mtune/-march=native | -xHost | - | Generate instructions for the highest instruction set available on the host processor (AVX)
-funroll-loops | -unroll/-unroll-aggressive | -Munroll | Unroll loops
- | -opt-streaming-stores | -Mnontemporal | Specify whether streaming stores are generated
-finline-functions | -ip | -Minline | The compiler heuristically decides which functions are worth inlining
- | -ipo | -Mextract/-Minline | Permit inlining and other interprocedural optimizations among multiple source files

Vectorization

The compiler will automatically check for vectorization opportunities when higher optimization levels are used. ARIS supports AVX (Advanced Vector Extensions), recommended for its Intel Ivy Bridge processors.

GNU | Intel | PGI | Description
-O[2-3], -Ofast | -O[2-3], -fast | -O[2-4], -fast | Enabled at these optimization levels
-ftree-vectorize | -vec, -simd | -Mvect=simd | Enable explicitly
-fno-tree-vectorize | -no-vec | -Mnovect | Disable
-march=native | -xHost | -fast | Target the host’s AVX support
-mavx | -xAVX | - | Select the type of SIMD instructions

Full optimization lists for each compiler.

OpenMP

OpenMP (Open Multi-Processing) is an API that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on most processor architectures and operating systems. To enable OpenMP directives, the appropriate compiler option must be used.

OpenMP Flags

GNU | Intel | PGI | Description
-fopenmp | -openmp | -mp | Enable OpenMP directives
-floop-parallelize-all | -parallel | -Mconcur | Perform shared-memory auto-parallelization

OpenMP Environment Variables

Variable | Default | Description
OMP_NUM_THREADS | Number of processors (20) | Maximum number of threads
OMP_SCHEDULE | Intel: STATIC, no chunk size; GNU: DYNAMIC, chunk size 1 | Run-time schedule
OMP_DYNAMIC | FALSE | Dynamic adjustment of the number of threads
OMP_NESTED | FALSE | Nested parallelism
OMP_MAX_ACTIVE_LEVELS | unlimited | Maximum number of nested parallel regions
OMP_STACKSIZE | Intel: 4M; GNU: system dependent | Number of bytes to allocate for each OpenMP thread
OMP_THREAD_LIMIT | No enforced limit | Limits the number of simultaneously executing threads in an OpenMP program
GNU
GOMP_CPU_AFFINITY | system dependent | Bind threads to specific CPUs
OMP_WAIT_POLICY | threads wait actively for a short time before waiting passively | How waiting threads are handled
GOMP_DEBUG | - | Enable debugging output
GOMP_STACKSIZE | system dependent | Set default thread stack size
OMP_PROC_BIND | True | Whether threads may be moved between CPUs
INTEL
KMP_ALL_THREADS | No enforced limit | Limits the number of simultaneously executing threads in an OpenMP program
KMP_BLOCKTIME | 200 milliseconds | Time, in milliseconds, that a thread waits after completing a parallel region before sleeping
KMP_LIBRARY | throughput | Selects the OpenMP run-time library execution mode: throughput, turnaround, or serial
KMP_STACKSIZE | 4m | Number of bytes to allocate for each OpenMP thread’s private stack
KMP_AFFINITY | noverbose,respect,granularity=core | Enables the run-time library to bind threads to physical processing units

GNU libgomp

INTEL openmp

MPI

Message Passing Interface (MPI) is a standardized and portable message-passing parallel programming model designed for distributed memory systems.

The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in different computer programming languages such as Fortran, C, C++ and Java. There are several well-tested and efficient implementations of MPI.

ARIS supported MPI implementations

MPI implementations

Hardware interface | MPI flavour | Module | Versions | Execute
infiniband / shared memory | Intel MPI | intelmpi | 5.0.3 / 5.1.1 | srun
infiniband / shared memory | OpenMPI | openmpi | 1.8.5 / 1.8.7 / 1.8.8 / 1.10.0 / 1.10.1 | srun

Intel MPI library


Intel MPI library website

Reference manual

Available versions

module avail intelmpi

--------------- /apps/modulefiles/parallel ---------------
intelmpi/5.0.3(default) intelmpi/5.1.1

Language | GNU | Intel | PGI
C | mpicc | mpiicc | mpicc -cc=pgcc
C++ | mpicxx | mpiicpc | mpicxx -cxx=pgc++
Fortran | mpif90 | mpiifort | mpif90 -fc=pgfortran

To select the underlying C compiler, use the flag -cc=[compiler] (similarly -cxx= for C++ and -fc= for Fortran).

For example, to use Intel MPI with gcc/4.9.2 as the underlying compiler:

module load gnu/4.9.2
module load intelmpi/5.0.3

Now you can check the underlying compiler options, link flags and libraries:

mpicc -show

gcc -I/apps/compilers/intel/impi/5.0.3.048/intel64/include
-L/apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt
-L/apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker --enable-new-dtags
-Xlinker -rpath -Xlinker
/apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt -Xlinker -rpath
-Xlinker /apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker -rpath
-Xlinker /opt/intel/mpi-rt/5.0/intel64/lib/release_mt -Xlinker -rpath -Xlinker
/opt/intel/mpi-rt/5.0/intel64/lib -lmpifort -lmpi -lmpigi -ldl -lrt -lpthread

Similarly, for Intel MPI with icc/15.0.3:

module load intel/15.0.3
module load intelmpi/5.0.3
mpiicc -show

icc -I/apps/compilers/intel/impi/5.0.3.048/intel64/include
-L/apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt
-L/apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker --enable-new-dtags
-Xlinker -rpath -Xlinker
/apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt -Xlinker -rpath
-Xlinker /apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker -rpath
-Xlinker /opt/intel/mpi-rt/5.0/intel64/lib/release_mt -Xlinker -rpath -Xlinker
/opt/intel/mpi-rt/5.0/intel64/lib -lmpifort -lmpi -lmpigi -ldl -lrt -lpthread

Launch programs

The srun command launches MPI programs.

DON’T USE mpirun AND mpiexec
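A sketch of a Slurm job script that launches an Intel MPI program with srun; the job name, node counts and program name are illustrative, and should be adjusted to your allocation.

```shell
#!/bin/bash
# Illustrative Slurm job script for an Intel MPI program on ARIS.
#SBATCH --job-name=mympi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=20

module load intel/15.0.3
module load intelmpi/5.0.3

srun ./my_mpi_program    # always srun, never mpirun/mpiexec
```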

Intel MPI Runtime Environment Variables

Control MPI behavior.

Variable | Value | Description
I_MPI_DEBUG | 0-5 | Print debugging information when an MPI program starts running
I_MPI_PLATFORM | ivb | Optimize for the Intel® Xeon® processors formerly code-named Ivy Bridge
I_MPI_PERHOST | N/allcores | Define process layout: N processes per node, or all cores on a node
I_MPI_PIN | on/off | Turn process pinning on or off
I_MPI_PIN_PROCESSOR_LIST | Get Help | Define a processor subset and the mapping rules for MPI processes within this subset
I_MPI_PIN_DOMAIN | Get Help | Control process pinning for hybrid MPI/OpenMP applications
I_MPI_FABRICS | shm:dapl | Network fabrics to be used
I_MPI_EAGER_THRESHOLD | [nbytes] | Change the eager/rendezvous message size threshold for all devices (default 262144 bytes)
  • If the I_MPI_PIN_DOMAIN environment variable is defined, then the I_MPI_PIN_PROCESSOR_LIST environment variable setting is ignored.
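A sketch of typical exports before an srun launch; the variable values and program name are illustrative, not a recommended production configuration.

```shell
# Illustrative Intel MPI tuning for a hybrid MPI/OpenMP run.
export I_MPI_DEBUG=5          # print rank-to-node pinning at startup
export I_MPI_PIN=on           # enable process pinning
export I_MPI_PIN_DOMAIN=omp   # one pinning domain per OpenMP team
srun ./my_hybrid_program
```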

OpenMPI

Open Source High Performance Computing

OpenMPI website

Available versions:

module avail openmpi

------------ /apps/modulefiles/parallel---------------
openmpi/1.10.1/gnu(default) openmpi/1.10.1/intel
openmpi/1.10.0/gnu          openmpi/1.10.0/intel
openmpi/1.8.8
openmpi/1.8.7/gnu           openmpi/1.8.7/intel
openmpi/1.8.5/gnu           openmpi/1.8.5/intel

For each version there are two compiled flavors of openmpi: gnu and intel.

To select the underlying compiler, load the corresponding module flavor.

Language | Wrapper | GNU module | Intel module
C | mpicc | openmpi/[version]/gnu | openmpi/[version]/intel
C++ | mpicxx | openmpi/[version]/gnu | openmpi/[version]/intel
Fortran | mpif90 | openmpi/[version]/gnu | openmpi/[version]/intel

For example, to use OpenMPI with gcc/4.9.2 as the underlying compiler, load the gnu openmpi flavor:

module load gnu/4.9.2
module load openmpi/1.8.5/gnu

mpifort -show
gfortran -I/apps/parallel/openmpi/1.8.5/gnu/include -pthread
-I/apps/parallel/openmpi/1.8.5/gnu/lib -Wl,-rpath
-Wl,/apps/parallel/openmpi/1.8.5/gnu/lib -Wl,--enable-new-dtags
-L/apps/parallel/openmpi/1.8.5/gnu/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr
-lmpi_mpifh -lmpi

Similarly, for OpenMPI with icc/15.0.3:

module load intel/15.0.3
module load openmpi/1.8.5/intel

mpicc -show
icc -I/apps/parallel/openmpi/1.8.5/intel/include -pthread -Wl,-rpath
-Wl,/apps/parallel/openmpi/1.8.5/intel/lib -Wl,--enable-new-dtags
-L/apps/parallel/openmpi/1.8.5/intel/lib -lmpi

Launch programs

The srun command launches MPI programs.

DON’T USE mpirun AND mpiexec

General run-time tuning

Intel Xeon Phi

To use the Intel Xeon Phi coprocessor, load the intel compiler module.

module load intel

Offload programming model

Currently, only the offload programming model is supported on the ARIS supercomputer.

Control number of OMP threads

export MIC_ENV_PREFIX=MIC

## 60 physical cores, 4 hardware threads each
export MIC_OMP_NUM_THREADS=240

Technical Information (Intel Xeon Phi 7120p)

Output of the micinfo command on a Phi node with 2 coprocessors.

MicInfo Utility Log


    System Info
        HOST OS         : Linux
        OS Version      : 2.6.32-573.18.1.el6.x86_64
        Driver Version      : 3.7.1-1
        MPSS Version        : 3.7.1
        Host Physical Memory    : 64317 MB

Device No: 0, Device Name: mic0

    Version
        Flash Version        : 2.1.02.0391
        SMC Firmware Version     : 1.17.6900
        SMC Boot Loader Version  : 1.8.4326
        Coprocessor OS Version   : 2.6.38.8+mpss3.7.1
        Device Serial Number     : ADKC60900153

    Board
        Vendor ID        : 0x8086
        Device ID        : 0x225c
        Subsystem ID         : 0x7d95
        Coprocessor Stepping ID  : 2
        PCIe Width       : x16
        PCIe Speed       : 5 GT/s
        PCIe Max payload size    : 256 bytes
        PCIe Max read req size   : 4096 bytes
        Coprocessor Model    : 0x01
        Coprocessor Model Ext    : 0x00
        Coprocessor Type     : 0x00
        Coprocessor Family   : 0x0b
        Coprocessor Family Ext   : 0x00
        Coprocessor Stepping     : C0
        Board SKU        : C0PRQ-7120 P/A/X/D
        ECC Mode         : Enabled
        SMC HW Revision      : Product 300W Passive CS

    Cores
        Total No of Active Cores : 61
        Voltage          : 0 uV
        Frequency        : 1238095 kHz

    Thermal
        Fan Speed Control    : N/A
        Fan RPM          : N/A
        Fan PWM          : N/A
        Die Temp         : 34 C

    GDDR
        GDDR Vendor      : Samsung
        GDDR Version         : 0x6
        GDDR Density         : 4096 Mb
        GDDR Size        : 15872 MB
        GDDR Technology      : GDDR5 
        GDDR Speed       : 5.500000 GT/s 
        GDDR Frequency       : 2750000 kHz
        GDDR Voltage         : 1501000 uV

Device No: 1, Device Name: mic1

    Version
        Flash Version        : 2.1.02.0391
        SMC Firmware Version     : 1.17.6900
        SMC Boot Loader Version  : 1.8.4326
        Coprocessor OS Version   : 2.6.38.8+mpss3.7.1
        Device Serial Number     : ADKC60900052

    Board
        Vendor ID        : 0x8086
        Device ID        : 0x225c
        Subsystem ID         : 0x7d95
        Coprocessor Stepping ID  : 2
        PCIe Width       : x16
        PCIe Speed       : 5 GT/s
        PCIe Max payload size    : 256 bytes
        PCIe Max read req size   : 4096 bytes
        Coprocessor Model    : 0x01
        Coprocessor Model Ext    : 0x00
        Coprocessor Type     : 0x00
        Coprocessor Family   : 0x0b
        Coprocessor Family Ext   : 0x00
        Coprocessor Stepping     : C0
        Board SKU        : C0PRQ-7120 P/A/X/D
        ECC Mode         : Enabled
        SMC HW Revision      : Product 300W Passive CS

    Cores
        Total No of Active Cores : 61
        Voltage          : 0 uV
        Frequency        : 1238095 kHz

    Thermal
        Fan Speed Control    : N/A
        Fan RPM          : N/A
        Fan PWM          : N/A
        Die Temp         : 36 C

    GDDR
        GDDR Vendor      : Samsung
        GDDR Version         : 0x6
        GDDR Density         : 4096 Mb
        GDDR Size        : 15872 MB
        GDDR Technology      : GDDR5 
        GDDR Speed       : 5.500000 GT/s 
        GDDR Frequency       : 2750000 kHz
        GDDR Voltage         : 1501000 uV

NVIDIA CUDA

CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing – an approach known as GPGPU. The CUDA platform is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements, for the execution of compute kernels.

To use NVIDIA’s compiler suite, load the cuda module:

module avail cuda

------------------- /apps/modulefiles/compilers ------------------
cuda/6.5.14          cuda/7.0.28          cuda/7.5.18(default)
module load cuda

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

Example output of the deviceQuery sample on a GPU node with 2 Tesla K40 cards.

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K40m"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11520 MBytes (12079136768 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            876 MHz (0.88 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Tesla K40m"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 11520 MBytes (12079136768 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Max Clock rate:                            876 MHz (0.88 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : No
> Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K40m
Result = PASS

Debuggers

INTEL | gdb-ia
PGI | pgdbg
GNU | gdb, ddd
CUDA | cuda-gdb

GDB

GDB, the GNU Project debugger, allows you to see what is going on ‘inside’ another program while it executes, or what another program was doing at the moment it crashed.

Compile your code with debugging information

gcc [flags] -g [source file] -o [output file]

Start session

gdb ./a.out

GDB commands

Command | Description
help | Display a list of named classes of commands
run | Start the program
attach | Attach to a running process outside GDB
step | Go to the next source line, stepping into functions/subroutines
next | Go to the next source line; function/subroutine calls are executed without stepping into them
continue | Continue executing
break | Set a breakpoint
watch | Set a watchpoint to stop execution when the value of a variable or an expression changes
list | Display (by default 10) lines of source surrounding the current line
print | Print the value of a variable
backtrace | Display a stack frame for each active subroutine
detach | Detach from a process
quit | Exit GDB

To execute shell commands during a debugging session, prefix the command with shell, e.g.

(gdb) shell ls -l

GDB-IA

The GDB variant provided by Intel.

module load intel

PGDBG

PGDBG® is a graphical debugger for Linux, OS X and Windows capable of debugging serial and parallel programs including MPI process-parallel, OpenMP thread-parallel and hybrid MPI/OpenMP applications. PGDBG can debug programs on SMP workstations, servers, distributed-memory clusters and hybrid clusters where each node contains multiple 64-bit or 32-bit multicore processors.

module load pgi

DDD

GNU DDD is a graphical front-end for command-line debuggers such as GDB, DBX, WDB, Ladebug, JDB, XDB, the Perl debugger, the bash debugger bashdb, the GNU Make debugger remake, or the Python debugger pydb. Besides “usual” front-end features such as viewing source texts, DDD has become famous for its interactive graphical data display, where data structures are displayed as graphs. For more information (and more screenshots), see the DDD Manual.

CUDA-GDB

Performance Analysis

Performance Analysis Tool | Version
Intel VTune | 2015
PGI pgprof | 2015
GNU gprof | 2.25
Scalasca | 2.2.2
mpiP | 3.4.1
nvprof | -

GPROF

GNU profiler gprof

module load binutils

GNU gprof is a widely used profiling tool for Unix systems which produces an execution profile of C and Fortran programs. It can show the application call graph, which represents the calling relationships between functions in the program, and the percentage of total execution time spent in each function.

Compile and link your code with the -pg flag:

gcc [flags] -g -pg [source_file] -o [output_file]

Invoke gprof to analyze and display the profiling results:

gprof options [executable-file] gmon.out bb-data [yet-more-profile-data-files...] [> outfile]

Output Options

  • --flat-profile: prints the total amount of time spent in, and the number of calls to, each function
  • --graph: prints the call-graph analysis of the application execution
  • --annotated-source: prints profiling information next to the original source code

GPROF manual

VTUNE Amplifier XE

Whether you are tuning for the first time or doing advanced performance optimization, Intel® VTune™ Amplifier XE provides the data needed to meet a wide variety of tuning needs. Collect a rich set of performance data for hotspots, threading, OpenCL, locks and waits, DirectX*, bandwidth, and more. But good data is not enough: you need tools to mine the data and make it easy to interpret. Powerful analysis lets you sort, filter, and visualize results on the timeline and on your source, identify serial time and load imbalance, and select slow OpenMP instances to discover why they are slow.

module load intel

GUI

amplxe-gui

Please use the GUI only on login nodes, to analyze your reports.

Command Line

You can use the command line tool amplxe-cl to analyze your program on compute nodes.

Check help information

amplxe-cl -help
amplxe-cl -help collect

Perform hotspot analysis

amplxe-cl -collect hotspots -result-dir mydir /home/test/myprogram

Check result summary

amplxe-cl -R summary -r mydir

Vtune web

SCALASCA

Scalasca is a software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks – in particular those concerning communication and synchronization – and offers guidance in exploring their causes.

module load scalasca/2.2.2

Scalasca Documentation

mpiP

mpiP is a lightweight profiling library for MPI applications. Because it only collects statistical information about MPI functions, mpiP generates considerably less overhead and much less data than tracing tools. All the information captured by mpiP is task-local. It only uses communication during report generation, typically at the end of the experiment, to merge results from all of the tasks into one output file.

module load mpip

mpiP Documentation

nvprof

You can use nvprof to collect and view profiling data from the command line, or import the data into the visual profiler nvvp.

Command line nvprof

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview

nvprof <GPU_EXECUTABLE>

Remote profiling with nvprof

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#unique_307789860

nvprof --export-profile timeline.nvprof <GPU_EXECUTABLE>

To view the collected timeline data, import the timeline.nvprof file into nvvp as described in Import Single-Process nvprof Session: http://docs.nvidia.com/cuda/profiler-users-guide/index.html#import-session

MPI Profiling

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#mpi-profiling

The nvprof profiler can be used to profile individual MPI processes.

srun nvprof -o output.%h.%p.%q{SLURM_PROCID} <GPU_EXECUTABLE>