Development Environment¶
This section presents an overview of the installed development tools and libraries.
Tools | Versions | Description |
---|---|---|
Editors | vi, Vim, nano | Source code editing |
Compilers | intel, pgi, gnu, cuda | Create executables |
Parallel | intelmpi, openmpi | Build parallel executables |
Library archiver | ar | Create static libraries |
Make | make, cmake | Build automation |
Debuggers | gdb, gdb-ia, pgdbg, ddd, cuda-gdb | Debugging |
Profilers | VTune, Scalasca, mpiP, gprof, pgprof | Performance analysis |
Modules | module (Tcl) | Environment management |
Compilers¶
The available compilers are accessed by loading the appropriate module.
To list all available compilers, use the following module command and check the “compilers” and “parallel” sections:
    module avail

    --------------------------------- /apps/modulefiles/compilers ----------------------------------
    binutils/2.25          gdb/7.9.1(default)    gnu/5.4.0              intel/16.0.2
    binutils/2.26          gnu/4.9.2(default)    gnu/6.1.0              intel/16.0.3
    cuda/6.5.14            gnu/4.9.3             intel/15.0.3(default)  java/1.8.0(default)
    cuda/7.0.28            gnu/5.1.0             intel/15.0.6           pgi/15.5
    cuda/7.5.18(default)   gnu/5.2.0             intel/16.0.0           pgi/16.4
    gdb/7.11.1             gnu/5.3.0             intel/16.0.1           pgi/16.5(default)

    ---------------------------------- /apps/modulefiles/parallel ----------------------------------
    intelmpi/5.0.3(default)   openmpi/1.10.0/gnu     openmpi/1.8.5/intel
    intelmpi/5.1.1            openmpi/1.10.0/intel   openmpi/1.8.7/gnu
    intelmpi/5.1.2            openmpi/1.10.1/gnu     openmpi/1.8.7/intel
    intelmpi/5.1.3            openmpi/1.10.1/intel   openmpi/1.8.8
    mpiP/3.4.1(default)       openmpi/1.10.2/gnu     padb/3.3
    mvapich2/gnu/2.2.2a       openmpi/1.10.2/intel   scalasca/2.2.2
    mvapich2/intel/2.2.2a     openmpi/1.8.5/gnu      scalasca/2.3.1(default)
Compilers Overview¶
Overview of available compilers and supported languages.
Language | GNU | INTEL | PORTLAND | File Extension |
---|---|---|---|---|
C | gcc | icc | pgcc | .c |
C++ | g++ | icpc | pgc++ | .cpp, .cc, .C, .cxx |
FORTRAN | gfortran | ifort | pgfortran | .f, .for, .ftn, .f90, .f95, .fpp |
INTEL compiler suite¶
Intel® Compilers help create C, C++ and Fortran applications that can take full advantage of the advanced hardware capabilities available in Intel® processors and co-processors. They also simplify that development by providing high level parallel models and built-in features like explicit vectorization and optimization reports.
To use the Intel compiler suite, load the intel module:
    module load intel/15.0.3
    icc --version
    icc (ICC) 15.0.3 20150407
Optimization flags¶
Option | Description |
---|---|
-help advanced | Show options that control optimizations |
-O[0-3] | Optimizer level |
-fast | Maximize speed |
-Os | Optimize for size |
-opt-report[n] | Generates an optimization report |
-x[target] | Generates specialized code for any Intel® processor that supports the instruction set specified by target. AVX,… |
-m[target] | Generates specialized code for any Intel processor or compatible, non-Intel processor that supports the instruction set specified by target. AVX,… |
-xhost | Generates instruction sets up to the highest that is supported by the compilation host |
-parallel | The auto-parallelizer detects simply structured loops that may be safely executed in parallel. |
-ip, -ipo | Permits inlining and other interprocedural optimizations |
-finline-functions | This option enables function inlining |
-unroll, -unroll-aggressive | Unroll loops |
-[no-]prec-div | Improves [reduces] precision of floating point divides. This may slightly degrade [improve] performance. |
-fno-alias | Assumes no aliasing in the program. Off by default. |
-[no]restrict | Enables [disables] pointer disambiguation with the restrict keyword. |
Suggested optimization flags¶
icc -O3 -xCORE-AVX-I
Check the full list of optimization options.
GNU Compiler Collection¶
The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran, Java, Ada, and Go, as well as libraries for these languages (libstdc++, libgcj,…).
GCC was originally written as the compiler for the GNU operating system. The GNU system was developed to be 100% free software, free in the sense that it respects the user’s freedom.
To use the GNU compiler collection, load the gnu module:
    module avail gnu

    ------------------- /apps/modulefiles/compilers ------------------
    gnu/4.9.2(default)   gnu/4.9.3   gnu/5.1.0   gnu/5.2.0

    module load gnu
    gcc --version
    gcc (GCC) 4.9.2
Optimization flags¶
Option | Description |
---|---|
--help=optimizers | Show options that control optimizations |
-Q -O[number] --help=optimizers | Show the optimizers enabled at each level O0-O3 |
-O[0-3] | Optimizer level |
-Ofast | Enables all -O3 optimizations plus -ffast-math, -fno-protect-parens and -fstack-arrays |
-Os | Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size, and performs further optimizations designed to reduce code size. |
-ffast-math | Can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules for math functions; may, however, yield faster code for programs that do not require these guarantees. |
-march=[cputype] | Generate code for a specific processor: native, ivybridge, core-avx-i, … |
-mtune=[cputype] | Tune code for a specific processor: native, ivybridge, core-avx-i, … (-march=native implies -mtune=native) |
-Q -march=native --help=target | Show target details |
-m[target] | Enable use of specific instruction sets: -mavx, … |
-fomit-frame-pointer | Don’t keep the frame pointer in a register for functions that don’t need one. |
-finline-functions | Consider all functions for inlining |
-funroll-loops | Unroll loops whose number of iterations can be determined at compile time |
Suggested optimization flags¶
gcc -O3 -mavx -march=ivybridge
Check the full list of optimization options.
PGI Compilers & Tools¶
The Portland Group, Inc. or PGI is a company that produces a set of commercially available Fortran, C and C++ compilers for high-performance computing systems.
To use the PGI compilers, load the pgi module:
    module load pgi/15.5
    pgcc -V
    pgc++ -V
    pgfortran -V
Optimization flags¶
Option | Description |
---|---|
-help=opt | Show options that control optimizations |
-O[0-4] | Optimizer level |
-fast | Generally optimal set of flags for maximum speed |
-Minfo | Display compile-time optimization listings. |
-Munroll | Unroll loops |
-Minline | Inline functions |
-Mvect | Vectorization |
-Mconcur | Auto-Parallelization |
-Mipa=fast,inline | Interprocedural analysis (IPA) |
Suggested optimization flags¶
pgcc -O4 -fast -Mvect
Check the full list of optimization options.
Compiler Options¶
Option | Description |
---|---|
-c | Compile or assemble the source files, but do not link. |
-o [filename] | Name the output file [filename]. |
-g | Produces symbolic debug information. |
-pg | Generate extra code to write profile information suitable for the analysis program gprof. |
-D[name] | Predefine [name] as a macro for the preprocessor, with definition 1. |
-I[dir] | Specifies an additional directory [dir] to search for include files. |
-l[library] | Search for [library] when linking. |
-static | Force static linkage |
-L[dir] | Search for libraries in a specified directory [dir]. |
-fpic | Generate position-independent code. |
--version, -v | Show version number. |
-help, -h | Show help information and list flags |
-std=[standard] | Conform to a specific language [standard] |
Optimization Flags x86_64 processors¶
To achieve optimal performance of your application, please consider using appropriate compiler flags. Generally, the highest impact can be achieved by selecting an appropriate optimization level, by targeting the architecture of the computer (CPU, cache, memory system), and by allowing for inter-procedural analysis (inlining, etc.). There is no set of options that gives the highest speed-up for all applications. Consequently, different combinations have to be explored.
Here is an overview of the available optimization options for each compiler suite.
Optimization Level | Description |
---|---|
-O0 | No optimization (default), generates unoptimized code but has the fastest compilation time. Debugging support if using -g |
-O1 | Moderate optimization, optimize for size |
-O2 | Optimize even more, maximize speed |
-O3 | Full optimization, more aggressive loop and memory-access optimizations. |
-O4 | (PGI only) Performs all -O3 optimizations and enables hoisting of guarded invariant floating-point expressions. |
-Os | (Intel, GNU) Optimize space usage (code and data) of resulting program. |
-Ofast | Maximizes speed |
Here is a list of some important compiler options that affect application performance, based on the target architecture, application behavior, loading, and debugging.
Please note that optimization flags do not always guarantee faster execution times.
Option GNU | Option Intel | Option PGI | Description |
---|---|---|---|
-O[0-3] | -O[0-3] | -O[0-4] | Optimizer level |
-Os | -Os | - | Optimize space |
-Ofast | -fast | -fast | Maximizes speed across the entire program. |
-mtune, -march=native | -xHost | - | The compiler generates instructions for the highest instruction set available on the host processor (AVX). |
-funroll-loops | -unroll, -unroll-aggressive | -Munroll | Unroll loops |
- | -opt-streaming-stores | -Mnontemporal | Specifies whether streaming stores are generated |
-finline-functions | -ip | -Minline / -Mrecursive | The compiler heuristically decides which functions are worth inlining. |
- | -ipo | -Minline -Mextract | Permits inlining and other interprocedural optimizations among multiple source files. |
Vectorization
The compiler automatically checks for vectorization opportunities when higher optimization levels are used. ARIS supports AVX (Advanced Vector Extensions), the instruction set recommended for Intel Ivy Bridge processors.
Option GNU | Option Intel | Option PGI | Description |
---|---|---|---|
-O[2-3], -Ofast | -O[2-3], -fast | -O[2-4], -fast | Enable |
-ftree-vectorize | -vec, -simd | -Mvect=simd | Enable explicitly |
-fno-tree-vectorize | -no-vec | -Mnovect | Disable |
-march=native | -xHost | -fast | Support AVX |
-mavx | -xAVX | - | type of SIMD instructions |
Full optimization lists for each compiler.
OpenMP¶
OpenMP (Open Multi-Processing) is an API that supports multi-platform shared-memory multiprocessing programming in C, C++ and Fortran, on most processor architectures and operating systems. To enable OpenMP directives, the appropriate compiler option must be used.
OpenMP Flags
Option GNU | Option Intel | Option PGI | Description |
---|---|---|---|
-fopenmp | -openmp | -mp | Enable omp directives |
-floop-parallelize-all | -parallel | -Mconcur | Performs shared-memory auto-parallelization. |
OpenMP Environment Variables
Variable | Default | Description |
---|---|---|
OMP_NUM_THREADS | Number of processors (20) | Max num. threads |
OMP_SCHEDULE | {INTEL} STATIC, no chunk size specified, {GNU} DYNAMIC, chunk size =1 | run-time schedule |
OMP_DYNAMIC | FALSE | dynamic adjustment of number of threads |
OMP_NESTED | FALSE | nested parallelism |
OMP_MAX_ACTIVE_LEVELS | unlimited | maximum number of nested parallel regions |
OMP_STACKSIZE | {INTEL: 4M} {GNU: system dependent} | number of bytes to allocate for each OpenMP thread's stack |
OMP_THREAD_LIMIT | none | Limits the number of simultaneously executing threads in an OpenMP program |
GNU Variable | Default | Description |
---|---|---|
GOMP_CPU_AFFINITY | system dependent | Bind threads to specific CPUs |
OMP_WAIT_POLICY | threads wait actively for a short time before waiting passively | How waiting threads are handled |
GOMP_DEBUG | unset | Enable debugging output |
GOMP_STACKSIZE | system dependent | Set default thread stack size |
OMP_PROC_BIND | true | Whether threads may be moved between CPUs |
INTEL Variable | Default | Description |
---|---|---|
KMP_ALL_THREADS | No enforced limit | Limits the number of simultaneously executing threads in an OpenMP program. |
KMP_BLOCKTIME | 200 milliseconds | Sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping. |
KMP_LIBRARY | throughput | Selects the OpenMP run-time library execution mode. The options for the variable value are throughput, turnaround, and serial. |
KMP_STACKSIZE | 4m | Sets the number of bytes to allocate for each OpenMP* thread to use as the private stack for the thread. |
KMP_AFFINITY | noverbose,respect,granularity=core | Enables run-time library to bind threads to physical processing units. |
MPI¶
Message Passing Interface (MPI) is a standardized and portable message-passing parallel programming model designed to function on distributed-memory systems.
The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in different computer programming languages such as Fortran, C, C++ and Java. There are several well-tested and efficient implementations of MPI.
MPI implementations supported on ARIS:
Hardware interface | MPI flavour | module | version | Execute |
---|---|---|---|---|
infiniband/ shared-memory | Intel MPI | intelmpi | 5.0.3 / 5.1.1 | srun |
infiniband/ shared-memory | OpenMPI | openmpi | 1.8.5 / 1.8.7 / 1.8.8 / 1.10.0 / 1.10.1 | srun |
Intel MPI library¶
Available versions
    module avail intelmpi

    --------------- /apps/modulefiles/parallel ---------------
    intelmpi/5.0.3(default)   intelmpi/5.1.1
Language | GNU | INTEL | PORTLAND |
---|---|---|---|
C | mpicc | mpiicc | mpicc -cc=pgcc |
C++ | mpicxx | mpicpc | mpicc -cxx=pgc++ |
FORTRAN | mpif90 | mpiifort | mpif90 -fc=pgfortran |
To select the underlying compiler, use the flag -cc=[compiler]. For example, to use Intel MPI with gcc/4.9.2 as the underlying compiler:

    module load gnu/4.9.2
    module load intelmpi/5.0.3
Now you can check the underlying compiler options, link flags and libraries:

    mpicc -show
    gcc -I/apps/compilers/intel/impi/5.0.3.048/intel64/include -L/apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt -L/apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker /apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt -Xlinker -rpath -Xlinker /apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker -rpath -Xlinker /opt/intel/mpi-rt/5.0/intel64/lib/release_mt -Xlinker -rpath -Xlinker /opt/intel/mpi-rt/5.0/intel64/lib -lmpifort -lmpi -lmpigi -ldl -lrt -lpthread
Respectively, Intel MPI with icc/15.0.3:

    module load intel/15.0.3
    module load intelmpi/5.0.3

    mpiicc -show
    icc -I/apps/compilers/intel/impi/5.0.3.048/intel64/include -L/apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt -L/apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker /apps/compilers/intel/impi/5.0.3.048/intel64/lib/release_mt -Xlinker -rpath -Xlinker /apps/compilers/intel/impi/5.0.3.048/intel64/lib -Xlinker -rpath -Xlinker /opt/intel/mpi-rt/5.0/intel64/lib/release_mt -Xlinker -rpath -Xlinker /opt/intel/mpi-rt/5.0/intel64/lib -lmpifort -lmpi -lmpigi -ldl -lrt -lpthread
Launch programs¶
The srun command launches MPI programs.

DO NOT USE mpirun OR mpiexec.
Intel MPI Runtime Environment Variables¶
Control MPI behavior.
Variable | Value | Description |
---|---|---|
I_MPI_DEBUG | 0-5 | Print out debugging information when the MPI program starts running. |
I_MPI_PLATFORM | ivb | Optimize for the Intel® Xeon® Processors formerly code named Ivy Bridge |
I_MPI_PERHOST | N/allcores | Define process layout, N processes per node, allcores on a node. |
I_MPI_PIN | on/off | Turn on/off process pinning. |
I_MPI_PIN_PROCESSOR_LIST | Get Help | Define a processor subset and the mapping rules for MPI processes within this subset. |
I_MPI_PIN_DOMAIN | Get Help | Control process pinning for hybrid MPI/OpenMP applications |
I_MPI_FABRICS | shm:dapl | Network fabrics to be used |
I_MPI_EAGER_THRESHOLD | [nbytes] | Change the eager/rendezvous message size threshold for all devices, default 262144 bytes |
- If the I_MPI_PIN_DOMAIN environment variable is defined, then the I_MPI_PIN_PROCESSOR_LIST environment variable setting is ignored.
OpenMPI¶
Open Source High Performance Computing
Available versions:
    module avail openmpi

    ------------ /apps/modulefiles/parallel ---------------
    openmpi/1.10.1/gnu(default)   openmpi/1.10.1/intel   openmpi/1.10.0/gnu
    openmpi/1.10.0/intel          openmpi/1.8.8          openmpi/1.8.7/gnu
    openmpi/1.8.7/intel           openmpi/1.8.5/gnu      openmpi/1.8.5/intel
For each version there are two compiled flavors of OpenMPI, gnu and intel. To select the underlying compiler, simply load the corresponding module flavor.
Language | wrapper | GNU module | INTEL module |
---|---|---|---|
C | mpicc | openmpi/[version]/gnu | openmpi/[version]/intel |
C++ | mpicxx | openmpi/[version]/gnu | openmpi/[version]/intel |
FORTRAN | mpif90 | openmpi/[version]/gnu | openmpi/[version]/intel |
For example, to use OpenMPI with gcc/4.9.2 as the underlying compiler, load the gnu OpenMPI flavor:

    module load gnu/4.9.2
    module load openmpi/1.8.5/gnu

    mpifort -show
    gfortran -I/apps/parallel/openmpi/1.8.5/gnu/include -pthread -I/apps/parallel/openmpi/1.8.5/gnu/lib -Wl,-rpath -Wl,/apps/parallel/openmpi/1.8.5/gnu/lib -Wl,--enable-new-dtags -L/apps/parallel/openmpi/1.8.5/gnu/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi
Respectively, OpenMPI with icc/15.0.3:

    module load intel/15.0.3
    module load openmpi/1.8.5/intel

    mpicc -show
    icc -I/apps/parallel/openmpi/1.8.5/intel/include -pthread -Wl,-rpath -Wl,/apps/parallel/openmpi/1.8.5/intel/lib -Wl,--enable-new-dtags -L/apps/parallel/openmpi/1.8.5/intel/lib -lmpi
Launch programs¶
The srun command launches MPI programs.

DO NOT USE mpirun OR mpiexec.
Intel Xeon Phi¶
To use the Intel Xeon Phi coprocessor, load the intel compiler module:
module load intel
Offload programming model¶
Currently only the offload programming model is supported on the ARIS supercomputer.
Control number of OMP threads¶
    export MIC_ENV_PREFIX=MIC
    ## 60 physical cores, 4 hardware threads each
    export MIC_OMP_NUM_THREADS=240
Technical Information (Intel Xeon Phi 7120p)¶
Output of the micinfo command on one PHI node with 2 coprocessors:
    MicInfo Utility Log

    System Info
        HOST OS                  : Linux
        OS Version               : 2.6.32-573.18.1.el6.x86_64
        Driver Version           : 3.7.1-1
        MPSS Version             : 3.7.1
        Host Physical Memory     : 64317 MB

    Device No: 0, Device Name: mic0

    Version
        Flash Version            : 2.1.02.0391
        SMC Firmware Version     : 1.17.6900
        SMC Boot Loader Version  : 1.8.4326
        Coprocessor OS Version   : 2.6.38.8+mpss3.7.1
        Device Serial Number     : ADKC60900153

    Board
        Vendor ID                : 0x8086
        Device ID                : 0x225c
        Subsystem ID             : 0x7d95
        Coprocessor Stepping ID  : 2
        PCIe Width               : x16
        PCIe Speed               : 5 GT/s
        PCIe Max payload size    : 256 bytes
        PCIe Max read req size   : 4096 bytes
        Coprocessor Model        : 0x01
        Coprocessor Model Ext    : 0x00
        Coprocessor Type         : 0x00
        Coprocessor Family       : 0x0b
        Coprocessor Family Ext   : 0x00
        Coprocessor Stepping     : C0
        Board SKU                : C0PRQ-7120 P/A/X/D
        ECC Mode                 : Enabled
        SMC HW Revision          : Product 300W Passive CS

    Cores
        Total No of Active Cores : 61
        Voltage                  : 0 uV
        Frequency                : 1238095 kHz

    Thermal
        Fan Speed Control        : N/A
        Fan RPM                  : N/A
        Fan PWM                  : N/A
        Die Temp                 : 34 C

    GDDR
        GDDR Vendor              : Samsung
        GDDR Version             : 0x6
        GDDR Density             : 4096 Mb
        GDDR Size                : 15872 MB
        GDDR Technology          : GDDR5
        GDDR Speed               : 5.500000 GT/s
        GDDR Frequency           : 2750000 kHz
        GDDR Voltage             : 1501000 uV

    Device No: 1, Device Name: mic1

    Version
        Flash Version            : 2.1.02.0391
        SMC Firmware Version     : 1.17.6900
        SMC Boot Loader Version  : 1.8.4326
        Coprocessor OS Version   : 2.6.38.8+mpss3.7.1
        Device Serial Number     : ADKC60900052

    Board
        Vendor ID                : 0x8086
        Device ID                : 0x225c
        Subsystem ID             : 0x7d95
        Coprocessor Stepping ID  : 2
        PCIe Width               : x16
        PCIe Speed               : 5 GT/s
        PCIe Max payload size    : 256 bytes
        PCIe Max read req size   : 4096 bytes
        Coprocessor Model        : 0x01
        Coprocessor Model Ext    : 0x00
        Coprocessor Type         : 0x00
        Coprocessor Family       : 0x0b
        Coprocessor Family Ext   : 0x00
        Coprocessor Stepping     : C0
        Board SKU                : C0PRQ-7120 P/A/X/D
        ECC Mode                 : Enabled
        SMC HW Revision          : Product 300W Passive CS

    Cores
        Total No of Active Cores : 61
        Voltage                  : 0 uV
        Frequency                : 1238095 kHz

    Thermal
        Fan Speed Control        : N/A
        Fan RPM                  : N/A
        Fan PWM                  : N/A
        Die Temp                 : 36 C

    GDDR
        GDDR Vendor              : Samsung
        GDDR Version             : 0x6
        GDDR Density             : 4096 Mb
        GDDR Size                : 15872 MB
        GDDR Technology          : GDDR5
        GDDR Speed               : 5.500000 GT/s
        GDDR Frequency           : 2750000 kHz
        GDDR Voltage             : 1501000 uV
NVIDIA CUDA¶
CUDA is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing – an approach known as GPGPU. The CUDA platform is a software layer that gives direct access to the GPU’s virtual instruction set and parallel computational elements, for the execution of compute kernels.
To use NVIDIA’s compiler suite, load the cuda module:
    module avail cuda

    ------------------- /apps/modulefiles/compilers ------------------
    cuda/6.5.14   cuda/7.0.28   cuda/7.5.18(default)

    module load cuda
    nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2015 NVIDIA Corporation
    Built on Tue_Aug_11_14:27:32_CDT_2015
    Cuda compilation tools, release 7.5, V7.5.17
Example output of the deviceQuery sample on a GPU node with 2 Tesla K40 cards:
    CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 2 CUDA Capable device(s)

    Device 0: "Tesla K40m"
      CUDA Driver Version / Runtime Version:         7.5 / 7.5
      CUDA Capability Major/Minor version number:    3.5
      Total amount of global memory:                 11520 MBytes (12079136768 bytes)
      (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
      GPU Max Clock rate:                            876 MHz (0.88 GHz)
      Memory Clock rate:                             3004 Mhz
      Memory Bus Width:                              384-bit
      L2 Cache Size:                                 1572864 bytes
      Maximum Texture Dimension Size (x,y,z):        1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers: 1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers: 2D=(16384, 16384), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Enabled
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
      Compute Mode:
        < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    Device 1: "Tesla K40m"
      CUDA Driver Version / Runtime Version:         7.5 / 7.5
      CUDA Capability Major/Minor version number:    3.5
      Total amount of global memory:                 11520 MBytes (12079136768 bytes)
      (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
      GPU Max Clock rate:                            876 MHz (0.88 GHz)
      Memory Clock rate:                             3004 Mhz
      Memory Bus Width:                              384-bit
      L2 Cache Size:                                 1572864 bytes
      Maximum Texture Dimension Size (x,y,z):        1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers: 1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers: 2D=(16384, 16384), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Enabled
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 131 / 0
      Compute Mode:
        < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    > Peer access from Tesla K40m (GPU0) -> Tesla K40m (GPU1) : No
    > Peer access from Tesla K40m (GPU1) -> Tesla K40m (GPU0) : No

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K40m
    Result = PASS
Debuggers¶
Suite | Debugger |
---|---|
INTEL | gdb-ia |
PGI | pgdbg |
GNU | gdb, ddd |
CUDA | cuda-gdb |
GDB¶
GDB, the GNU Project debugger, allows you to see what is going on “inside” another program while it executes, or what another program was doing at the moment it crashed.
Compile your code with debugging information:

    gcc [flags] -g [source_file] -o [output_file]
Start a session:

    gdb ./a.out
GDB commands¶
Command | Description |
---|---|
help | display a list of named classes of commands |
run | start the program |
attach | attach to a running process outside GDB |
step | go to the next source line, will step into a function/subroutine |
next | go to the next source line, function/subroutine calls are executed without stepping into them |
continue | continue executing |
break | sets breakpoint |
watch | set a watchpoint to stop execution when the value of a variable or an expression changes |
list | display (default 10) lines of source surrounding the current line |
print | display the value of a variable |
backtrace | displays a stack frame for each active subroutine |
detach | detach from a process |
quit | exit GDB |
To execute shell commands during the debugging session, prefix the command with shell, e.g.
(gdb) shell ls -l
GDB-IA¶
Debugger provided by Intel.
module load intel
PGDBG¶
PGDBG® is a graphical debugger for Linux, OS X and Windows capable of debugging serial and parallel programs including MPI process-parallel, OpenMP thread-parallel and hybrid MPI/OpenMP applications. PGDBG can debug programs on SMP workstations, servers, distributed-memory clusters and hybrid clusters where each node contains multiple 64-bit or 32-bit multicore processors.
module load pgi
DDD¶
GNU DDD is a graphical front-end for command-line debuggers such as GDB, DBX, WDB, Ladebug, JDB, XDB, the Perl debugger, the bash debugger bashdb, the GNU Make debugger remake, or the Python debugger pydb. Besides “usual” front-end features such as viewing source texts, DDD has become famous through its interactive graphical data display, where data structures are displayed as graphs. For more information (and more screenshots), see the DDD Manual.
CUDA-GDB¶
CUDA-GDB is NVIDIA’s extension of GDB for debugging CUDA applications. Load the cuda module to use it.
Performance Analysis¶
Performance Analysis Tools | Version |
---|---|
INTEL VTUNE | 2015 |
PGI pgprof | 2015 |
GNU gprof | 2.25 |
Scalasca | 2.2.2 |
mpiP | 3.4.1 |
nvprof | - |
GPROF¶
GNU profiler gprof
module load binutils
GNU gprof is a widely used profiling tool for Unix systems which produces an execution profile of C and Fortran programs. It can show the application call graph, which represents the calling relationships between functions in the program, and the percentage of total execution time spent in each function.
Compile and link your code with the -pg flag:

    gcc [flags] -g [source_file] -o [output_file] -pg
Invoke gprof to analyze and display the profiling results:

    gprof [options] [executable-file] gmon.out bb-data [yet-more-profile-data-files...] [> outfile]
Output Options

- --flat-profile : prints the total amount of time spent and the number of calls to each function
- --graph : prints the call-graph analysis from the application execution
- --annotated-source : prints profiling information next to the original source code
VTUNE Amplifier XE¶
Whether you are tuning for the first time or doing advanced performance optimization, Intel® VTune™ Amplifier XE provides the data needed to meet a wide variety of tuning needs. Collect a rich set of performance data for hotspots, threading, OpenCL, locks and waits, DirectX*, bandwidth, and more. But good data is not enough: you need tools to mine the data and make it easy to interpret. Powerful analysis lets you sort, filter, and visualize results on the timeline and on your source. Identify serial time and load imbalance. Select slow OpenMP instances and discover why they are slow.
module load intel
GUI¶
amplxe-gui
Please use the GUI only on login nodes to analyze your reports.
Command Line¶
You can use the command-line tool to analyze your program on compute nodes.
amplxe-cl
Check help information
amplxe-cl -help
amplxe-cl -help collect
Perform hotspot analysis
amplxe-cl -collect hotspots -result-dir mydir /home/test/myprogram
Check result summary
amplxe-cl -R summary -r mydir
SCALASCA¶
Scalasca is a software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks – in particular those concerning communication and synchronization – and offers guidance in exploring their causes.
module load scalasca/2.2.2
mpiP¶
mpiP is a lightweight profiling library for MPI applications. Because it only collects statistical information about MPI functions, mpiP generates considerably less overhead and much less data than tracing tools. All the information captured by mpiP is task-local. It only uses communication during report generation, typically at the end of the experiment, to merge results from all of the tasks into one output file.
module load mpip
nvprof¶
You can use nvprof to collect and view profiling data from the command line, or import the data into the visual profiler nvvp.
Command line nvprof¶
http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview
nvprof <GPU_EXECUTABLE>
Remote profiling with nvprof¶
http://docs.nvidia.com/cuda/profiler-users-guide/index.html#unique_307789860
nvprof --export-profile timeline.nvprof <GPU_EXECUTABLE>
To view the collected timeline data, the timeline.nvprof file can be imported into nvvp as described in Import Single-Process nvprof Session. See more at:
http://docs.nvidia.com/cuda/profiler-users-guide/index.html#import-session
MPI Profiling¶
http://docs.nvidia.com/cuda/profiler-users-guide/index.html#mpi-profiling
The nvprof profiler can be used to profile individual MPI processes.
srun nvprof -o output.%h.%p.%q{SLURM_PROCID} <GPU_EXECUTABLE>