Performance Analysis¶
Performance Analysis Tools | Version |
---|---|
Intel VTune | 2015, 2016, 2017, 2018 |
PGI pgprof | 2015, 2016, 2017 |
GNU gprof | 2.25, 2.26, 2.27, 2.28, 2.29 |
Scalasca | 2.2.2, 2.3.1 |
mpiP | 3.4.1 |
nvprof | 6.5.14, 7.0.28, 7.5.18, 8.0.27, 8.0.44, 8.0.61 |
GPROF¶
To use the GNU profiler gprof, load the binutils module:

module load binutils
GNU gprof is a widely used profiling tool for Unix systems which produces an execution profile of C and Fortran programs. It can show the application call graph, which represents the calling relationships between functions in the program, and the percentage of total execution time spent in each function.
Compile and link your code with the -pg flag.
gcc [flags] -g [source_file] -o [output_file] -pg
Invoke gprof to analyze and display the profiling results.
gprof options [executable-file] gmon.out bb-data [yet-more-profile-data-files...] [> outfile]
Output Options

- --flat-profile: prints the total amount of time spent and the number of calls to each function
- --graph: prints the call-graph analysis from the application execution
- --annotated-source: prints profiling information next to the original source code
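As a quick end-to-end illustration, the sketch below compiles an instrumented binary, runs it to produce gmon.out, and then prints the flat profile and the call graph. The file names myapp.c, flat_profile.txt, and call_graph.txt are only placeholders.

```
# build with profiling instrumentation enabled
gcc -O2 -g -pg myapp.c -o myapp

# running the program writes gmon.out into the current directory
./myapp

# print the flat profile and the call graph from the collected data
gprof --flat-profile ./myapp gmon.out > flat_profile.txt
gprof --graph ./myapp gmon.out > call_graph.txt
```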
VTUNE Amplifier XE¶
Whether you are tuning for the first time or doing advanced performance optimization, Intel® VTune™ Amplifier XE provides the data needed to meet a wide variety of tuning needs. Collect a rich set of performance data for hotspots, threading, OpenCL, locks and waits, DirectX*, bandwidth, and more. But good data is not enough. You need tools to mine the data and make it easy to interpret. Powerful analysis lets you sort, filter, and visualize results on the timeline and on your source. Identify serial time and load imbalance. Select slow OpenMP instances and discover why they are slow.
module load intel
GUI¶
amplxe-gui
Please use the GUI only on the login nodes to analyze your report.
Command Line¶
You can use the command-line tool to analyze your program on the compute nodes.
amplxe-cl
Check the help information
amplxe-cl -help
amplxe-cl -help collect
Perform a hotspot analysis
amplxe-cl -collect hotspots -result-dir mydir /home/test/myprogram
Check the result summary
amplxe-cl -R summary -r mydir
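For example, a minimal Slurm batch script for collecting and summarizing hotspot data on a compute node could look like the sketch below; the resource requests and ./myprogram are placeholders to adapt to your own run.

```
#!/bin/bash
#SBATCH --job-name=vtune-hotspots
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00

module load intel

# collect hotspot data on the compute node, then print the summary
amplxe-cl -collect hotspots -result-dir mydir ./myprogram
amplxe-cl -R summary -r mydir
```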
SCALASCA¶
Scalasca is a software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks – in particular those concerning communication and synchronization – and offers guidance in exploring their causes.
module load scalasca/2.2.2
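A typical Scalasca 2.x measurement, sketched below, has three steps: instrument the code at compile time (here via the Score-P compiler wrapper, assuming Score-P is installed alongside Scalasca on the system), run the instrumented executable under measurement control, and examine the resulting experiment archive. The compiler wrapper and file names are placeholders.

```
module load scalasca/2.2.2

# 1. instrument the application with the Score-P compiler wrapper
scorep mpif90 mycode.f -o mycode.x

# 2. run the instrumented executable under Scalasca measurement and analysis
scalasca -analyze srun mycode.x

# 3. examine the generated experiment archive (a scorep_* directory)
scalasca -examine scorep_mycode_*
```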
mpiP¶
mpiP is a lightweight profiling library for MPI applications. Because it only collects statistical information about MPI functions, mpiP generates considerably less overhead and much less data than tracing tools. All the information captured by mpiP is task-local. It only uses communication during report generation, typically at the end of the experiment, to merge results from all of the tasks into one output file.
Compile and Link with the mpiP library¶
module load mpiP
# gnu
mpif90 mycode.f -g -L$MPIPROOT/lib -lmpiP -lbfd -lunwind -o mycode.x
# intel
mpiifort mycode.f -g -L$MPIPROOT/lib -lmpiP -lbfd -lunwind -o mycode.x
In your Slurm batch script, simply run the executable
srun mycode.x
After completion, check the report file mycode.x.NPROCS.PID.mpiP (where NPROCS is the number of MPI tasks and PID the process ID).
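Putting it together, a minimal Slurm batch script might look like the following sketch; the resource requests and the executable name are placeholders for your own job.

```
#!/bin/bash
#SBATCH --job-name=mpip-test
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:20:00

module load mpiP

# run the mpiP-linked executable; the report file
# (mycode.x.NPROCS.PID.mpiP) is written when MPI_Finalize is called
srun mycode.x
```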
nvprof¶
You can use nvprof to collect and view profiling data from the command line, or import the data into the visual profiler nvvp.
Command line nvprof¶
http://docs.nvidia.com/cuda/profiler-users-guide/index.html#nvprof-overview
nvprof <GPU_EXECUTABLE>
Remote profiling with nvprof¶
http://docs.nvidia.com/cuda/profiler-users-guide/index.html#unique_307789860
nvprof --export-profile timeline.nvprof <GPU_EXECUTABLE>
To view the collected timeline data, import the timeline.nvprof file into nvvp as described in Import Single-Process nvprof Session:
http://docs.nvidia.com/cuda/profiler-users-guide/index.html#import-session
MPI Profiling¶
http://docs.nvidia.com/cuda/profiler-users-guide/index.html#mpi-profiling
The nvprof profiler can be used to profile individual MPI processes.
srun nvprof -o output.%h.%p.%q{SLURM_PROCID} <GPU_EXECUTABLE>
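For example, a batch job along the lines of the sketch below writes one profile file per MPI rank, named after the host, the process ID, and the Slurm rank. The resource requests are placeholders, and <GPU_EXECUTABLE> stands for your CUDA application as in the commands above.

```
#!/bin/bash
#SBATCH --job-name=nvprof-mpi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:15:00

# one profile file per rank, e.g. output.node01.12345.0.nvprof
srun nvprof -o output.%h.%p.%q{SLURM_PROCID}.nvprof <GPU_EXECUTABLE>

# afterwards, import the per-rank .nvprof files into nvvp on a login node
```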