
Member "scalasca-2.6/doc/patterns/scalasca_patterns.html" (19 Apr 2021, 179385 Bytes) of package /linux/misc/scalasca-2.6.tar.gz:



Performance properties

Time

Description:
Total time spent for program execution including the idle times of CPUs reserved for worker threads during OpenMP sequential execution. This pattern assumes that every thread of a process is allocated a separate CPU during the entire runtime of the process. Executions in a time-shared environment will also include time slices used by other processes. Over-subscription of processor cores (e.g., exploiting hardware threads) will also manifest as additional CPU allocation time.
Unit:
Seconds
Diagnosis:
Expand the metric tree hierarchy to break down total time into constituent parts which will help determine how much of it is due to local/serial computation versus MPI, OpenMP, or POSIX thread parallelization costs, and how much of that time is wasted waiting for other processes or threads due to ineffective load balance or due to insufficient parallelism.

Expand the call tree to identify important callpaths and routines where most time is spent, and examine the times for each process or thread to locate load imbalance.
Parent metric:
None
Sub-metrics:
Execution Time
Overhead Time
OpenMP Idle Threads Time

Visits

Description:
Number of times a call path has been visited. Visit counts for MPI routine call paths directly relate to the number of MPI Communication Operations and MPI Synchronization Operations. Visit counts for OpenMP operations and parallel regions (loops) directly relate to the number of times they were executed. Routines which were not instrumented, or were filtered during measurement, do not appear on recorded call paths. Similarly, routines are not shown if the compiler optimizer successfully in-lined them prior to automatic instrumentation.
Unit:
Counts
Diagnosis:
Call paths that are frequently visited (and thereby have high exclusive Visit counts) can be expected to have an important role in application execution performance (e.g., Execution Time). Very frequently executed routines, which are relatively short and quick to execute, may have an adverse impact on measurement quality. This can be due to instrumentation preventing in-lining and other compiler optimizations and/or overheads associated with measurement such as reading timers and hardware counters on routine entry and exit. When such routines consist solely of local/sequential computation (i.e., neither communication nor synchronization), they should be eliminated to improve the quality of the parallel measurement and analysis. One approach is to specify the names of such routines in a filter file for subsequent measurements to ignore, and thereby considerably reduce their measurement impact. Alternatively, selective instrumentation can be employed to entirely avoid instrumenting such routines and thereby remove all measurement impact. In both cases, uninstrumented and filtered routines will not appear in the measurement and analysis, much as if they had been "in-lined" into their calling routine.
Parent metric:
None
Sub-metrics:
None
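
The filter-file approach described in the diagnosis above can be sketched as follows. This is a minimal illustration and not part of Scalasca: the routine names and visit counts are made up, the visit threshold is an arbitrary assumption, and both helper functions are hypothetical. The block layout produced by make_filter follows the Score-P region-name filter syntax as understood here; check it against the Score-P documentation for the installed version.

```python
def select_filter_candidates(visits, pure_computation, min_visits):
    """Pick routines that are visited often and do purely local computation.

    visits           -- dict: routine name -> exclusive visit count (hypothetical data)
    pure_computation -- set of routine names with no communication/synchronization
    min_visits       -- assumed cut-off; choose it from your own measurements
    """
    return sorted(
        name for name, count in visits.items()
        if count >= min_visits and name in pure_computation
    )


def make_filter(routines):
    """Return filter-file text that excludes the given routines from measurement."""
    lines = ["SCOREP_REGION_NAMES_BEGIN", "  EXCLUDE"]
    lines += ["    " + name for name in routines]
    lines.append("SCOREP_REGION_NAMES_END")
    return "\n".join(lines) + "\n"


# Hypothetical example data: only "binvcrhs" is both frequently visited
# and purely computational, so only it is filtered.
visits = {"binvcrhs": 5_000_000, "exchange_halo": 2_000_000, "setup": 1}
candidates = select_filter_candidates(visits, {"binvcrhs", "setup"}, 1_000_000)
filter_text = make_filter(candidates)
```

The resulting text would be written to a file that is then named in the measurement configuration (with Score-P, typically through the SCOREP_FILTERING_FILE environment variable). Routines that communicate or synchronize, such as the hypothetical exchange_halo, are deliberately left instrumented.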

Execution Time

(only available after remapping)
Description:
Time spent on program execution but without the idle times of worker threads during OpenMP sequential execution and time spent on tasks related to trace generation. Includes time blocked in system calls (e.g., waiting for I/O to complete) and processor stalls (e.g., memory accesses).
Unit:
Seconds
Diagnosis:
A low fraction of execution time indicates a suboptimal measurement configuration leading to trace buffer flushes (see Overhead Time) or inefficient usage of the available hardware resources (see OpenMP Idle Threads Time).
Parent metric:
Time
Sub-metrics:
Computation Time
MPI Time
OpenMP Time
POSIX Threads Time
OpenACC Time
OpenCL Time
CUDA Time

Overhead Time

(only available after remapping)
Description:
Time spent performing major tasks related to measurement, such as creation of the experiment archive directory, clock synchronization, or dumping trace buffer contents to a file. Note that normal per-event overheads (such as event acquisition, reading timers and hardware counters, runtime call-path summarization, and storage in trace buffers) are not included.
Unit:
Seconds
Diagnosis:
Significant measurement overheads are typically incurred when measurement is initialized (e.g., in the program main routine or MPI_Init) and finalized (e.g., in MPI_Finalize), and are generally unavoidable. While they extend the total (wallclock) time for measurement, when they occur before parallel execution starts or after it completes, the quality of measurement of the parallel execution is not degraded. Trace file writing overhead time can be kept to a minimum by specifying an efficient parallel filesystem (when provided) for the experiment archive (e.g., SCOREP_EXPERIMENT_DIRECTORY=/work/mydir).

When measurement overhead is reported for other call paths, especially during parallel execution, measurement perturbation is considerable and interpretation of the resulting analysis much more difficult. A common cause of measurement overhead during parallel execution is the flushing of full trace buffers to disk: warnings issued by the measurement system indicate when this occurs. When flushing occurs simultaneously for all processes and threads, the associated perturbation is localized in time. More usually, buffer filling and flushing occurs independently at different times on each process/thread and the resulting perturbation is extremely disruptive, often forming a catastrophic chain reaction. It is highly advisable to avoid intermediate trace buffer flushes by appropriate instrumentation and measurement configuration, such as specifying a filter file listing purely computational routines (classified as type USR by scorep-score -r ) or an adequate trace buffer size (SCOREP_TOTAL_MEMORY larger than max_buf reported by scorep-score). If the maximum trace buffer capacity requirement remains too large for a full-size measurement, it may be necessary to configure the subject application with a smaller problem size or to perform fewer iterations/timesteps to shorten the measurement (and thereby reduce the size of the trace).
Parent metric:
Time
Sub-metrics:
None
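
The buffer-sizing advice in the diagnosis above (SCOREP_TOTAL_MEMORY larger than max_buf) can be sketched as a small helper. This is only an illustration under stated assumptions: the 10 percent headroom is an arbitrary safety margin rather than a documented recommendation, the function names are hypothetical, and the value is emitted as a plain byte count, so the accepted value syntax should be confirmed in the Score-P documentation.

```python
def total_memory_bytes(max_buf_bytes, headroom_percent=10):
    """Return a buffer size strictly larger than max_buf_bytes.

    headroom_percent is an assumed safety margin, not a documented default.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    if max_buf_bytes <= 0 or headroom_percent < 1:
        raise ValueError("need max_buf_bytes > 0 and headroom_percent >= 1")
    extra = (max_buf_bytes * headroom_percent + 99) // 100  # round up
    return max_buf_bytes + extra


def needs_smaller_run(max_buf_bytes, available_bytes, headroom_percent=10):
    """True if even the padded requirement exceeds the memory that can be spared."""
    return total_memory_bytes(max_buf_bytes, headroom_percent) > available_bytes


# Hypothetical max_buf of 200 MiB, as might be read from a scorep-score report
setting = {"SCOREP_TOTAL_MEMORY": str(total_memory_bytes(200 * 1024 * 1024))}
```

When needs_smaller_run returns True, the remaining options are the ones named above: filter more routines, reduce the problem size, or perform fewer iterations.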

Computation Time

(only available after remapping)
Description:
Time spent in computational parts of the application, excluding communication and synchronization overheads of parallelization libraries/language extensions such as MPI, OpenMP, or POSIX threads.
Unit:
Seconds
Diagnosis:
Expand the call tree to determine important callpaths and routines where most computation time is spent, and examine the time for each process or thread on those callpaths looking for significant variations which might indicate the origin of load imbalance.

Where computation time on each process/thread is unexpectedly high, profiling with PAPI preset or platform-specific hardware counters may help to understand the origin. Serial program profiling tools (e.g., gprof) may also be helpful. Generally, compiler optimization flags and optimized libraries should be investigated to improve serial performance, and where necessary alternative algorithms employed.
Parent metric:
Execution Time
Sub-metrics:
OpenCL Kernel Time
CUDA Kernel Time

MPI Time

(only available after remapping)
Description:
Time spent in (instrumented) MPI calls. Note that depending on the setting of the SCOREP_MPI_ENABLE_GROUPS environment variable, certain classes of MPI calls may have been excluded from measurement and therefore do not show up in the analysis report.
Unit:
Seconds
Diagnosis:
Expand the metric tree to determine which classes of MPI operation contribute the most time. Typically the remaining (exclusive) MPI Time, corresponding to instrumented MPI routines that are not in one of the child classes, will be negligible.
Parent metric:
Execution Time
Sub-metrics:
MPI Management Time
MPI Synchronization Time
MPI Communication Time
MPI File I/O Time

MPI Management Time

(only available after remapping)
Description:
Time spent in MPI calls related to management operations, such as MPI initialization and finalization, opening/closing of files used for MPI file I/O, or creation/deletion of various handles (e.g., communicators or RMA windows).
Unit:
Seconds
Diagnosis:
Expand the metric tree to determine which classes of MPI management operation contribute the most time. While some management costs are unavoidable, others can be decreased by improving load balance or reusing existing handles rather than repeatedly creating and deleting them.
Parent metric:
MPI Time
Sub-metrics:
MPI Init/Finalize Time
MPI Communicator Management Time
MPI File Management Time
MPI Window Management Time

MPI Init/Finalize Time

(only available after remapping)
Description:
Time spent in MPI initialization and finalization calls, i.e., MPI_Init or MPI_Init_thread and MPI_Finalize.
Unit:
Seconds
Diagnosis:
These are unavoidable one-off costs for MPI parallel programs, which can be expected to increase for larger numbers of processes. Some applications may not use all of the processes provided (or not use some of them for the entire execution), such that unused and wasted processes wait in MPI_Finalize for the others to finish. If the proportion of time in these calls is significant, it is probably more effective to use a smaller number of processes (or a larger amount of computation).
Parent metric:
MPI Management Time
Sub-metrics:
MPI Initialization Completion Time
Wait at MPI Finalize Time

MPI Initialization Completion Time

Description:
Time spent in MPI initialization after the first process has left the operation.


MPI Initialization Completion Example

Unit:
Seconds
Diagnosis:
Generally all processes can be expected to leave MPI initialization simultaneously, and any significant initialization completion time may indicate an inefficient MPI implementation or interference from other processes running on the same compute resources.
Parent metric:
MPI Init/Finalize Time
Sub-metrics:
None

Wait at MPI Finalize Time

Description:
Time spent waiting in front of MPI finalization, which is the time inside MPI_Finalize until the last process has reached finalization.


MPI Wait at Finalize Example

Unit:
Seconds
Diagnosis:
A large amount of waiting time at finalization can be an indication of load imbalance. Examine the waiting times for each process and try to distribute the preceding computation from processes with the shortest waiting times to those with the longest waiting times.
Parent metric:
MPI Init/Finalize Time
Sub-metrics:
None

MPI Communicator Management Time

(only available after remapping)
Description:
Time spent in MPI Communicator management routines such as creating and freeing communicators, Cartesian and graph topologies, and getting or setting communicator attributes.
Unit:
Seconds
Diagnosis:
There can be significant time in collective operations such as MPI_Comm_create, MPI_Comm_free and MPI_Cart_create that are considered neither explicit synchronization nor communication, but result in implicit barrier synchronization of participating processes. Avoidable waiting time for these operations will be reduced if all processes execute them simultaneously. If these are repeated operations, e.g., in a loop, it is worth investigating whether their frequency can be reduced by re-use.
Parent metric:
MPI Management Time
Sub-metrics:
None

MPI File Management Time

(only available after remapping)
Description:
Time spent in MPI file management routines such as opening, closing, deleting, or resizing files, seeking, syncing, and setting or retrieving file parameters or the process's view of the data in the file.
Unit:
Seconds
Diagnosis:
Collective file management calls (see MPI Collective File Operations) may suffer from wait states due to load imbalance. Examine the times spent in collective management routines for each process and try to distribute the preceding computation from processes with the shortest times to those with the longest times.
Parent metric:
MPI Management Time
Sub-metrics:
None

MPI Window Management Time

(only available after remapping)
Description:
Time spent in MPI window management routines such as creating and freeing memory windows and getting or setting window attributes.
Unit:
Seconds
Parent metric:
MPI Management Time
Sub-metrics:
MPI Wait at Window Create Time
MPI Wait at Window Free Time

MPI Synchronization Time

(only available after remapping)
Description:
Time spent in MPI explicit synchronization calls, such as barriers and remote memory access window synchronization. Time in point-to-point message transfers with no payload data used for coordination is currently part of MPI Point-to-point Communication Time.
Unit:
Seconds
Diagnosis:
Expand the metric tree further to determine the proportion of time in different classes of MPI synchronization operations. Expand the calltree to identify which callpaths are responsible for the most synchronization time. Also examine the distribution of synchronization time on each participating process for indication of load imbalance in preceding code.
Parent metric:
MPI Time
Sub-metrics:
MPI Collective Synchronization Time
MPI One-sided Synchronization Time

MPI Collective Synchronization Time

(only available after remapping)
Description:
Total time spent in MPI barriers.
Unit:
Seconds
Diagnosis:
When the time for MPI explicit barrier synchronization is significant, expand the call tree to determine which MPI_Barrier calls are responsible, and compare with their Visits count to see how frequently they were executed. Barrier synchronizations which are not necessary for correctness should be removed. It may also be appropriate to use a communicator containing fewer processes, or a number of point-to-point messages for coordination instead. Also examine the distribution of time on each participating process for indication of load imbalance in preceding code.
Parent metric:
MPI Synchronization Time
Sub-metrics:
Wait at MPI Barrier Time
MPI Barrier Completion Time

Wait at MPI Barrier Time

Description:
Time spent waiting in front of an MPI barrier, which is the time inside the barrier call until the last process has reached the barrier.


MPI Wait at Barrier Example

Note that Scalasca does not yet analyze non-blocking barriers introduced with MPI v3.0.
Unit:
Seconds
Diagnosis:
A large amount of waiting time at barriers can be an indication of load imbalance. Examine the waiting times for each process and try to distribute the preceding computation from processes with the shortest waiting times to those with the longest waiting times.
Parent metric:
MPI Collective Synchronization Time
Sub-metrics:
None

MPI Barrier Completion Time

Description:
Time spent in MPI barriers after the first process has left the operation.


MPI Barrier Completion Example

Note that Scalasca does not yet analyze non-blocking barriers introduced with MPI v3.0.
Unit:
Seconds
Diagnosis:
Generally all processes can be expected to leave MPI barriers simultaneously, and any significant barrier completion time may indicate an inefficient MPI implementation or interference from other processes running on the same compute resources.
Parent metric:
MPI Collective Synchronization Time
Sub-metrics:
None
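
The two barrier patterns, Wait at MPI Barrier Time and MPI Barrier Completion Time, can be illustrated together for a single barrier instance. This is a minimal sketch of the definitions given above, not Scalasca's analysis code: it assumes perfectly synchronized timestamps, and the function name and list-based interface are hypothetical.

```python
def barrier_times(enter, leave):
    """Per-process (wait, completion) times for one barrier instance.

    enter[i], leave[i] -- timestamps at which process i entered/left the barrier
    wait[i]       = latest enter - enter[i]    (waiting for the last arrival)
    completion[i] = leave[i] - earliest leave  (time after the first departure)
    """
    if not enter or len(enter) != len(leave):
        raise ValueError("enter and leave must be non-empty and of equal length")
    last_enter = max(enter)
    first_leave = min(leave)
    if first_leave < last_enter:
        raise ValueError("no process can leave a barrier before all have entered")
    wait = [last_enter - t for t in enter]
    completion = [t - first_leave for t in leave]
    return wait, completion
```

With enter times [1.0, 3.0, 2.0] and leave times [3.5, 3.5, 4.0], process 0 waits longest (2.0 seconds), which points at process 1 as the late arrival, while only process 2 shows completion time (0.5 seconds).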

MPI Communication Time

(only available after remapping)
Description:
Time spent in MPI communication calls, including point-to-point, collective, and one-sided communication.
Unit:
Seconds
Diagnosis:
Expand the metric tree further to determine the proportion of time in different classes of MPI communication operations. Expand the calltree to identify which callpaths are responsible for the most communication time. Also examine the distribution of communication time on each participating process for indication of communication imbalance or load imbalance in preceding code.
Parent metric:
MPI Time
Sub-metrics:
MPI Point-to-point Communication Time
MPI Collective Communication Time
MPI One-sided Communication Time

MPI Point-to-point Communication Time

(only available after remapping)
Description:
Total time spent in MPI point-to-point communication calls. Note that this covers only the time spent in the sending and receiving calls themselves, not the message transmission time.
Unit:
Seconds
Diagnosis:
Investigate whether communication time is commensurate with the number of MPI Communication Operations and MPI Bytes Transferred. Consider replacing blocking communication with non-blocking communication that can potentially be overlapped with computation, or using persistent communication to amortize message setup costs for common transfers. Also consider the mapping of processes onto compute resources, especially if there are notable differences in communication time for particular processes, which might indicate longer/slower transmission routes or network congestion.
Parent metric:
MPI Communication Time
Sub-metrics:
MPI Late Sender Time
MPI Late Receiver Time

MPI Late Sender Time

Description:
Time lost waiting in a blocking receive operation (e.g., MPI_Recv or MPI_Wait) that is posted earlier than the corresponding send operation.


MPI Late Sender Example

If the receiving process is waiting for multiple messages to arrive (e.g., in a call to MPI_Waitall), the maximum waiting time is accounted, i.e., the waiting time due to the latest sender.
Unit:
Seconds
Diagnosis:
Try to replace MPI_Recv with a non-blocking receive MPI_Irecv that can be posted earlier, proceed concurrently with computation, and complete with a wait operation after the message is expected to have been sent. Try to post sends earlier, such that they are available when receivers need them. Note that outstanding messages (i.e., sent before the receiver is ready) will occupy internal message buffers, and that large numbers of posted receive buffers will also introduce message management overhead, therefore moderation is advisable.
Parent metric:
MPI Point-to-point Communication Time
Sub-metrics:
MPI Late Sender, Wrong Order Time
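
The Late Sender definition above, including the rule for calls that wait on several messages, can be written down as a short function. This is a simplified sketch rather than Scalasca's implementation: it assumes synchronized timestamps, does not cap the result at the duration of the receive call, and the function name is hypothetical.

```python
def late_sender_wait(recv_enter, send_enters):
    """Waiting time of a blocking receive-side call.

    recv_enter  -- time the receiver entered MPI_Recv / MPI_Wait / MPI_Waitall
    send_enters -- times at which the matching send operations were entered
                   (one element for MPI_Recv, several for MPI_Waitall)

    Only the latest sender counts, so waiting times are not summed.
    """
    if not send_enters:
        raise ValueError("at least one matching send is required")
    return max(0.0, max(send_enters) - recv_enter)
```

A receive posted at 1.0 whose matching sends start at 2.0, 4.0, and 3.0 is charged 3.0 seconds; a receive posted after its send has started is charged nothing.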

MPI Late Sender, Wrong Order Time

Description:
A Late Sender situation may be the result of messages that are received in the wrong order. If a process expects messages from one or more processes in a certain order, although these processes are sending them in a different order, the receiver may need to wait for a message if it tries to receive a message early that has been sent late.

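This pattern comes in two variants, distinguished by whether the messages involved were sent from different source locations or from the same source location. See the descriptions of the corresponding specializations for more details.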
Unit:
Seconds
Diagnosis:
Check the proportion of MPI Point-to-point Receive Communication Operations that are MPI Late Sender Instances (Communications). Swap the order of receiving from different sources to match the most common ordering.
Parent metric:
MPI Late Sender Time
Sub-metrics:
MPI Late Sender, Wrong Order Time / Different Sources
MPI Late Sender, Wrong Order Time / Same Source
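
One plausible reading of the wrong-order condition is sketched below: a Late Sender receive counts as wrong order if a message that the receiver consumes later had already been sent earlier than the late message. This interpretation, the tuple-based interface, and the function name are assumptions of this example; Scalasca's actual detection logic may differ.

```python
def wrong_order_variants(receives, late_index):
    """Classify a Late Sender receive as wrong-order, if applicable.

    receives   -- list of (source_rank, send_time) tuples in the order the
                  receiver consumes the messages
    late_index -- position of the receive that suffered Late Sender waiting

    Returns the set of variants that apply (empty if the order was fine).
    A message consumed later but sent earlier than the late one means the
    receiver could have taken that message first instead of waiting.
    """
    late_source, late_sent = receives[late_index]
    variants = set()
    for source, sent in receives[late_index + 1:]:
        if sent < late_sent:
            variants.add("same source" if source == late_source
                         else "different sources")
    return variants
```

For instance, with receives [(1, 5.0), (2, 1.0)] and late_index 0, the message from rank 2 was available long before the one from rank 1 arrived, so the result is {"different sources"}.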

MPI Late Sender, Wrong Order Time / Different Sources

Description:
This specialization of the Late Sender, Wrong Order pattern refers to wrong order situations due to messages received from different source locations.


MPI Late Sender, wrong order (different source) Example

Unit:
Seconds
Diagnosis:
Check the proportion of MPI Point-to-point Receive Communication Operations that are MPI Late Sender, Wrong Order Instances (Communications). Swap the order of receiving from different sources to match the most common ordering. Consider using the wildcard MPI_ANY_SOURCE to receive (and process) messages as they arrive from any source rank.
Parent metric:
MPI Late Sender, Wrong Order Time
Sub-metrics:
None

MPI Late Sender, Wrong Order Time / Same Source

Description:
This specialization of the Late Sender, Wrong Order pattern refers to wrong order situations due to messages received from the same source location.


MPI Late Sender, wrong order (same source) Example

Unit:
Seconds
Diagnosis:
Swap the order of receiving to match the order messages are sent, or swap the order of sending to match the order they are expected to be received. Consider using the wildcard MPI_ANY_TAG to receive (and process) messages in the order they arrive from the source.
Parent metric:
MPI Late Sender, Wrong Order Time
Sub-metrics:
None

MPI Late Receiver Time

Description:
A send operation may be blocked until the corresponding receive operation is called. This pattern refers to the time spent waiting as a result of this situation.


MPI Late Receiver Example

Unit:
Seconds
Diagnosis:
Check the proportion of MPI Point-to-point Send Communication Operations that are MPI Late Receiver Instances (Communications). The MPI implementation may be working in synchronous mode by default, in which case explicit use of asynchronous non-blocking sends can be tried. If the size of the message to be sent exceeds the available MPI internal buffer space then the operation will be blocked until the data can be transferred to the receiver: some MPI implementations allow larger internal buffers or different thresholds to be specified. Also consider the mapping of processes onto compute resources, especially if there are notable differences in communication time for particular processes, which might indicate longer/slower transmission routes or network congestion.
Parent metric:
MPI Point-to-point Communication Time
Sub-metrics:
None
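
The interaction between message size, internal buffering, and Late Receiver waiting described in the diagnosis above can be illustrated with a deliberately simple model. The single eager_limit_bytes threshold is an assumption of this sketch: real MPI implementations use implementation-specific protocols and thresholds, and the function name is hypothetical.

```python
def late_receiver_wait(send_enter, recv_enter, message_bytes, eager_limit_bytes):
    """Modelled blocking time of a send under a simple eager/rendezvous model.

    Messages up to eager_limit_bytes are assumed to be buffered by the MPI
    library, so the send returns without waiting. Larger messages are assumed
    to block until the matching receive has been posted.
    """
    if message_bytes <= eager_limit_bytes:
        return 0.0
    return max(0.0, recv_enter - send_enter)
```

Under this model a 4096-byte send entered at 1.0 with its receive posted at 4.0 waits 3.0 seconds when the threshold is 1024 bytes, whereas a 100-byte message would not wait at all. This is why raising buffer sizes or thresholds, where the MPI implementation allows it, can reduce this metric.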

MPI Collective Communication Time

(only available after remapping)
Description:
Total time spent in MPI collective communication calls.
Unit:
Seconds
Diagnosis:
As the number of participating MPI processes increases (i.e., ranks in MPI_COMM_WORLD or a subcommunicator), time in collective communication can be expected to increase correspondingly. Part of the increase will be due to additional data transmission requirements, which are generally similar for all participants. A significant part is typically time some (often many) processes are blocked waiting for the last of the required participants to reach the collective operation. This may be indicated by significant variation in collective communication time across processes, but is most conclusively quantified from the child metrics determinable via automatic trace pattern analysis.

Since basic transmission cost per byte for collectives can be relatively high, combining several collective operations of the same type each with small amounts of data (e.g., a single value per rank) into fewer operations with larger payloads using either a vector/array of values or aggregate datatype may be beneficial. (Overdoing this and aggregating very large message payloads is counter-productive due to explicit and implicit memory requirements, and MPI protocol switches for messages larger than an eager transmission threshold.)

MPI implementations generally provide optimized collective communication operations; however, in rare cases it may be appropriate to replace a collective communication operation provided by the MPI implementation with a customized implementation of your own using point-to-point operations. For example, certain MPI implementations of MPI_Scan include unnecessary synchronization of all participating processes, or asynchronous variants of collective operations may be preferable to fully synchronous ones where they permit overlapping of computation.
Parent metric:
MPI Communication Time
Sub-metrics:
MPI Early Reduce Time
MPI Early Scan Time
MPI Late Broadcast Time
MPI Wait at N x N Time
MPI N x N Completion Time
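
The benefit of combining several small collective operations into one, as suggested in the diagnosis above, can be estimated with a simple cost model in which each operation pays a fixed start-up cost plus a per-byte cost. The model and its parameters (latency, cost_per_byte) are assumptions of this sketch; it ignores the protocol switches and memory effects that make very large payloads counter-productive, and the function names are hypothetical.

```python
def collective_cost(payload_bytes, latency, cost_per_byte):
    """Modelled cost of one collective: fixed start-up plus per-byte transfer."""
    return latency + cost_per_byte * payload_bytes


def aggregation_saving(count, payload_bytes, latency, cost_per_byte):
    """Modelled time saved by merging `count` equal collectives into one."""
    separate = count * collective_cost(payload_bytes, latency, cost_per_byte)
    combined = collective_cost(count * payload_bytes, latency, cost_per_byte)
    return separate - combined
```

In this model the saving is always (count - 1) times the start-up cost, independent of payload size, which is why aggregation pays off most for operations carrying only a single value per rank.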

MPI Early Reduce Time

Description:
Collective communication operations that send data from all processes to one destination process (i.e., n-to-1) may suffer from waiting times if the destination process enters the operation earlier than any of its sending counterparts. This pattern refers to the time lost on the root rank as a result of this situation, accounting for the waiting time due to the latest sending process. It applies to the MPI calls MPI_Reduce, MPI_Gather and MPI_Gatherv.


MPI Early Reduce Example

Note that Scalasca does not yet analyze non-blocking collectives introduced with MPI v3.0.
Unit:
Seconds
Parent metric:
MPI Collective Communication Time
Sub-metrics:
None

MPI Early Scan Time

Description:
MPI_Scan or MPI_Exscan operations may suffer from waiting times if the process with rank n enters the operation earlier than its sending counterparts (i.e., ranks 0..n-1). This pattern refers to the time lost as a result of this situation.


MPI Early Scan Example

Note that Scalasca does not yet analyze non-blocking collectives introduced with MPI v3.0.
Unit:
Seconds
Parent metric:
MPI Collective Communication Time
Sub-metrics:
None
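
The Early Scan condition can be sketched for one scan instance as follows. This follows the description above (rank n waits for ranks 0..n-1) under the assumption of synchronized timestamps; it is not Scalasca's implementation, and the function name is hypothetical.

```python
def early_scan_wait(enter):
    """Per-rank Early Scan waiting times for one scan instance.

    enter[n] -- time at which rank n entered the scan operation.
    Rank n is charged the time until the latest of ranks 0..n-1 has entered;
    rank 0 has no lower-ranked counterparts and never waits.
    """
    waits = []
    latest_lower = None  # latest enter time among the ranks seen so far
    for t in enter:
        if latest_lower is None:
            waits.append(0.0)
            latest_lower = t
        else:
            waits.append(max(0.0, latest_lower - t))
            latest_lower = max(latest_lower, t)
    return waits
```

With enter times [3.0, 1.0, 5.0, 2.0], rank 1 waits 2.0 seconds for rank 0 and rank 3 waits 3.0 seconds for rank 2, while ranks 0 and 2 do not wait.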

MPI Late Broadcast Time

Description:
Collective communication operations that send data from one source process to all processes (i.e., 1-to-n) may suffer from waiting times if destination processes enter the operation earlier than the source process, that is, before any data could have been sent. This pattern refers to the time lost as a result of this situation. It applies to the MPI calls MPI_Bcast, MPI_Scatter and MPI_Scatterv.


MPI Late Broadcast Example

Note that Scalasca does not yet analyze non-blocking collectives introduced with MPI v3.0.
Unit:
Seconds
Parent metric:
MPI Collective Communication Time
Sub-metrics:
None
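
The Late Broadcast condition can be sketched in the same spirit for a single 1-to-n operation. The sketch assumes synchronized timestamps and charges each destination only the time until the root enters; the function name and interface are hypothetical and not taken from Scalasca.

```python
def late_broadcast_wait(enter, root):
    """Per-rank Late Broadcast waiting times for one 1-to-n operation.

    enter[r] -- time at which rank r entered the operation
    root     -- rank of the source process

    A destination that enters before the root waits until the root enters;
    the root itself and destinations that enter later are charged nothing.
    """
    root_enter = enter[root]
    return [
        0.0 if rank == root else max(0.0, root_enter - t)
        for rank, t in enumerate(enter)
    ]
```

For enter times [4.0, 1.0, 5.0] with root 0, rank 1 is charged 3.0 seconds and rank 2 nothing.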

MPI Wait at N x N Time

Description:
Collective communication operations that send data from all processes to all processes (i.e., n-to-n) exhibit an inherent synchronization among all participants, that is, no process can finish the operation until the last process has started it. This pattern covers the time spent in n-to-n operations until all processes have reached it. It applies to the MPI calls MPI_Reduce_scatter, MPI_Reduce_scatter_block, MPI_Allgather, MPI_Allgatherv, MPI_Allreduce and MPI_Alltoall.


MPI Wait at N x N Example

Note that the time reported by this pattern is not necessarily completely waiting time since some processes could – at least theoretically – already communicate with each other while others have not yet entered the operation.

Also note that Scalasca does not yet analyze non-blocking and neighborhood collectives introduced with MPI v3.0.
Unit:
Seconds
Parent metric:
MPI Collective Communication Time
Sub-metrics:
None

MPI N x N Completion Time

Description:
This pattern refers to the time spent in MPI n-to-n collectives after the first process has left the operation.


MPI N x N Completion Example

Note that the time reported by this pattern is not necessarily completely waiting time since some processes could – at least theoretically – still communicate with each other while others have already finished communicating and exited the operation.

Also note that Scalasca does not yet analyze non-blocking and neighborhood collectives introduced with MPI v3.0.
Unit:
Seconds
Parent metric:
MPI Collective Communication Time
Sub-metrics:
None

MPI File I/O Time

(only available after remapping)
Description:
Time spent in MPI file I/O calls.
Unit:
Seconds
Diagnosis:
Expand the metric tree further to determine the proportion of time in different classes of MPI file I/O operations. Expand the calltree to identify which callpaths are responsible for the most file I/O time. Also examine the distribution of MPI file I/O time on each process for indication of load imbalance. Use a parallel filesystem (such as /work) when possible, and check that appropriate hint values have been associated with the MPI_Info object of MPI files.
Parent metric:
MPI Time
Sub-metrics:
MPI Individual File I/O Time
MPI Collective File I/O Time

MPI Individual File I/O Time

(only available after remapping)
Description:
Time spent in individual MPI file I/O calls.
Unit:
Seconds
Diagnosis:
Expand the calltree to identify which callpaths are responsible for the most individual file I/O time. When multiple processes read and write to files, MPI collective file reads and writes can be more efficient. Examine the number of MPI Individual File Read Operations and MPI Individual File Write Operations to locate potential opportunities for collective I/O.
Parent metric:
MPI File I/O Time
Sub-metrics:
None

MPI Collective File I/O Time

(only available after remapping)
Description:
Time spent in collective MPI file I/O calls.
Unit:
Seconds
Diagnosis:
Expand the calltree to identify which callpaths are responsible for the most collective file I/O time. Examine the distribution of times on each participating process for indication of imbalance in the operation itself or in preceding code. Examine the number of MPI Collective File Read Operations and MPI Collective File Write Operations done by each process as a possible origin of imbalance. Where asynchrony or imbalance prevents effective use of collective file I/O, individual (i.e., non-collective) file I/O may be preferable.
Parent metric:
MPI File I/O Time
Sub-metrics:
None

OpenMP Idle Threads Time

(only available after remapping)
Description:
Idle time on CPUs that may be reserved for teams of threads when the process is executing sequentially before and after OpenMP parallel regions, or with less than the full team within OpenMP parallel regions.


OMP Idle Threads Example

Unit:
Seconds
Diagnosis:
On shared compute resources, unused threads may simply sleep and allow the resources to be used by other applications; however, on dedicated compute resources (or where unused threads busy-wait and thereby occupy the resources) their idle time is charged to the application. According to Amdahl's Law, the fraction of inherently serial execution time limits the effectiveness of employing additional threads to reduce the execution time of parallel regions. Where the Idle Threads Time is significant, total Time (and wall-clock execution time) may be reduced by effective parallelization of sections of code which execute serially. Alternatively, the proportion of wasted Idle Threads Time will be reduced by running with fewer threads, albeit resulting in a longer wall-clock execution time but more effective usage of the allocated compute resources.
Parent metric:
Time
Sub-metrics:
OpenMP Limited Parallelism Time
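
The reference to Amdahl's Law in the diagnosis above can be made concrete with a short calculation. The first function is the standard formula; the second is a simplified accounting of idle time during serial phases that assumes every worker thread holds its own CPU for the whole run, as stated in the Time description. Both function names are hypothetical.

```python
def amdahl_speedup(serial_fraction, threads):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0 or threads < 1:
        raise ValueError("need 0 <= serial_fraction <= 1 and threads >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


def idle_threads_time(serial_time, threads):
    """CPU time charged to idle workers while only the master thread runs."""
    if serial_time < 0 or threads < 1:
        raise ValueError("need serial_time >= 0 and threads >= 1")
    return serial_time * (threads - 1)
```

For a program that is half serial, two threads give a speedup of at most about 1.33, and 10 seconds of serial execution with four threads account for 30 seconds of idle threads time. Running with fewer threads lowers the second number at the cost of a smaller speedup.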

OpenMP Limited Parallelism Time

(only available after remapping)
Description:
Idle time on CPUs that may be reserved for threads within OpenMP parallel regions where not all of the thread team participates.


OMP Limited parallelism Example

Unit:
Seconds
Diagnosis:
Code sections marked as OpenMP parallel regions which are executed serially (i.e., only by the master thread) or by less than the full team of threads, can result in allocated but unused compute resources being wasted. Typically this arises from insufficient work being available within the marked parallel region to productively employ all threads. This may be because the loop contains too few iterations or the OpenMP runtime has determined that additional threads would not be productive. Alternatively, the OpenMP omp_set_num_threads API or num_threads or if clauses may have been explicitly specified, e.g., to reduce parallel execution overheads such as OpenMP Thread Management Time or OpenMP Synchronization Time. If the proportion of OpenMP Limited Parallelism Time is significant, it may be more efficient to run with fewer threads for that problem size.
Parent metric:
OpenMP Idle Threads Time
Sub-metrics:
None
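
The accounting behind this metric can be sketched as follows: for every execution of a parallel region, each thread of the full team that does not participate contributes the region's duration as idle time. The tuple-based input and the function name are assumptions of this example, not Scalasca's data model.

```python
def limited_parallelism_time(regions, team_size):
    """Idle CPU time from parallel regions that use fewer threads than available.

    regions   -- list of (duration, active_threads), one entry per execution
                 of a parallel region
    team_size -- number of threads in the full team
    """
    total = 0.0
    for duration, active in regions:
        if not 1 <= active <= team_size:
            raise ValueError("active_threads must be between 1 and team_size")
        total += duration * (team_size - active)
    return total
```

With a team of four threads, regions [(2.0, 4), (3.0, 1), (1.0, 2)] yield 11.0 seconds: nothing for the fully parallel region, 9.0 for the region run by the master thread alone, and 2.0 for the region run by two threads.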

OpenMP Time

(only available after remapping)
Description:
Time spent in OpenMP API calls and code generated by the OpenMP compiler. In particular, this includes thread team management and synchronization activities.
Unit:
Seconds
Diagnosis:
Expand the metric tree to determine which classes of OpenMP activities contribute the most time.
Parent metric:
Execution Time
Sub-metrics:
OpenMP Thread Management Time
OpenMP Synchronization Time
OpenMP Flush Time

OpenMP Thread Management Time

Description:
Time spent managing teams of threads, creating and initializing them when forking a new parallel region and clearing up afterwards when joining.


[Figure: OMP Management example]

Unit:
Seconds
Diagnosis:
Management overhead for an OpenMP parallel region depends on the number of threads to be employed and the number of variables to be initialized and saved for each thread, each time the parallel region is executed. Typically a pool of threads is used by the OpenMP runtime system to avoid forking and joining threads in each parallel region; however, threads from the pool still need to be added to the team and assigned tasks to perform according to the specified schedule. When the overhead is a significant proportion of the time for executing the parallel region, it is worth investigating whether several parallel regions can be combined to amortize thread management overheads. Alternatively, it may be appropriate to reduce the number of threads either for the entire execution or only for this parallel region (e.g., via num_threads or if clauses).
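The amortization argument can be sketched with a small cost model in Python. The fixed per-region overhead `fork_join_s`, the ideal work split, and the function names are hypothetical simplifications chosen for illustration; real management overhead varies with the runtime and the number of threads.

```python
def region_cost(work_s, n_threads, fork_join_s):
    """Time for one parallel region: ideal work split plus a fixed
    management overhead (fork_join_s is a hypothetical constant)."""
    return work_s / n_threads + fork_join_s


def separate_vs_combined(work_items, n_threads, fork_join_s):
    """Compare one parallel region per work item with a single region
    that covers all of the work."""
    separate = sum(region_cost(w, n_threads, fork_join_s) for w in work_items)
    combined = region_cost(sum(work_items), n_threads, fork_join_s)
    return separate, combined


sep, comb = separate_vs_combined([0.004] * 100, n_threads=4, fork_join_s=0.001)
print(f"separate regions: {sep:.3f} s, combined region: {comb:.3f} s")
```

With these made-up numbers the management overhead is paid 100 times in the first case and once in the second, so it accounts for half of the time of the separate regions but only a small part of the combined one.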
Parent metric:
OpenMP Time
Sub-metrics:
OpenMP Thread Team Fork Time

OpenMP Thread Team Fork Time

Description:
Time spent creating and initializing teams of threads.


[Figure: OMP Fork example]

Unit:
Seconds
Parent metric:
OpenMP Thread Management Time
Sub-metrics:
None

OpenMP Synchronization Time

(only available after remapping)
Description:
Time spent in OpenMP synchronization, whether barriers or mutual exclusion via ordered sequentialization, critical sections, atomics or lock API calls.
Unit:
Seconds
Parent metric:
OpenMP Time
Sub-metrics:
OpenMP Barrier Synchronization Time
OpenMP Critical Synchronization Time
OpenMP Lock API Synchronization Time
OpenMP Ordered Synchronization Time
OpenMP Taskwait Synchronization Time

OpenMP Barrier Synchronization Time

(only available after remapping)
Description:
Time spent in implicit (compiler-generated) or explicit (user-specified) OpenMP barrier synchronization. Note that during measurement implicit barriers are treated similarly to explicit ones. The instrumentation procedure replaces an implicit barrier with an explicit barrier enclosed by the parallel construct. This is done by adding a nowait clause and a barrier directive as the last statement of the parallel construct. In cases where the implicit barrier cannot be removed (i.e., at the end of a parallel region), the explicit barrier is executed in front of the implicit barrier, which will then be negligible because the thread team will already be synchronized when reaching it. The synthetic explicit barrier appears as a special implicit barrier construct.
Unit:
Seconds
Parent metric:
OpenMP Synchronization Time
Sub-metrics:
OpenMP Explicit Barrier Synchronization Time
OpenMP Implicit Barrier Synchronization Time

OpenMP Explicit Barrier Synchronization Time

(only available after remapping)
Description:
Time spent in explicit (i.e., user-specified) OpenMP barrier synchronization, both waiting for other threads (Wait at Explicit OpenMP Barrier Time) and inherent barrier processing overhead.
Unit:
Seconds
Diagnosis:
Locate the most costly barrier synchronizations and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis). Consider replacing an explicit barrier with a potentially more efficient construct, such as a critical section or atomic, or use explicit locks. Examine the time that each thread spends waiting at each explicit barrier, and try to re-distribute preceding work to improve load balance.
Parent metric:
OpenMP Barrier Synchronization Time
Sub-metrics:
Wait at Explicit OpenMP Barrier Time

Wait at Explicit OpenMP Barrier Time

Description:
Time spent in explicit (i.e., user-specified) OpenMP barrier synchronization waiting for the last thread.
Unit:
Seconds
Diagnosis:
A large amount of waiting time at barriers can be an indication of load imbalance. Examine the waiting times for each thread and try to distribute the preceding computation from threads with the shortest waiting times to those with the longest waiting times.
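A minimal Python sketch of how per-thread waiting times relate to arrival times at a barrier is given here. It assumes that a thread's wait is simply the gap between its own arrival and that of the last thread and ignores barrier processing overhead; the arrival times are invented example values.

```python
def barrier_wait_times(arrival_times):
    """Waiting time per thread = arrival of the last thread minus own arrival."""
    last = max(arrival_times)
    return [last - t for t in arrival_times]


arrivals = [1.0, 1.5, 4.0, 1.25]   # hypothetical per-thread arrival times (s)
waits = barrier_wait_times(arrivals)
# The thread with the shortest wait arrived last, i.e., had the most work.
latest = waits.index(min(waits))
print(waits, "-> thread", latest, "carries the most preceding work")
```

The thread with zero waiting time is the one that delays all others, so it is the natural candidate to hand work to the threads that waited longest.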
Parent metric:
OpenMP Explicit Barrier Synchronization Time
Sub-metrics:
None

OpenMP Implicit Barrier Synchronization Time

(only available after remapping)
Description:
Time spent in implicit (i.e., compiler-generated) OpenMP barrier synchronization, both waiting for other threads (Wait at Implicit OpenMP Barrier Time) and inherent barrier processing overhead.
Unit:
Seconds
Diagnosis:
Examine the time that each thread spends waiting at each implicit barrier, and if there is a significant imbalance then investigate whether a schedule clause is appropriate. Note that dynamic and guided schedules may require more OpenMP Thread Management Time than static schedules. Consider whether it is possible to employ the nowait clause to reduce the number of implicit barrier synchronizations.
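To illustrate why a schedule clause can reduce waiting at the implicit barrier of a worksharing loop, the following Python sketch simulates a block-wise static distribution and a greedy dynamic distribution of loop iterations with uneven costs. It is a simplified model, not the behaviour of any particular OpenMP runtime: dispatch overhead is assumed to be zero, and the function names and the triangular workload are illustrative assumptions.

```python
import heapq


def static_schedule(costs, n_threads):
    """Contiguous blocks of (almost) equal iteration count per thread."""
    base, extra = divmod(len(costs), n_threads)
    finish, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        finish.append(sum(costs[start:start + size]))
        start += size
    return finish


def dynamic_schedule(costs, n_threads):
    """Each iteration goes to the thread that becomes free first
    (chunk size 1, zero dispatch overhead assumed)."""
    heap = [(0.0, t) for t in range(n_threads)]
    heapq.heapify(heap)
    for c in costs:
        busy_until, t = heapq.heappop(heap)
        heapq.heappush(heap, (busy_until + c, t))
    finish = [0.0] * n_threads
    for busy_until, t in heap:
        finish[t] = busy_until
    return finish


def total_wait(finish):
    """Summed waiting time at the barrier that ends the loop."""
    return sum(max(finish) - f for f in finish)


costs = [float(i) for i in range(1, 17)]   # iteration i costs i time units
for name, sched in (("static", static_schedule), ("dynamic", dynamic_schedule)):
    finish = sched(costs, 4)
    print(name, finish, "barrier wait:", total_wait(finish))
```

Because the model charges nothing for handing out iterations, it shows only the load-balance benefit; the additional cost of dynamic and guided schedules noted in the diagnosis has to be weighed against it.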
Parent metric:
OpenMP Barrier Synchronization Time
Sub-metrics:
Wait at Implicit OpenMP Barrier Time

Wait at Implicit OpenMP Barrier Time

Description:
Time spent in implicit (i.e., compiler-generated) OpenMP barrier synchronization waiting for the last thread.
Unit:
Seconds
Diagnosis:
A large amount of waiting time at barriers can be an indication of load imbalance. Examine the waiting times for each thread and try to distribute the preceding computation from threads with the shortest waiting times to those with the longest waiting times.
Parent metric:
OpenMP Implicit Barrier Synchronization Time
Sub-metrics:
None

OpenMP Critical Synchronization Time

(only available after remapping)
Description:
Time spent waiting to enter OpenMP critical sections and in atomics, where mutual exclusion restricts access to a single thread at a time.
Unit:
Seconds
Diagnosis:
Locate the most costly critical sections and atomics and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis).
Parent metric:
OpenMP Synchronization Time
Sub-metrics:
OpenMP Critical Contention Time

OpenMP Critical Contention Time

Description:
The time lost waiting before entering a critical section while another thread is still inside the section.


[Figure: OMP Critical Contention example]

Unit:
Seconds
Diagnosis:
A large amount of waiting time can be an indication of too much balance, since all threads arrive at the critical section almost at the same time. Examine the waiting times for each thread and try to distribute the preceding computation on the threads to allow a staggered arrival at the critical section.
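The effect of simultaneous versus staggered arrival can be sketched in Python as follows. The model assumes that threads enter the critical section in order of arrival and each hold it for the same time `hold_s`; OpenMP does not guarantee such an ordering, and all names and numbers are illustrative.

```python
def critical_wait_times(arrivals, hold_s):
    """Serve threads first-come-first-served through a critical section that
    each holds for hold_s seconds; return each thread's waiting time."""
    order = sorted(range(len(arrivals)), key=lambda i: arrivals[i])
    waits = [0.0] * len(arrivals)
    free_at = 0.0
    for i in order:
        enter = max(arrivals[i], free_at)   # wait while the section is busy
        waits[i] = enter - arrivals[i]
        free_at = enter + hold_s
    return waits


balanced = critical_wait_times([5.0, 5.0, 5.0, 5.0], hold_s=1.0)
staggered = critical_wait_times([5.0, 6.0, 7.0, 8.0], hold_s=1.0)
print("simultaneous arrival:", balanced, "staggered arrival:", staggered)
```

When arrivals are spaced by at least the hold time, no thread waits in this model, whereas perfectly balanced threads queue up behind one another.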
Parent metric:
OpenMP Critical Synchronization Time
Sub-metrics:
None

OpenMP Lock API Synchronization Time

(only available after remapping)
Description:
Time spent in OpenMP lock API calls.
Unit:
Seconds
Diagnosis:
Locate the most costly usage of locks and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis). Consider re-writing the algorithm to use lock-free data structures.
Parent metric:
OpenMP Synchronization Time
Sub-metrics:
OpenMP Lock API Contention Time

OpenMP Lock API Contention Time

Description:
The time lost waiting for an explicit lock to be acquired while another thread still holds the corresponding lock.


[Figure: OMP Lock API Contention example]

Unit:
Seconds
Diagnosis:
A large amount of waiting time can be an indication of too much balance, since all threads try to acquire the lock almost at the same time. Examine the waiting times for each thread and try to distribute the preceding computation on the threads to allow a staggered arrival at the lock.
Parent metric:
OpenMP Lock API Synchronization Time
Sub-metrics:
None

OpenMP Ordered Synchronization Time

(only available after remapping)
Description:
Time spent waiting to enter OpenMP ordered regions due to enforced sequentialization of loop iteration execution order in the region.
Unit:
Seconds
Diagnosis:
Locate the most costly ordered regions and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis).
Parent metric:
OpenMP Synchronization Time
Sub-metrics:
None

OpenMP Taskwait Synchronization Time

(only available after remapping)
Description:
Time spent in OpenMP taskwait directives, waiting for child tasks to finish.
Unit:
Seconds
Parent metric:
OpenMP Synchronization Time
Sub-metrics:
None

OpenMP Flush Time

(only available after remapping)
Description:
Time spent in OpenMP flush directives.
Unit:
Seconds
Parent metric:
OpenMP Time
Sub-metrics:
None

POSIX Threads Time

(only available after remapping)
Description:
Time spent in instrumented POSIX threads API calls. In particular, this includes thread management and synchronization activities.
Unit:
Seconds
Diagnosis:
Expand the metric tree to determine which classes of POSIX thread activities contribute the most time.
Parent metric:
Execution Time
Sub-metrics:
POSIX Threads Management Time
POSIX Threads Synchronization Time

POSIX Threads Management Time

(only available after remapping)
Description:
Time spent managing (i.e., creating, joining, cancelling, etc.) POSIX threads.
Unit:
Seconds
Diagnosis:
Excessive POSIX threads management time in pthread_join indicates load imbalance which causes wait states in the joining threads waiting for the other thread to finish. Examine the join times and try to re-distribute the computation in the corresponding worker threads to achieve a better load balance.

Also, correlate the thread management time to the Visits of management routines. If visit counts are high, consider using a thread pool to reduce the number of thread management operations.
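The thread pool idea can be illustrated with Python's standard `concurrent.futures.ThreadPoolExecutor`, used here only as an analogy: a POSIX threads program would typically create its workers once with pthread_create and feed them tasks through a queue. The helper names `work` and `run_with_pool` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def work(x):
    # Return the identity of the worker thread together with the result.
    return threading.get_ident(), x * x


def run_with_pool(n_tasks, n_workers):
    """Run n_tasks small tasks on at most n_workers reusable threads."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(work, range(n_tasks)))
    thread_ids = {tid for tid, _ in results}
    values = [v for _, v in results]
    return thread_ids, values


ids, values = run_with_pool(n_tasks=1000, n_workers=4)
print(len(ids), "threads served", len(values), "tasks")
```

Creating one thread per task would instead incur 1000 create and join operations, which is the kind of visit count on management routines that this diagnosis suggests looking for.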
Parent metric:
POSIX Threads Time
Sub-metrics:
None

POSIX Threads Synchronization Time

(only available after remapping)
Description:
Time spent in POSIX threads synchronization calls, i.e., mutex and condition variable operations.
Unit:
Seconds
Diagnosis:
Expand the metric tree further to determine the proportion of time in different classes of POSIX thread synchronization operations. Expand the calltree to identify which callpaths are responsible for the most synchronization time. Also examine the distribution of synchronization time on each participating thread for indication of lock contention effects.
Parent metric:
POSIX Threads Time
Sub-metrics:
POSIX Threads Mutex API Synchronization Time
POSIX Threads Condition API Synchronization Time

POSIX Threads Mutex API Synchronization Time

(only available after remapping)
Description:
Time spent in POSIX threads mutex API calls.
Unit:
Seconds
Diagnosis:
Locate the most costly usage of mutex operations and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis). Consider re-writing the algorithm to use lock-free data structures.
Parent metric:
POSIX Threads Synchronization Time
Sub-metrics:
POSIX Threads Mutex Contention Time

POSIX Threads Mutex Contention Time

Description:
Time lost waiting for a mutex lock to be acquired while another thread still holds the corresponding lock.


[Figure: Pthread Mutex Lock Contention example]

Unit:
Seconds
Diagnosis:
A large amount of waiting time in mutex locks can be an indication of too much balance, since many threads try to acquire locks almost at the same time. Examine the waiting times for each thread and try to distribute the preceding computation on the threads to allow a staggered arrival at the lock API call.
Parent metric:
POSIX Threads Mutex API Synchronization Time
Sub-metrics:
None

POSIX Threads Condition API Synchronization Time

(only available after remapping)
Description:
Time spent in POSIX threads condition API calls.
Unit:
Seconds
Diagnosis:
Locate the most costly usage of condition operations and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis). Consider re-writing the algorithm to use data structures without the need for condition variables.
Parent metric:
POSIX Threads Synchronization Time
Sub-metrics:
POSIX Threads Condition Contention Time

POSIX Threads Condition Contention Time

Description:
Time lost waiting for a mutex lock to be acquired in a condition API call while another thread still holds the corresponding lock.


[Figure: Pthread Condition Lock Contention example]

Unit:
Seconds
Diagnosis:
A large amount of waiting time in condition operations can be an indication of imbalance. Examine the waiting times for each thread and try to distribute the preceding computation, in particular the work of the threads responsible for fulfilling the condition.
Parent metric:
POSIX Threads Condition API Synchronization Time
Sub-metrics:
None

OpenACC Time

(only available after remapping)
Description:
Time spent in the OpenACC run-time system, API calls and on device. If the OpenACC implementation is based on CUDA, and OpenACC and CUDA support are both enabled during measurement, the CUDA activities from within OpenACC will be accounted separately (just like CUDA calls within MPI and other metric hierarchies).
Unit:
Seconds
Parent metric:
Execution Time
Sub-metrics:
OpenACC Initialization/Finalization Time
OpenACC Memory Management Time
OpenACC Synchronization Time
OpenACC Kernel Launch Time

OpenACC Initialization/Finalization Time

(only available after remapping)
Description:
Time needed to initialize and finalize OpenACC and OpenACC kernels.
Unit:
Seconds
Parent metric:
OpenACC Time
Sub-metrics:
None

OpenACC Memory Management Time

(only available after remapping)
Description:
Time spent on memory management including data transfer from host to device and vice versa.
Unit:
Seconds
Parent metric:
OpenACC Time
Sub-metrics:
None

OpenACC Synchronization Time

(only available after remapping)
Description:
Time spent on OpenACC synchronization.
Unit:
Seconds
Parent metric:
OpenACC Time
Sub-metrics:
None

OpenACC Kernel Launch Time

(only available after remapping)
Description:
Time spent to launch OpenACC kernels.
Unit:
Seconds
Parent metric:
OpenACC Time
Sub-metrics:
None

OpenCL Kernel Time

(only available after remapping)
Description:
Time spent executing OpenCL kernels.
Unit:
Seconds
Parent metric:
Computation Time
Sub-metrics:
None

OpenCL Time

(only available after remapping)
Description:
Time spent in the OpenCL run-time system, API and on device.
Unit:
Seconds
Parent metric:
Execution Time
Sub-metrics:
OpenCL General Management Time
OpenCL Memory Management Time
OpenCL Synchronization Time
OpenCL Kernel Launch Time

OpenCL General Management Time

(only available after remapping)
Description:
Time needed for general OpenCL setup, e.g. initialization, device and event control, etc.
Unit:
Seconds
Parent metric:
OpenCL Time
Sub-metrics:
None

OpenCL Memory Management Time

(only available after remapping)
Description:
Time spent on memory management including data transfer from host to device and vice versa.
Unit:
Seconds
Parent metric:
OpenCL Time
Sub-metrics:
None

OpenCL Synchronization Time

(only available after remapping)
Description:
Time spent on OpenCL synchronization.
Unit:
Seconds
Parent metric:
OpenCL Time
Sub-metrics:
None

OpenCL Kernel Launch Time

(only available after remapping)
Description:
Time spent to launch OpenCL kernels.
Unit:
Seconds
Parent metric:
OpenCL Time
Sub-metrics:
None

CUDA Kernel Time

(only available after remapping)
Description:
Time spent executing CUDA kernels.
Unit:
Seconds
Parent metric:
Computation Time
Sub-metrics:
None

CUDA Time

(only available after remapping)
Description:
Time spent in the CUDA run-time system, API calls and on device.
Unit:
Seconds
Parent metric:
Execution Time
Sub-metrics:
CUDA General Management Time
CUDA Memory Management Time
CUDA Synchronization Time
CUDA Kernel Launch Time

CUDA General Management Time

(only available after remapping)
Description:
Time needed for general CUDA setup, e.g. initialization, control of version, device, primary context, context, streams, events, occupancy, etc.
Unit:
Seconds
Parent metric:
CUDA Time
Sub-metrics:
None

CUDA Memory Management Time

(only available after remapping)
Description:
Time spent on memory management including data transfer from host to device and vice versa. Note that "memset" operations are considered in CUDA Kernel Launch Time.
Unit:
Seconds
Parent metric:
CUDA Time
Sub-metrics:
None

CUDA Synchronization Time

(only available after remapping)
Description:
Time spent on CUDA synchronization.
Unit:
Seconds
Parent metric:
CUDA Time
Sub-metrics:
None

CUDA Kernel Launch Time

(only available after remapping)
Description:
Time spent to launch CUDA kernels, including "memset" operations.
Unit:
Seconds
Parent metric:
CUDA Time
Sub-metrics:
None

MPI Synchronization Operations

(only available after remapping)
Description:
Provides the total number of MPI synchronization operations that were executed. This includes not only barrier calls, but also communication operations which transfer no data (i.e., zero-sized messages are considered to be used for coordination).
Unit:
Counts
Parent metric:
None
Sub-metrics:
MPI Point-to-point Synchronization Operations
MPI Collective Synchronizations
MPI One-sided Synchronization Operations

MPI Point-to-point Synchronization Operations

(only available after remapping)
Description:
Total number of MPI point-to-point synchronization operations, i.e., point-to-point transfers of zero-sized messages used for coordination.
Unit:
Counts
Diagnosis:
Locate the most costly synchronizations and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis).
Parent metric:
MPI Synchronization Operations
Sub-metrics:
MPI Point-to-point Send Synchronization Operations
MPI Point-to-point Receive Synchronization Operations

MPI Point-to-point Send Synchronization Operations

Description:
Number of MPI point-to-point synchronization operations sending a zero-sized message.
Unit:
Counts
Parent metric:
MPI Point-to-point Synchronization Operations
Sub-metrics:
MPI Late Receiver Instances (Synchronizations)

MPI Point-to-point Receive Synchronization Operations

Description:
Number of MPI point-to-point synchronization operations receiving a zero-sized message.
Unit:
Counts
Parent metric:
MPI Point-to-point Synchronization Operations
Sub-metrics:
MPI Late Sender Instances (Synchronizations)

MPI Collective Synchronizations

Description:
The number of MPI collective synchronization operations. This includes not only barrier calls, but also calls to collective communication operations that neither send nor receive any data. Each process participating in the operation is counted, as defined by the associated MPI communicator.
Unit:
Counts
Diagnosis:
Locate synchronizations with the largest MPI Collective Synchronization Time and determine whether they are necessary to ensure correctness or could be safely removed (based on algorithm analysis). Collective communication operations that neither send nor receive data, yet are required for synchronization, can be replaced with the more efficient MPI_Barrier.
Parent metric:
MPI Synchronization Operations
Sub-metrics:
None

MPI One-sided Synchronization Operations

(only available after remapping)
Description:
Total number of MPI one-sided synchronization operations.
Unit:
Counts
Parent metric:
MPI Synchronization Operations
Sub-metrics:
MPI One-sided Active Target Synchronization Operations
MPI One-sided Passive Target Synchronization Operations

MPI One-sided Active Target Synchronization Operations

(only available after remapping)
Description:
Total number of MPI one-sided active target synchronization operations.
Unit:
Counts
Parent metric:
MPI One-sided Synchronization Operations
Sub-metrics:
None

MPI One-sided Passive Target Synchronization Operations

(only available after remapping)
Description:
Total number of MPI one-sided passive target synchronization operations.
Unit:
Counts
Parent metric:
MPI One-sided Synchronization Operations
Sub-metrics:
None

MPI Communication Operations

(only available after remapping)
Description:
Total number of MPI communication operations, excluding calls transferring no payload data (which are considered MPI Synchronization Operations).
Unit:
Counts
Parent metric:
None
Sub-metrics:
MPI Point-to-point Communication Operations
MPI Collective Communications
MPI One-sided Communication Operations

MPI Point-to-point Communication Operations

(only available after remapping)
Description:
Total number of MPI point-to-point communication operations, excluding calls transferring zero-sized messages.
Unit:
Counts
Parent metric:
MPI Communication Operations
Sub-metrics:
MPI Point-to-point Send Communication Operations
MPI Point-to-point Receive Communication Operations

MPI Point-to-point Send Communication Operations

Description:
Number of MPI point-to-point send operations, excluding calls transferring zero-sized messages.
Unit:
Counts
Parent metric:
MPI Point-to-point Communication Operations
Sub-metrics:
MPI Late Receiver Instances (Communications)

MPI Point-to-point Receive Communication Operations

Description:
Number of MPI point-to-point receive operations, excluding calls transferring zero-sized messages.
Unit:
Counts
Parent metric:
MPI Point-to-point Communication Operations
Sub-metrics:
MPI Late Sender Instances (Communications)

MPI Collective Communications

(only available after remapping)
Description:
The number of MPI collective communication operations, excluding calls neither sending nor receiving any data. Each process participating in the operation is counted, as defined by the associated MPI communicator.
Unit:
Counts
Diagnosis:
Locate operations with the largest MPI Collective Communication Time and compare MPI Collective Bytes Transferred. Where multiple collective operations of the same type are used in series with single values or small payloads, aggregation may be beneficial in amortizing transfer overhead.
Parent metric:
MPI Communication Operations
Sub-metrics:
MPI Collective Exchange Communications
MPI Collective Communications as Source
MPI Collective Communications as Destination

MPI Collective Exchange Communications

Description:
The number of MPI collective communication operations which are both sending and receiving data. In addition to all-to-all and scan operations, root processes of certain collectives transfer data from their source to destination buffer.
Unit:
Counts
Parent metric:
MPI Collective Communications
Sub-metrics:
None

MPI Collective Communications as Source

Description:
The number of MPI collective communication operations that are only sending but not receiving data. Examples are the non-root processes in gather and reduction operations.
Unit:
Counts
Parent metric:
MPI Collective Communications
Sub-metrics:
None

MPI Collective Communications as Destination

Description:
The number of MPI collective communication operations that are only receiving but not sending data. Examples are broadcasts and scatters (for ranks other than the root).
Unit:
Counts
Parent metric:
MPI Collective Communications
Sub-metrics:
None

MPI One-sided Communication Operations

(only available after remapping)
Description:
Total number of MPI one-sided communication operations.
Unit:
Counts
Parent metric:
MPI Communication Operations
Sub-metrics:
MPI One-sided Put Communication Operations
MPI One-sided Get Communication Operations
MPI One-sided Atomic Communication Operations

MPI One-sided Put Communication Operations

(only available after remapping)
Description:
Total number of MPI one-sided put communication operations.
Unit:
Counts
Parent metric:
MPI One-sided Communication Operations
Sub-metrics:
None

MPI One-sided Get Communication Operations

(only available after remapping)
Description:
Total number of MPI one-sided get communication operations.
Unit:
Counts
Parent metric:
MPI One-sided Communication Operations
Sub-metrics:
None

MPI One-sided Atomic Communication Operations

(only available after remapping)
Description:
Total number of MPI one-sided atomic communication operations.
Unit:
Counts
Parent metric:
MPI One-sided Communication Operations
Sub-metrics:
None

MPI Bytes Transferred

(only available after remapping)
Description:
The total number of bytes that were notionally processed in MPI communication and synchronization operations (i.e., the sum of the bytes that were sent and received). Note that the actual number of bytes transferred is typically not determinable, as this is dependent on the MPI internal implementation, including message transfer and failed delivery recovery protocols.
Unit:
Bytes
Diagnosis:
Expand the metric tree to break down the bytes transferred into constituent classes. Expand the call tree to identify where most data is transferred and examine the distribution of data transferred by each process.
Parent metric:
None
Sub-metrics:
MPI Point-to-point Bytes Transferred
MPI Collective Bytes Transferred
MPI One-Sided Bytes Transferred

MPI Point-to-point Bytes Transferred

(only available after remapping)
Description:
The total number of bytes that were notionally processed by MPI point-to-point communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to identify where the most data is transferred using point-to-point communication and examine the distribution of data transferred by each process. Compare with the number of MPI Point-to-point Communication Operations and resulting MPI Point-to-point Communication Time.

Average message size can be determined by dividing by the number of MPI Point-to-point Communication Operations (for all call paths or for particular call paths or communication operations). Instead of large numbers of small communications streamed to the same destination, it may be more efficient to pack data into fewer larger messages (e.g., using MPI datatypes). Very large messages may require a rendezvous between sender and receiver to ensure sufficient transmission and receipt capacity before sending commences: try splitting large messages into smaller ones that can be transferred asynchronously and overlapped with computation. (Some MPI implementations allow tuning of the rendezvous threshold and/or transmission capacity, e.g., via environment variables.)
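The average message size calculation can be written as a short Python sketch. The call-path names and the byte and operation counts are invented example values standing in for numbers read from an analysis report; the helper `average_message_size` is hypothetical.

```python
def average_message_size(bytes_transferred, n_operations):
    """Average bytes per operation; None when no operations were recorded."""
    if n_operations == 0:
        return None
    return bytes_transferred / n_operations


# Hypothetical (bytes transferred, operation count) pairs per call path.
callpaths = {
    "solver/exchange_halo": (8_388_608, 1_024),
    "io/send_status": (4_096, 1_024),
}
for path, (nbytes, nops) in callpaths.items():
    print(path, average_message_size(nbytes, nops), "bytes per operation")
```

A call path with many operations but only a few bytes per operation, like the second entry, is a candidate for packing data into fewer, larger messages.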
Parent metric:
MPI Bytes Transferred
Sub-metrics:
MPI Point-to-point Bytes Sent
MPI Point-to-point Bytes Received

MPI Point-to-point Bytes Sent

Description:
The number of bytes that were notionally sent using MPI point-to-point communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is sent using point-to-point communication operations and examine the distribution of data sent by each process. Compare with the number of MPI Point-to-point Send Communication Operations and resulting MPI Point-to-point Communication Time.

If the aggregate MPI Point-to-point Bytes Received is less than the amount sent, some messages were cancelled, received into buffers which were too small, or simply not received at all. (Generally only aggregate values can be compared, since sends and receives take place on different callpaths and on different processes.) Sending more data than is received wastes network bandwidth. Applications do not conform to the MPI standard when they do not receive all messages that are sent, and the unreceived messages degrade performance by consuming network bandwidth and/or occupying message buffers. Cancelling send operations is typically expensive, since it usually generates one or more internal messages.
Parent metric:
MPI Point-to-point Bytes Transferred
Sub-metrics:
None

MPI Point-to-point Bytes Received

Description:
The number of bytes that were notionally received using MPI point-to-point communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is received using point-to-point communication and examine the distribution of data received by each process. Compare with the number of MPI Point-to-point Receive Communication Operations and resulting MPI Point-to-point Communication Time.

If the aggregate MPI Point-to-point Bytes Sent is greater than the amount received, some messages were cancelled, received into buffers which were too small, or simply not received at all. (Generally only aggregate values can be compared, since sends and receives take place on different callpaths and on different processes.) Applications do not conform to the MPI standard when they do not receive all messages that are sent, and the unreceived messages degrade performance by consuming network bandwidth and/or occupying message buffers. Cancelling receive operations may be necessary where speculative asynchronous receives are employed; however, managing the associated requests also involves some overhead.
Parent metric:
MPI Point-to-point Bytes Transferred
Sub-metrics:
None

MPI Collective Bytes Transferred

(only available after remapping)
Description:
The total number of bytes that were notionally processed in MPI collective communication operations. This assumes that collective communications are implemented naively using point-to-point communications, e.g., a broadcast being implemented as sends to each member of the communicator (including the root itself). Note that effective MPI implementations use optimized algorithms and/or special hardware, such that the actual number of bytes transferred may be very different.
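The naive accounting for a broadcast can be sketched in Python as follows. This is one reading of the description, with the root notionally sending one copy to every member (itself included) and every member receiving one copy; the function name and the message size are illustrative assumptions, not Scalasca's implementation.

```python
def naive_broadcast_bytes(msg_bytes, comm_size, root=0):
    """Per-rank (outgoing, incoming) byte counts for a broadcast modelled as
    one send from the root to every member, the root included."""
    counts = []
    for rank in range(comm_size):
        outgoing = msg_bytes * comm_size if rank == root else 0
        incoming = msg_bytes           # every member receives one copy
        counts.append((outgoing, incoming))
    return counts


counts = naive_broadcast_bytes(msg_bytes=1024, comm_size=4)
total_out = sum(out for out, _ in counts)
total_in = sum(inc for _, inc in counts)
print(counts, "outgoing:", total_out, "incoming:", total_in)
```

An optimized implementation (for example, a tree-based broadcast) would move data along different paths, which is why these notional counts may differ from the bytes actually transferred.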
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is transferred using collective communication and examine the distribution of data transferred by each process. Compare with the number of MPI Collective Communications and resulting MPI Collective Communication Time.
Parent metric:
MPI Bytes Transferred
Sub-metrics:
MPI Collective Bytes Outgoing
MPI Collective Bytes Incoming

MPI Collective Bytes Outgoing

Description:
The number of bytes that were notionally sent by MPI collective communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is transferred using collective communication and examine the distribution of data outgoing from each process.
Parent metric:
MPI Collective Bytes Transferred
Sub-metrics:
None

MPI Collective Bytes Incoming

Description:
The number of bytes that were notionally received by MPI collective communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is transferred using collective communication and examine the distribution of data incoming to each process.
Parent metric:
MPI Collective Bytes Transferred
Sub-metrics:
None

MPI One-Sided Bytes Transferred

(only available after remapping)
Description:
The number of bytes that were notionally processed in MPI one-sided communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is transferred using one-sided communication and examine the distribution of data transferred by each process. Compare with the number of MPI One-sided Communication Operations and resulting MPI One-sided Communication Time.
Parent metric:
MPI Bytes Transferred
Sub-metrics:
MPI One-sided Bytes Sent
MPI One-sided Bytes Received

MPI One-sided Bytes Sent

Description:
The number of bytes that were notionally sent in MPI one-sided communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is transferred using one-sided communication and examine the distribution of data sent by each process.
Parent metric:
MPI One-Sided Bytes Transferred
Sub-metrics:
None

MPI One-sided Bytes Received

Description:
The number of bytes that were notionally received in MPI one-sided communication operations.
Unit:
Bytes
Diagnosis:
Expand the calltree to see where the most data is transferred using one-sided communication and examine the distribution of data received by each process.
Parent metric:
MPI One-Sided Bytes Transferred
Sub-metrics:
None

MPI Late Receiver Instances (Synchronizations)

Description:
Provides the total number of Late Receiver instances (see MPI Late Receiver Time for details) found in MPI point-to-point synchronization operations (i.e., zero-sized message transfers).
Unit:
Counts
Parent metric:
MPI Point-to-point Send Synchronization Operations
Sub-metrics:
None

MPI Late Sender Instances (Synchronizations)

Description:
Provides the total number of Late Sender instances (see MPI Late Sender Time for details) found in MPI point-to-point synchronization operations (i.e., zero-sized message transfers).
Unit:
Counts
Parent metric:
MPI Point-to-point Receive Synchronization Operations
Sub-metrics:
MPI Late Sender, Wrong Order Instances (Synchronizations)

MPI Late Sender, Wrong Order Instances (Synchronizations)

Description:
Provides the total number of Late Sender instances found in MPI point-to-point synchronization operations (i.e., zero-sized message transfers) where messages are received in wrong order (see also MPI Late Sender, Wrong Order Time).
Unit:
Counts
Parent metric:
MPI Late Sender Instances (Synchronizations)
Sub-metrics:
None

MPI Late Receiver Instances (Communications)

Description:
Provides the total number of Late Receiver instances (see MPI Late Receiver Time for details) found in MPI point-to-point communication operations.
Unit:
Counts
Parent metric:
MPI Point-to-point Send Communication Operations
Sub-metrics:
None

MPI Late Sender Instances (Communications)

Description:
Provides the total number of Late Sender instances (see MPI Late Sender Time for details) found in MPI point-to-point communication operations.
Unit:
Counts
Parent metric:
MPI Point-to-point Receive Communication Operations
Sub-metrics:
MPI Late Sender, Wrong Order Instances (Communications)

MPI Late Sender, Wrong Order Instances (Communications)

Description:
Provides the total number of Late Sender instances found in MPI point-to-point communication operations where messages are received in wrong order (see also MPI Late Sender, Wrong Order Time).
Unit:
Counts
Parent metric:
MPI Late Sender Instances (Communications)
Sub-metrics:
None

MPI File Operations

(only available after remapping)
Description:
Number of MPI file operations of any type.
Unit:
Counts
Diagnosis:
Expand the metric tree to see the breakdown of different classes of MPI file operation, expand the calltree to see where they occur, and look at the distribution of operations done by each process.
Parent metric:
None
Sub-metrics:
MPI Individual File Operations
MPI Collective File Operations

MPI Individual File Operations

(only available after remapping)
Description:
Number of individual MPI file operations.
Unit:
Counts
Diagnosis:
Examine the distribution of individual MPI file operations done by each process and compare with the corresponding MPI File Management Time and MPI Individual File I/O Time.
Parent metric:
MPI File Operations
Sub-metrics:
MPI Individual File Read Operations
MPI Individual File Write Operations

MPI Individual File Read Operations

(only available after remapping)
Description:
Number of individual MPI file read operations.
Unit:
Counts
Diagnosis:
Examine the callpaths where individual MPI file reads occur and the distribution of operations done by each process in them, and compare with the corresponding MPI Individual File I/O Time.
Parent metric:
MPI Individual File Operations
Sub-metrics:
None

MPI Individual File Write Operations

(only available after remapping)
Description:
Number of individual MPI file write operations.
Unit:
Counts
Diagnosis:
Examine the callpaths where individual MPI file writes occur and the distribution of operations done by each process in them, and compare with the corresponding MPI Individual File I/O Time.
Parent metric:
MPI Individual File Operations
Sub-metrics:
None

MPI Collective File Operations

(only available after remapping)
Description:
Number of collective MPI file operations.
Unit:
Counts
Diagnosis:
Examine the distribution of collective MPI file operations done by each process and compare with the corresponding MPI File Management Time and MPI Collective File I/O Time.
Parent metric:
MPI File Operations
Sub-metrics:
MPI Collective File Read Operations
MPI Collective File Write Operations

MPI Collective File Read Operations

(only available after remapping)
Description:
Number of collective MPI file read operations.
Unit:
Counts
Diagnosis:
Examine the callpaths where collective MPI file reads occur and the distribution of operations done by each process in them, and compare with the corresponding MPI Collective File I/O Time.
Parent metric:
MPI Collective File Operations
Sub-metrics:
None

MPI Collective File Write Operations

(only available after remapping)
Description:
Number of collective MPI file write operations.
Unit:
Counts
Diagnosis:
Examine the callpaths where collective MPI file writes occur and the distribution of operations done by each process in them, and compare with the corresponding MPI Collective File I/O Time.
Parent metric:
MPI Collective File Operations
Sub-metrics:
None

MPI Wait at Window Create Time

Description:
Time spent waiting in MPI_Win_create for the last process to join in the collective creation of an MPI window handle.
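As a rough illustration of this kind of waiting time, the sketch below assumes that each process simply waits from its own entry into the collective call until the last process has entered. The function name `wait_at_collective` and the timestamps are hypothetical, and the actual trace analysis may differ in detail (for example, in how exit timestamps are taken into account).

```python
def wait_at_collective(enter_times):
    """Waiting time per process until the last process enters the call.

    enter_times: hypothetical enter timestamps (seconds), one per process.
    Assumption: a process waits from its own entry until the latest entry.
    """
    last = max(enter_times)
    return [last - t for t in enter_times]


# Three processes entering MPI_Win_create at different times.
waits = wait_at_collective([1.0, 1.5, 3.0])
print(waits)  # [2.0, 1.5, 0.0]
```

The last process to arrive has no waiting time, while earlier arrivals accumulate the difference.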
Unit:
Seconds
Parent metric:
MPI Window Management Time
Sub-metrics:
None

MPI Wait at Window Free Time

Description:
Time spent waiting in MPI_Win_free for the last process to join in the collective deallocation of an MPI window handle.
Unit:
Seconds
Parent metric:
MPI Window Management Time
Sub-metrics:
None

MPI One-sided Synchronization Time

(only available after remapping)
Description:
Time spent in MPI one-sided synchronization calls.
Unit:
Seconds
Parent metric:
MPI Synchronization Time
Sub-metrics:
MPI Active Target Synchronization Time
MPI One-sided Passive Target Synchronization Time

MPI Active Target Synchronization Time

(only available after remapping)
Description:
Time spent in MPI one-sided active target synchronization calls.
Unit:
Seconds
Parent metric:
MPI One-sided Synchronization Time
Sub-metrics:
MPI Late Post Time in Synchronizations
MPI Early Wait Time
MPI Wait at Fence Time

MPI Late Post Time in Synchronizations

Description:
Time spent in MPI one-sided active target access epoch synchronization operations, waiting for the corresponding exposure epoch to start.


MPI Late Post (Synchronization) Example

Unit:
Seconds
Parent metric:
MPI Active Target Synchronization Time
Sub-metrics:
None

MPI Early Wait Time

Description:
Idle time spent in MPI_Win_wait, waiting for the last corresponding access epoch to finish.


MPI Early Wait Example

Unit:
Seconds
Parent metric:
MPI Active Target Synchronization Time
Sub-metrics:
MPI Late Complete Time

MPI Late Complete Time

Description:
Time spent in the 'Early Wait' inefficiency pattern (see MPI Early Wait Time) due to a late completion of a corresponding access epoch. It refers to the timespan between the last RMA access and the last MPI_Win_complete call.


MPI Late Complete Example

Unit:
Seconds
Parent metric:
MPI Early Wait Time
Sub-metrics:
None

MPI Wait at Fence Time

Description:
Time spent in MPI_Win_fence waiting for other participating processes to reach the fence synchronization.


MPI Wait at Fence Example

Unit:
Seconds
Parent metric:
MPI Active Target Synchronization Time
Sub-metrics:
MPI Early Fence Time

MPI Early Fence Time

Description:
Time spent in MPI_Win_fence waiting for outstanding one-sided communication operations to this location to finish.


MPI Early Fence Example

Unit:
Seconds
Parent metric:
MPI Wait at Fence Time
Sub-metrics:
None

MPI One-sided Passive Target Synchronization Time

(only available after remapping)
Description:
Time spent in MPI one-sided passive target synchronization calls.
Unit:
Seconds
Parent metric:
MPI One-sided Synchronization Time
Sub-metrics:
MPI Lock Contention Time in Synchronizations
MPI Wait for Progress Time in Synchronizations

MPI Lock Contention Time in Synchronizations

Description:
Time spent waiting in MPI_Win_lock or MPI_Win_unlock before the lock on a window is acquired.
Unit:
Seconds
Parent metric:
MPI One-sided Passive Target Synchronization Time
Sub-metrics:
None

MPI Wait for Progress Time in Synchronizations

Description:
Time spent waiting in MPI_Win_lock or MPI_Win_unlock until the target is calling into an MPI API function that ensures remote progress.
Unit:
Seconds
Parent metric:
MPI One-sided Passive Target Synchronization Time
Sub-metrics:
None

MPI One-sided Communication Time

(only available after remapping)
Description:
Time spent in MPI one-sided communication operations, for example, MPI_Accumulate, MPI_Put, or MPI_Get.
Unit:
Seconds
Parent metric:
MPI Communication Time
Sub-metrics:
MPI Late Post Time in Communications
MPI Lock Contention Time in Communications
MPI Wait for Progress Time in Communications

MPI Late Post Time in Communications

Description:
Time spent in MPI one-sided communication operations waiting for the corresponding exposure epoch to start.


MPI Late Post (Communication) Example

Unit:
Seconds
Parent metric:
MPI One-sided Communication Time
Sub-metrics:
None

MPI Lock Contention Time in Communications

Description:
Time spent waiting in MPI_Win_lock or MPI_Win_unlock before the lock on a window is acquired.
Unit:
Seconds
Parent metric:
MPI One-sided Communication Time
Sub-metrics:
None

MPI Wait for Progress Time in Communications

Description:
Time spent waiting in MPI_Win_lock or MPI_Win_unlock until the target is calling into an MPI API function that ensures remote progress.
Unit:
Seconds
Parent metric:
MPI One-sided Communication Time
Sub-metrics:
None

Pair-wise MPI One-sided Synchronizations

Description:
MPI one-sided synchronization methods may synchronize processes when they need to ensure that no further one-sided communication operation will take place in the epoch to be closed. The pair-wise MPI one-sided synchronizations metric counts the number of remote processes that a process potentially has to wait for at the end of this epoch. For example, at a fence, a process will wait for every other process with the same window handle in this barrier-like construct. This is required because the target has no knowledge of whether a certain remote process has already completed its access epoch.
Unit:
Counts
Diagnosis:
A large count of pair-wise synchronizations indicates a tight coupling of the processes. A developer should then check whether this level of inter-process coupling is needed for the algorithm, or whether an algorithm with looser coupling may be beneficial.
Parent metric:
None
Sub-metrics:
Unneeded Pair-wise MPI One-sided Synchronizations

Unneeded Pair-wise MPI One-sided Synchronizations

Description:
The unneeded pair-wise MPI one-sided synchronizations express the number of situations where a process synchronized with a remote process at the end of an epoch, although no one-sided operation from that remote process has taken place in the corresponding epoch. A synchronization therefore would not be necessary to ensure consistency, and may decrease performance through over-synchronization.
Unit:
Counts
Diagnosis:
A high number of unneeded synchronizations indicates that a different synchronization mechanism (e.g., general active target synchronization (GATS) instead of fence) or more refined/precise access groups within GATS may be beneficial to the application's performance.
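The two counts can be illustrated with a small sketch for a single fence epoch. It assumes that every process sharing the window potentially waits for every other one, and that a synchronization is unneeded when the remote process performed no one-sided operation on the waiting process in that epoch. The function name and the input format are hypothetical and are not part of Scalasca.

```python
def pairwise_syncs(participants, accesses):
    """Count pair-wise and unneeded synchronizations for one fence epoch.

    participants: ranks sharing the window (hypothetical input format).
    accesses: set of (origin, target) pairs with at least one one-sided
              operation from origin to target during the epoch.
    """
    total = 0
    unneeded = 0
    for target in participants:
        for origin in participants:
            if origin == target:
                continue
            total += 1  # target potentially waits for origin at the fence
            if (origin, target) not in accesses:
                unneeded += 1  # origin never accessed target in this epoch
    return total, unneeded


# Four ranks, but only three origin/target pairs actually communicate.
total, unneeded = pairwise_syncs(range(4), {(0, 1), (2, 1), (3, 0)})
print(total, unneeded)  # 12 9
```

In this made-up epoch, nine of the twelve pair-wise synchronizations are unneeded, which is the kind of ratio that would suggest trying GATS with narrower access groups.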
Parent metric:
Pair-wise MPI One-sided Synchronizations
Sub-metrics:
None

Computational Load Imbalance Heuristic

(only available after remapping)
Description:
This simple heuristic helps to identify computational load imbalances and is calculated for each (call-path, process/thread) pair. Its value represents the absolute difference from the average computation time. This average value is the aggregated exclusive time spent by all processes/threads in this call-path, divided by the number of processes/threads visiting it.


Computational load imbalance Example

Note: A high value for a collapsed call tree node does not necessarily mean that there is a load imbalance in this particular node, but the imbalance can also be somewhere in the subtree underneath. Unused threads outside of OpenMP parallel regions are considered to constitute OpenMP Idle Threads Time and expressly excluded from the computational load imbalance heuristic.
Unit:
Seconds
Diagnosis:
Total load imbalance comprises both above-average and below-average computation time. Therefore, at most half of it could potentially be recovered with perfect (zero-overhead) load balance that distributed the excess from overloaded to underloaded processes/threads, such that all took exactly the same time.

Computation imbalance is often the origin of communication and synchronization inefficiencies, where processes/threads block and must wait idle for partners. However, work partitioning and parallelization overheads may be prohibitive for complex computations or unproductive for short computations. Replicating computation on all processes/threads will eliminate imbalance, but would typically not result in recovery of this imbalance time (though it may reduce associated communication and synchronization requirements).

Call paths with significant amounts of computational imbalance should be examined, along with processes/threads with above/below-average computation time, to identify parallelization inefficiencies. Call paths executed by a subset of processes/threads may relate to parallelization that hasn't been fully realized (Computational Load Imbalance Heuristic: Non-participation), whereas call paths executed only by a single process/thread (Computational Load Imbalance Heuristic: Single Participant) often represent unparallelized serial code, which will impede scalability as the number of processes/threads increases.
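The calculation described above can be sketched for a single call path as follows. This is a minimal illustration that assumes every listed process/thread visited the call path. The function name and the timing values are made up for the example.

```python
def load_imbalance(exclusive_times):
    """Per-location absolute difference from the average computation time.

    exclusive_times: exclusive time (seconds) spent in one call path, one
    entry per process/thread that visited it (hypothetical input).
    """
    average = sum(exclusive_times) / len(exclusive_times)
    return [abs(t - average) for t in exclusive_times]


times = [6.0, 2.0, 2.0, 2.0]          # one overloaded location, three underloaded
imbalance = load_imbalance(times)     # average is 3.0
total = sum(imbalance)
recoverable = total / 2               # excess above average equals deficit below it
print(imbalance, total, recoverable)  # [3.0, 1.0, 1.0, 1.0] 6.0 3.0
```

The excess of 3.0 seconds on the first location matches the combined deficit of the other three, which is why only half of the total imbalance is potentially recoverable.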
Parent metric:
None
Sub-metrics:
Computational Load Imbalance Heuristic: Overload
Computational Load Imbalance Heuristic: Underload

Computational Load Imbalance Heuristic: Overload

(only available after remapping)
Description:
This metric identifies processes/threads where the exclusive execution time spent for a particular call-path was above the average value. It is a complement to Computational Load Imbalance Heuristic: Underload.


Overload Example

See Computational Load Imbalance Heuristic for details on how this heuristic is calculated.
Unit:
Seconds
Diagnosis:
The CPU time which is above the average time for computation is the maximum that could potentially be recovered with perfect (zero-overhead) load balance that distributed the excess from overloaded to underloaded processes/threads.
Parent metric:
Computational Load Imbalance Heuristic
Sub-metrics:
Computational Load Imbalance Heuristic: Single Participant

Computational Load Imbalance Heuristic: Single Participant

(only available after remapping)
Description:
This heuristic distinguishes the execution time for call-paths executed by single processes/threads that potentially could be recovered with perfect parallelization using all available processes/threads.

It is the Computational Load Imbalance Heuristic: Overload time for call-paths that only have non-zero Visits for one process or thread, and complements Computational Load Imbalance Heuristic: Non-participation in Singularity.


Single participant Example

Unit:
Seconds
Diagnosis:
This time is often associated with activities done exclusively by a "Master" process/thread (often rank 0) such as initialization, finalization or I/O, but can apply to any process/thread that performs computation that none of its peers do (or that does its computation on a call-path that differs from the others).

The CPU time for singular execution of the particular call path typically presents a serial bottleneck impeding scalability as none of the other available processes/threads are being used, and they may well wait idling until the result of this computation becomes available. (Check the MPI communication and synchronization times, particularly waiting times, for proximate call paths.) In such cases, even small amounts of singular execution can have substantial impact on overall performance and parallel efficiency. With perfect partitioning and (zero-overhead) parallel execution of the computation, it would be possible to recover this time.

When the amount of time is small compared to the total execution time, or when the cost of parallelization is prohibitive, it may not be worth trying to eliminate this inefficiency. As the number of processes/threads is increased and/or total execution time decreases, however, the relative impact of this inefficiency can be expected to grow.
Parent metric:
Computational Load Imbalance Heuristic: Overload
Sub-metrics:
None

Computational Load Imbalance Heuristic: Underload

(only available after remapping)
Description:
This metric identifies processes/threads where the computation time spent for a particular call-path was below the average value. It is a complement to Computational Load Imbalance Heuristic: Overload.


Underload Example

See Computational Load Imbalance Heuristic for details on how this heuristic is calculated.
Unit:
Seconds
Diagnosis:
The CPU time which is below the average time for computation could potentially be used to reduce the excess from overloaded processes/threads with perfect (zero-overhead) load balancing.
Parent metric:
Computational Load Imbalance Heuristic
Sub-metrics:
Computational Load Imbalance Heuristic: Non-participation

Computational Load Imbalance Heuristic: Non-participation

(only available after remapping)
Description:
This heuristic distinguishes the execution time for call paths not executed by a subset of processes/threads that potentially could be used with perfect parallelization using all available processes/threads.

It is the Computational Load Imbalance Heuristic: Underload time for call paths which have zero Visits and were therefore not executed by this process/thread.


Non-participation Example

Unit:
Seconds
Diagnosis:
The CPU time used for call paths where not all processes or threads are exploited typically presents an ineffective parallelization that limits scalability, if the unused processes/threads wait idling for the result of this computation to become available. With perfect partitioning and (zero-overhead) parallel execution of the computation, it would be possible to recover this time.
Parent metric:
Computational Load Imbalance Heuristic: Underload
Sub-metrics:
Computational Load Imbalance Heuristic: Non-participation in Singularity

Computational Load Imbalance Heuristic: Non-participation in Singularity

(only available after remapping)
Description:
This heuristic distinguishes the execution time for call paths executed by only a single process/thread (i.e., not executed by any of the others) that potentially could be recovered with perfect parallelization using all available processes/threads.

It is the Computational Load Imbalance Heuristic: Underload time for call paths that only have non-zero Visits for one process/thread, and complements Computational Load Imbalance Heuristic: Single Participant.


Singularity Example

Unit:
Seconds
Diagnosis:
The CPU time for singular execution of the particular call path typically presents a serial bottleneck impeding scalability as none of the other processes/threads that are available are being used, and they may well wait idling until the result of this computation becomes available. With perfect partitioning and (zero-overhead) parallel execution of the computation, it would be possible to recover this time.
Parent metric:
Computational Load Imbalance Heuristic: Non-participation
Sub-metrics:
None

Critical Path Profile

Description:
This metric provides a profile of the application's critical path. Following the causality chain from the last active program process/thread back to the program start, the critical path shows the call paths and processes/threads that are responsible for the program's wall-clock runtime.


Critical path profile Example

Note that Scalasca does not yet consider POSIX threads when determining the critical path. Thus, the critical-path profile is currently incorrect if POSIX threads are being used, as only the master thread of each process is taken into account. However, it may still provide useful insights across processes for hybrid MPI+Pthreads applications.
Unit:
Seconds
Diagnosis:
Call paths that occupy a lot of time on the critical path are good optimization candidates. In contrast, optimizing call paths that do not appear on the critical path will not improve program runtime.

Call paths that spend a disproportionately large amount of time on the critical path with respect to their total execution time indicate parallel bottlenecks, such as load imbalance or serial execution. Use the percentage view modes and compare execution time and critical path profiles to identify such call paths.

The system tree pane shows the contribution of individual processes/threads to the critical path. However, note that the critical path runs only on one process at a time. In a well-balanced program, the critical path follows a more-or-less random course across processes and may not visit many processes at all. Therefore, a high critical-path time on individual processes does not necessarily indicate a performance problem. Exceptions are significant load imbalances or serial execution on single processes. Use the critical-path imbalance metric or compare with the distribution of execution time across processes to identify such cases.
Parent metric:
None
Sub-metrics:
Critical-Path Imbalance

Critical-Path Imbalance

(only available after remapping)
Description:
This metric highlights parallel performance bottlenecks.

In essence, the critical-path imbalance is the positive difference of the time a call path occupies on the critical path and the call path's average runtime across all CPU locations. Thus, a high critical-path imbalance identifies call paths which spend a disproportionate amount of time on the critical path.


Imbalance Example

The image above illustrates the critical-path profile and the critical-path imbalance for the example in the Critical Path Profile metric description. Note that the excess time of regions foo and baz on the critical path compared to their average execution time is marked as imbalance. While also on the critical path, region bar is perfectly balanced between the processes and therefore has no contribution to critical-path imbalance.
Unit:
Seconds
Diagnosis:
A high critical-path imbalance indicates a parallel bottleneck, such as load imbalance or serial execution. Cross-analyze with other metrics, such as the distribution of execution time across CPU locations, to identify the type and causes of the parallel bottleneck.
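The definition given in the description can be written down as a short sketch. The region names follow the foo/bar/baz example above, but the timing values are invented for illustration and the function name is hypothetical.

```python
def critical_path_imbalance(cp_time, location_times):
    """Positive difference between critical-path time and average runtime.

    cp_time: time (seconds) the call path occupies on the critical path.
    location_times: runtime of the call path on each location.
    """
    average = sum(location_times) / len(location_times)
    return max(0.0, cp_time - average)


# Hypothetical two-process run: (critical-path time, per-process runtimes).
regions = {
    "foo": (4.0, [4.0, 2.0]),
    "bar": (2.0, [2.0, 2.0]),
    "baz": (3.0, [1.0, 3.0]),
}
for name, (cp_time, times) in regions.items():
    print(name, critical_path_imbalance(cp_time, times))
# foo 1.0
# bar 0.0
# baz 1.0
```

With these numbers, bar is on the critical path but perfectly balanced, so it contributes no imbalance, while foo and baz each exceed their average runtime by one second.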
Parent metric:
Critical Path Profile
Sub-metrics:
None

Performance Impact

(only available after remapping)
Description:
This heuristic characterizes the performance impact of program activities (call paths) on the program as a whole. This includes the activities' direct impact on the CPU time, as well as their indirect impact through load imbalance.
Unit:
Seconds
Diagnosis:
Expand the metric tree hierarchy to identify the impact of activities on the critical path of the application compared to activities not located on the critical path. For critical-path activities, further expand the Critical-path Activities hierarchy to identify how much of the performance impact is due to imbalance rather than actual computation.

Expand the call tree to identify important callpaths and routines with the most impact on overall resource consumption.
Parent metric:
None
Sub-metrics:
Critical-path Activities
Non-critical-path Activities

Critical-path Activities

(only available after remapping)
Description:
Overall resource consumption caused by activities that appear on the critical path. While the Critical Path Profile metric calculates a profile of the critical path and thus also highlights the processes/threads taking part in its execution, this metric aggregates the overall resource consumption associated with the execution of critical-path activities, including any waiting times on processes/threads not on the critical path.
Unit:
Seconds
Diagnosis:
Expand the metric tree hierarchy to break down the overall resource consumption into the fraction that is caused by executing the critical-path activities themselves and the resources consumed by wait states caused by imbalances in these activities.
Parent metric:
Performance Impact
Sub-metrics:
Activity Impact
Critical Imbalance Impact

Activity Impact

Description:
Resource consumption caused by executing activities that appear on the critical path.
Unit:
Seconds
Parent metric:
Critical-path Activities
Sub-metrics:
None

Critical Imbalance Impact

Description:
This heuristic maps waiting time onto activities that spend "too much" time on the critical path, highlighting imbalanced activities that are likely responsible for wait states.

Unlike the Delay Costs metric, which identifies any delay that leads to a wait state at a synchronization point, the imbalance impact pinpoints inefficiencies which have a global runtime effect by mapping overall resource consumption to call paths that appear on the critical path. This makes it possible to distinguish different types of imbalances, for example, Intra-partition Imbalance and Inter-partition Imbalance, which are especially useful for the analysis of MPMD applications.
Unit:
Seconds
Diagnosis:
A high imbalance impact indicates a parallel bottleneck. Expand the metric tree hierarchy to distinguish between intra- and inter-partition imbalances.
Parent metric:
Critical-path Activities
Sub-metrics:
Intra-partition Imbalance
Inter-partition Imbalance

Intra-partition Imbalance

(only available after remapping)
Description:
Resource consumption caused by imbalances within process partitions that perform activities on the critical path.
Unit:
Seconds
Diagnosis:
A high intra-partition imbalance impact indicates that imbalances within the dominating (MPMD) partitions cause significant wait states. Compare with the Critical-Path Imbalance and Delay Costs metrics to identify the imbalanced processes/threads.
Parent metric:
Critical Imbalance Impact
Sub-metrics:
None

Inter-partition Imbalance

Description:
Resource consumption caused by imbalances between process partitions that perform activities on the critical path.
Unit:
Seconds
Diagnosis:
A high inter-partition imbalance impact indicates a sub-optimal partitioning in MPMD applications. Compare with the Critical Path Profile to identify the delaying partition and adjust the process or workload partitioning accordingly to achieve a better load balance.

Note that in hybrid MPI+OpenMP SPMD applications, master and worker threads are also considered as different partitions.
Parent metric:
Critical Imbalance Impact
Sub-metrics:
None

Non-critical-path Activities

Description:
Overall resource consumption caused by activities that do not appear on the critical path. As such, optimizing these activities does not improve the application runtime.
Unit:
Seconds
Parent metric:
Performance Impact
Sub-metrics:
None

Delay Costs

(only available after remapping)
Description:
This metric highlights the root causes of wait states. Root causes of wait states are regions of excess execution time — delays — that cause wait states at subsequent synchronization points. Whereas wait states represent the time spent idling at a synchronization point while waiting for the communication partner(s) to enter the communication operation, delays show which call paths caused the latecomer at a synchronization point to be late. The delay costs indicate the total amount of waiting time caused by a delay, including indirect effects of wait-states spreading along the communication chain.
Unit:
Seconds
Diagnosis:
Call paths and processes/threads with high delay costs pinpoint the location of delays. In general, shift work/communication load from processes/threads with high delay costs to processes/threads with large waiting times.

Delays fall into three main categories:

  1. Computational imbalance
    Delays within computational regions, in call paths that are also present on processes/threads that exhibit wait states, indicate a computational imbalance. Improve the workload balance by shifting workload within these call paths from processes/threads that are delayed to processes/threads that are waiting.
  2. Communication imbalance
    Delay costs within communication functions indicate an imbalanced communication load or inefficient communication pattern.
  3. Inefficient parallelism
    Delays in call paths that are only present on a single process/thread or a small subset of processes/threads indicate inefficient parallelism. Reduce the time spent in such functions.
Parent metric:
None
Sub-metrics:
MPI Delay Costs
OpenMP Delay Costs

MPI Delay Costs

(only available after remapping)
Description:
Total costs and locations of delays that cause wait states in MPI operations.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
Delay Costs
Sub-metrics:
MPI Point-to-point Delay Costs
MPI Collective Delay Costs

MPI Point-to-point Delay Costs

(only available after remapping)
Description:
Costs and locations of delays that cause wait states in MPI point-to-point communication.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
MPI Delay Costs
Sub-metrics:
MPI Late Sender Delay Costs
MPI Late Receiver Delay Costs

MPI Late Sender Delay Costs

(only available after remapping)
Description:
Costs and locations of delays that cause Late Sender wait states in MPI point-to-point communication.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
MPI Point-to-point Delay Costs
Sub-metrics:
Short-term MPI Late Sender Delay Costs
Long-term MPI Late Sender Delay Costs

Short-term MPI Late Sender Delay Costs

Description:
Short-term delay costs reflect the direct effect of load or communication imbalance on MPI Late Sender wait states.
Unit:
Seconds
Diagnosis:
High short-term delay costs indicate a computation or communication overload in/on the affected call paths and processes/threads. Because of this overload, the affected processes/threads arrive late at subsequent MPI send operations, thus causing Late Sender wait states on the remote processes.


Compare with MPI Late Sender Time to identify an imbalance pattern. Try to reduce workload in the affected call paths. Alternatively, shift workload in the affected call paths from processes/threads with delay costs to processes/threads that exhibit late-sender wait states.

Parent metric:
MPI Late Sender Delay Costs
Sub-metrics:
None

Long-term MPI Late Sender Delay Costs

Description:
Long-term delay costs reflect indirect effects of load or communication imbalance on wait states. That is, they cover waiting time that was caused indirectly by wait states which themselves delay subsequent communication operations.
Unit:
Seconds
Diagnosis:
High long-term delay costs indicate that computation or communication overload in/on the affected call paths and processes/threads has far-reaching effects. That is, the wait states caused by the original computational overload spread along the communication chain to remote locations.


Try to reduce workload in the affected call paths, or shift workload from processes/threads with delay costs to processes/threads that exhibit Late Sender wait states. Try to implement a more asynchronous communication pattern that can compensate for small imbalances, e.g., by using non-blocking instead of blocking communication.

Parent metric:
MPI Late Sender Delay Costs
Sub-metrics:
None

MPI Late Receiver Delay Costs

(only available after remapping)
Description:
Costs and locations of delays that cause Late Receiver wait states in MPI point-to-point communication.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
MPI Point-to-point Delay Costs
Sub-metrics:
Short-term MPI Late Receiver Delay Costs
Long-term MPI Late Receiver Delay Costs

Short-term MPI Late Receiver Delay Costs

Description:
Short-term delay costs reflect the direct effect of load or communication imbalance on MPI Late Receiver wait states.
Unit:
Seconds
Diagnosis:
High short-term delay costs indicate a computation or communication overload in/on the affected call paths and processes/threads. Because of this overload, the affected processes/threads arrive late at subsequent MPI receive operations, thus causing Late Receiver wait states on the remote processes.


Compare with MPI Late Receiver Time to identify an imbalance pattern. Try to reduce workload in the affected call paths. Alternatively, shift workload in the affected call paths from processes/threads with delay costs to processes/threads that exhibit late-receiver wait states.

Parent metric:
MPI Late Receiver Delay Costs
Sub-metrics:
None

Long-term MPI Late Receiver Delay Costs

Description:
Long-term delay costs reflect indirect effects of load or communication imbalance on wait states. That is, they cover waiting time that was caused indirectly by wait states which themselves delay subsequent communication operations.
Unit:
Seconds
Diagnosis:
High long-term delay costs indicate that computation or communication overload in/on the affected call paths and processes/threads has far-reaching effects. That is, the wait states caused by the original computational overload spread along the communication chain to remote locations.


Try to reduce workload in the affected call paths, or shift workload from processes/threads with delay costs to processes/threads that exhibit Late Receiver wait states. Try to implement a more asynchronous communication pattern that can compensate for small imbalances, e.g., by using non-blocking instead of blocking communication.

Parent metric:
MPI Late Receiver Delay Costs
Sub-metrics:
None

MPI Collective Delay Costs

(only available after remapping)
Description:
Costs and locations of delays causing wait states in MPI collective communication.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
MPI Delay Costs
Sub-metrics:
MPI Wait at Barrier Delay Costs
MPI Wait at N x N Delay Costs
MPI Late Broadcast Delay Costs
MPI Early Reduce Delay Costs
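
The collective delay-cost sub-tree listed in this section can be written down as a nested mapping, which makes the parent/sub-metric relationships easy to traverse. The sketch below uses made-up leaf values and assumes that parent metrics contribute no value of their own, so that a parent's total is the sum of its sub-metrics; the function name is hypothetical.

```python
# Sub-tree as listed in this section; leaf values are made-up numbers (seconds).
collective_delay_costs = {
    "MPI Wait at Barrier Delay Costs": {
        "Short-term MPI Barrier Delay Costs": 1.5,
        "Long-term MPI Barrier Delay Costs": 0.5,
    },
    "MPI Wait at N x N Delay Costs": {
        "Short-term MPI N x N Collectives Delay Costs": 2.0,
        "Long-term MPI N x N Collectives Delay Costs": 1.0,
    },
    "MPI Late Broadcast Delay Costs": {
        "Short-term MPI 1-to-N Collectives Delay Costs": 0.25,
        "Long-term MPI 1-to-N Collectives Delay Costs": 0.25,
    },
    "MPI Early Reduce Delay Costs": {
        "Short-term MPI N-to-1 Collectives Delay Costs": 0.5,
        "Long-term MPI N-to-1 Collectives Delay Costs": 0.0,
    },
}

def inclusive(node):
    """Sum all leaf values below a node (assumes parents add nothing themselves)."""
    if isinstance(node, dict):
        return sum(inclusive(child) for child in node.values())
    return node

for name, subtree in collective_delay_costs.items():
    print(name, inclusive(subtree))
print("MPI Collective Delay Costs", inclusive(collective_delay_costs))
```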

MPI Wait at Barrier Delay Costs

(only available after remapping)
Description:
Costs and locations of delays that cause wait states in MPI barrier synchronizations.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
MPI Collective Delay Costs
Sub-metrics:
Short-term MPI Barrier Delay Costs
Long-term MPI Barrier Delay Costs

Short-term MPI Barrier Delay Costs

Description:
Short-term delay costs reflect the direct effect of load or communication imbalance on MPI barrier wait states.
Unit:
Seconds
Diagnosis:
High short-term delay costs indicate a computation or communication overload in/on the affected call paths and processes/threads. Refer to Short-term MPI Late Sender Delay Costs for more information on reducing delay costs in general.
Parent metric:
MPI Wait at Barrier Delay Costs
Sub-metrics:
None

Long-term MPI Barrier Delay Costs

Description:
Long-term delay costs reflect indirect effects of load or communication imbalance on wait states. That is, they cover waiting time that was caused indirectly by wait states which themselves delay subsequent communication operations.
Unit:
Seconds
Diagnosis:
High long-term delay costs indicate that computation or communication overload in/on the affected call paths and processes/threads has far-reaching effects. Refer to Long-term MPI Late Sender Delay Costs for more information on reducing long-term delay costs in general.
Parent metric:
MPI Wait at Barrier Delay Costs
Sub-metrics:
None

MPI Wait at N x N Delay Costs

(only available after remapping)
Description:
Costs and locations of delays that cause wait states in MPI n-to-n collective communication operations.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
MPI Collective Delay Costs
Sub-metrics:
Short-term MPI N x N Collectives Delay Costs
Long-term MPI N x N Collectives Delay Costs

Short-term MPI N x N Collectives Delay Costs

Description:
Short-term costs reflect the direct effect of load or communication imbalance on wait states in MPI n-to-n collective communication operations.
Unit:
Seconds
Diagnosis:
High short-term delay costs indicate a computation or communication overload in/on the affected call paths and processes/threads. Refer to Short-term MPI Late Sender Delay Costs for more information on reducing delay costs in general.
Parent metric:
MPI Wait at N x N Delay Costs
Sub-metrics:
None

Long-term MPI N x N Collectives Delay Costs

Description:
Long-term delay costs reflect indirect effects of load or communication imbalance on wait states. That is, they cover waiting time that was caused indirectly by wait states which themselves delay subsequent communication operations.
Unit:
Seconds
Diagnosis:
High long-term delay costs indicate that computation or communication overload in/on the affected call paths and processes/threads has far-reaching effects. Refer to Long-term MPI Late Sender Delay Costs for more information on reducing long-term delay costs in general.
Parent metric:
MPI Wait at N x N Delay Costs
Sub-metrics:
None

MPI Late Broadcast Delay Costs

(only available after remapping)
Description:
Costs of delays that cause wait states in collective MPI 1-to-n communication operations.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
MPI Collective Delay Costs
Sub-metrics:
Short-term MPI 1-to-N Collectives Delay Costs
Long-term MPI 1-to-N Collectives Delay Costs

Short-term MPI 1-to-N Collectives Delay Costs

Description:
Short-term costs reflect the direct effect of load or communication imbalance on wait states in MPI 1-to-n collectives.
Unit:
Seconds
Diagnosis:
High short-term delay costs indicate a computation or communication overload in/on the affected call paths and processes/threads. Refer to Short-term MPI Late Sender Delay Costs for more information on reducing delay costs in general.
Parent metric:
MPI Late Broadcast Delay Costs
Sub-metrics:
None

Long-term MPI 1-to-N Collectives Delay Costs

Description:
Long-term delay costs reflect indirect effects of load or communication imbalance on wait states. That is, they cover waiting time that was caused indirectly by wait states which themselves delay subsequent communication operations.
Unit:
Seconds
Diagnosis:
High long-term delay costs indicate that computation or communication overload in/on the affected call paths and processes/threads has far-reaching effects. Refer to Long-term MPI Late Sender Delay Costs for more information on reducing long-term delay costs in general.
Parent metric:
MPI Late Broadcast Delay Costs
Sub-metrics:
None

MPI Early Reduce Delay Costs

(only available after remapping)
Description:
Costs of delays that cause wait states in collective MPI n-to-1 communication operations.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
MPI Collective Delay Costs
Sub-metrics:
Short-term MPI N-to-1 Collectives Delay Costs
Long-term MPI N-to-1 Collectives Delay Costs

Short-term MPI N-to-1 Collectives Delay Costs

Description:
Short-term costs reflect the direct effect of load or communication imbalance on wait states in MPI n-to-1 collectives.
Unit:
Seconds
Diagnosis:
High short-term delay costs indicate a computation or communication overload in/on the affected call paths and processes/threads. Refer to Short-term MPI Late Sender Delay Costs for more information on reducing delay costs in general.
Parent metric:
MPI Early Reduce Delay Costs
Sub-metrics:
None

Long-term MPI N-to-1 Collectives Delay Costs

Description:
Long-term delay costs reflect indirect effects of load or communication imbalance on wait states. That is, they cover waiting time that was caused indirectly by wait states which themselves delay subsequent communication operations.
Unit:
Seconds
Diagnosis:
High long-term delay costs indicate that computation or communication overload in/on the affected call paths and processes/threads has far-reaching effects. Refer to Long-term MPI Late Sender Delay Costs for more information on reducing long-term delay costs in general.
Parent metric:
MPI Early Reduce Delay Costs
Sub-metrics:
None

OpenMP Delay Costs

(only available after remapping)
Description:
Total costs and locations of delays that cause wait states in OpenMP constructs.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
Delay Costs
Sub-metrics:
OpenMP Wait at Barrier Delay Costs
OpenMP Thread Idleness Delay Costs

OpenMP Wait at Barrier Delay Costs

(only available after remapping)
Description:
Costs and locations of delays that cause wait states in OpenMP barrier synchronizations.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
OpenMP Delay Costs
Sub-metrics:
Short-term OpenMP Barrier Delay Costs
Long-term OpenMP Barrier Delay Costs

Short-term OpenMP Barrier Delay Costs

Description:
Short-term costs reflect the direct effect of load or communication imbalance on OpenMP barrier wait states.
Unit:
Seconds
Diagnosis:
High short-term delay costs indicate a computation or communication overload in/on the affected call paths and processes/threads. Refer to Short-term MPI Late Sender Delay Costs for more information on reducing delay costs in general.
Parent metric:
OpenMP Wait at Barrier Delay Costs
Sub-metrics:
None

Long-term OpenMP Barrier Delay Costs

Description:
Long-term delay costs reflect indirect effects of load or communication imbalance on wait states. That is, they cover waiting time that was caused indirectly by wait states which themselves delay subsequent communication operations.
Unit:
Seconds
Diagnosis:
High long-term delay costs indicate that computation or communication overload in/on the affected call paths and processes/threads has far-reaching effects. Refer to Long-term MPI Late Sender Delay Costs for more information on reducing long-term delay costs in general.
Parent metric:
OpenMP Wait at Barrier Delay Costs
Sub-metrics:
None

OpenMP Thread Idleness Delay Costs

(only available after remapping)
Description:
Costs and locations of delays that cause OpenMP worker threads to idle.
Unit:
Seconds
Diagnosis:
See Delay Costs for details.
Parent metric:
OpenMP Delay Costs
Sub-metrics:
Short-term OpenMP Thread Idleness Delay Costs
Long-term OpenMP Thread Idleness Delay Costs

Short-term OpenMP Thread Idleness Delay Costs

Description:
Short-term costs reflect the direct effect of sections outside of OpenMP parallel regions on thread idleness.
Unit:
Seconds
Diagnosis:
High short-term delay costs for thread idleness indicate that much time is spent outside of OpenMP parallel regions in the affected call paths.


Try to reduce workload in the affected call paths. Alternatively, apply OpenMP parallelism to more sections of the code.
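
To see why time outside parallel regions is costly, consider the following back-of-the-envelope sketch. It assumes that every thread has a CPU reserved for the whole run, so that all worker threads sit idle while only the master thread executes sequential code. The function name and the numbers are hypothetical.

```python
def idle_worker_time(serial_seconds, num_threads):
    """CPU time the worker threads spend idle while only the master thread runs."""
    return serial_seconds * (num_threads - 1)

threads = 8

# 2.0 s of sequential code between parallel regions.
before = idle_worker_time(2.0, threads)

# After parallelizing 1.5 s of that code, only 0.5 s stays sequential.
after = idle_worker_time(0.5, threads)

print(before, after)
```

With these numbers, the idle CPU time drops from 14.0 to 3.5 seconds, which illustrates that the cost of a sequential section grows with the number of threads left waiting.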

Parent metric:
OpenMP Thread Idleness Delay Costs
Sub-metrics:
None

Long-term OpenMP Thread Idleness Delay Costs

Description:
Long-term delay costs reflect indirect effects of load or communication imbalance on wait states. That is, they cover waiting time that was caused indirectly by wait states which themselves delay subsequent communication operations. Here, they identify costs and locations of delays that indirectly leave OpenMP worker threads idle due to wait-state propagation. In particular, long-term idle thread delay costs indicate call paths and processes/threads that increase the time worker threads are idling because of MPI wait states outside of OpenMP parallel regions.
Unit:
Seconds
Diagnosis:
High long-term delay costs indicate that computation or communication overload in/on the affected call paths and processes/threads has far-reaching effects. That is, the wait states caused by the original computational overload spread along the communication chain to remote locations.


Try to reduce workload in the affected call paths, or shift workload from processes/threads with delay costs to processes/threads that exhibit wait states. Try to implement a more asynchronous communication pattern that can compensate for small imbalances, e.g., by using non-blocking instead of blocking communication.

Parent metric:
OpenMP Thread Idleness Delay Costs
Sub-metrics:
None

MPI Point-to-point Wait State Classification: Direct vs. Indirect

(only available after remapping)
Description:
Partitions MPI point-to-point wait states into waiting time directly caused by delays and waiting time caused by propagation.
Unit:
Seconds
Parent metric:
None
Sub-metrics:
Direct MPI Point-to-point Wait States
Indirect MPI Point-to-point Wait States

Direct MPI Point-to-point Wait States

(only available after remapping)
Description:
Waiting time in MPI point-to-point operations that results from direct delay, i.e., is directly caused by a load or communication imbalance.
Unit:
Seconds
Parent metric:
MPI Point-to-point Wait State Classification: Direct vs. Indirect
Sub-metrics:
Direct MPI Late Sender Wait States
Direct MPI Late Receiver Wait States

Direct MPI Late Sender Wait States

(only available after remapping)
Description:
Waiting time in MPI Late Sender wait states that results from direct delay, i.e., is caused by load imbalance.
Unit:
Seconds
Parent metric:
Direct MPI Point-to-point Wait States
Sub-metrics:
None

Direct MPI Late Receiver Wait States

(only available after remapping)
Description:
Waiting time in MPI Late Receiver wait states that results from direct delay, i.e., is caused by load imbalance.
Unit:
Seconds
Parent metric:
Direct MPI Point-to-point Wait States
Sub-metrics:
None

Indirect MPI Point-to-point Wait States

(only available after remapping)
Description:
Waiting time in MPI point-to-point operations that results from indirect delay, i.e., is caused indirectly by wait-state propagation.
Unit:
Seconds
Parent metric:
MPI Point-to-point Wait State Classification: Direct vs. Indirect
Sub-metrics:
Indirect MPI Late Sender Wait States
Indirect MPI Late Receiver Wait States

Indirect MPI Late Sender Wait States

Description:
Waiting time in MPI Late Sender wait states that results from indirect delay, i.e., is caused indirectly by wait-state propagation.
Unit:
Seconds
Parent metric:
Indirect MPI Point-to-point Wait States
Sub-metrics:
None

Indirect MPI Late Receiver Wait States

Description:
Waiting time in MPI Late Receiver wait states that results from indirect delay, i.e., is caused by wait-state propagation.
Unit:
Seconds
Parent metric:
Indirect MPI Point-to-point Wait States
Sub-metrics:
None
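
The split between direct and indirect waiting time can be illustrated with a toy pipeline. This is a simplified attribution rule chosen for illustration, not Scalasca's algorithm: each rank computes, performs a blocking receive from its predecessor, and then immediately sends to its successor. Waiting time that the sender itself suffered is counted as indirect; the remainder is attributed to the sender's excess computation and counted as direct. The function name and compute times are hypothetical.

```python
def pipeline_waits(compute):
    """Return per-rank (direct, indirect) waiting time for a linear pipeline.

    Rank i computes for compute[i] seconds, then does a blocking receive from
    rank i-1 (if any) and immediately sends to rank i+1.
    """
    results = []
    prev_send_time = None   # when rank i-1 issues its send
    prev_wait = 0.0         # waiting time rank i-1 suffered itself
    for work in compute:
        if prev_send_time is None:
            wait = 0.0      # rank 0 receives nothing and never waits
        else:
            wait = max(0.0, prev_send_time - work)
        # The portion inherited from the sender's own waiting is indirect;
        # the remainder stems from the sender's excess computation.
        indirect = min(wait, prev_wait)
        direct = wait - indirect
        results.append((direct, indirect))
        prev_send_time = work + wait
        prev_wait = wait
    return results

# Hypothetical compute times in seconds for ranks 0, 1, and 2.
print(pipeline_waits([5.0, 3.0, 2.0]))
# → [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
```

Here rank 1 waits 2 seconds purely because rank 0 is overloaded (direct), while rank 2 waits 3 seconds, of which 2 seconds are inherited from rank 1's wait state (indirect).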

MPI Point-to-point Wait State Classification: Propagating vs. Terminal

(only available after remapping)
Description:
Partitions MPI point-to-point waiting time into wait states that propagate further (i.e., cause wait states on other processes) and those that do not.
Unit:
Seconds
Parent metric:
None
Sub-metrics:
Propagating MPI Point-to-point Wait States
Terminal MPI Point-to-point Wait States

Propagating MPI Point-to-point Wait States

(only available after remapping)
Description:
Waiting time in MPI point-to-point operations that propagates further and causes additional waiting time on other processes.
Unit:
Seconds
Parent metric:
MPI Point-to-point Wait State Classification: Propagating vs. Terminal
Sub-metrics:
Propagating MPI Late Sender Wait States
Propagating MPI Late Receiver Wait States

Propagating MPI Late Sender Wait States

Description:
Waiting time in MPI Late Sender wait states that propagates further and causes additional waiting time on other processes.
Unit:
Seconds
Parent metric:
Propagating MPI Point-to-point Wait States
Sub-metrics:
None

Propagating MPI Late Receiver Wait States

Description:
Waiting time in MPI Late Receiver wait states that propagates further and causes additional waiting time on other processes.
Unit:
Seconds
Parent metric:
Propagating MPI Point-to-point Wait States
Sub-metrics:
None

Terminal MPI Point-to-point Wait States

(only available after remapping)
Description:
Waiting time in MPI point-to-point operations that does not propagate further.
Unit:
Seconds
Parent metric:
MPI Point-to-point Wait State Classification: Propagating vs. Terminal
Sub-metrics:
Terminal MPI Late Sender Wait States
Terminal MPI Late Receiver Wait States

Terminal MPI Late Sender Wait States

(only available after remapping)
Description:
Waiting time in MPI Late Sender wait states that does not propagate further.
Unit:
Seconds
Parent metric:
Terminal MPI Point-to-point Wait States
Sub-metrics:
None

Terminal MPI Late Receiver Wait States

(only available after remapping)
Description:
Waiting time in MPI Late Receiver wait states that does not propagate further.
Unit:
Seconds
Parent metric:
Terminal MPI Point-to-point Wait States
Sub-metrics:
None
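
The propagating/terminal distinction can be pictured as a question about a chain of causes: does a given wait state go on to cause waiting elsewhere? The sketch below uses a hypothetical record of which wait state caused which, and treats each wait state as wholly propagating or wholly terminal, which is a simplification. The labels are made up for illustration.

```python
# Hypothetical propagation record: for each wait state, the wait state that
# caused it, or None if it was caused directly by a delay.
caused_by = {
    "rank1:MPI_Recv": None,              # caused by excess computation on rank 0
    "rank2:MPI_Recv": "rank1:MPI_Recv",  # rank 1 sent late because it was waiting
    "rank3:MPI_Recv": "rank2:MPI_Recv",  # rank 2 sent late because it was waiting
}

# A wait state is propagating if some other wait state names it as its cause.
propagating = {cause for cause in caused_by.values() if cause is not None}

# Everything else ends the chain and is terminal.
terminal = set(caused_by) - propagating

print(sorted(propagating))
print(sorted(terminal))
```

In this chain the wait states on ranks 1 and 2 are propagating, and the one on rank 3 is terminal.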

What is remapping?

A number of additional metrics can be calculated during an analysis report postprocessing step called remapping. In addition, remapping organizes the performance properties hierarchically, which allows analysis reports to be examined at different levels of granularity. The remapping step is performed automatically by the Scalasca convenience command scalasca -examine (or square for short) the first time an experiment archive is examined. Thus, it should be transparent to users following the recommended workflow as described in the Scalasca User Guide.

However, the remapping process can also be performed manually, if necessary, using the command-line tool cube_remap2 from the CubeLib package. This tool reads an input Cube file and generates a corresponding output Cube file according to a remapping specification. Note that runtime summaries and trace analysis reports require different remapping specifications. To postprocess a Score-P runtime summary report profile.cubex and create a summary.cubex report, use

    cube_remap2 -d -r `scorep-config --remap-specfile` -o summary.cubex profile.cubex

Likewise, to postprocess a Scalasca trace analysis report scout.cubex and create a trace.cubex report, use

    cube_remap2 -d -r `scalasca --remap-specfile` -o trace.cubex scout.cubex

Note that as of Score-P v5.0 and Scalasca v2.6, respectively, the remapping specification is embedded in the runtime summary and trace analysis reports if the specification file can be read from the installation directory at measurement/analysis time. In this case, the -r <file> option can be omitted from the commands above. However, this embedded specification is dropped during any Cube algebra operation (e.g., cube_cut or cube_merge).
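
For scripted postprocessing, the two command variants can be assembled programmatically. The following sketch only builds the argument list and prints it; it does not run cube_remap2. The helper name and the specification path are hypothetical, and only the options mentioned in this section (-d, -r, -o) are used.

```python
import shlex

def remap_command(input_cube, output_cube, spec_file=None):
    """Assemble the cube_remap2 argument list; does not run anything."""
    cmd = ["cube_remap2", "-d"]
    if spec_file is not None:
        # Explicit specification, e.g. the path printed by
        # `scorep-config --remap-specfile` or `scalasca --remap-specfile`.
        cmd += ["-r", spec_file]
    cmd += ["-o", output_cube, input_cube]
    return cmd

# Hypothetical spec-file path, used here only as a placeholder.
summary_cmd = remap_command("profile.cubex", "summary.cubex", "/path/to/scorep.spec")

# Without -r, the command relies on a specification embedded in the report.
trace_cmd = remap_command("scout.cubex", "trace.cubex")

print(shlex.join(summary_cmd))
print(shlex.join(trace_cmd))
```

Omitting the specification is only appropriate when the report still carries its embedded specification, which, as noted above, is not the case after Cube algebra operations.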

IMPORTANT NOTE:

Remapping specifications are typically targeted towards a particular version of Score-P or Scalasca. Thus, it is highly recommended to use the remapping specification distributed with the Score-P/Scalasca version that was used to generate the input report. Otherwise, the remapping may produce unexpected results.
Scalasca    Copyright © 1998–2021 Forschungszentrum Jülich GmbH, Jülich Supercomputing Centre
Copyright © 2009–2015 German Research School for Simulation Sciences GmbH, Laboratory for Parallel Programming