
OpenMP

Intro

OpenMP is a thread-based parallel programming model for shared-memory architectures. It targets SMP shared-memory multiprocessor systems and multi-core processor architectures, and it supports C, C++, and Fortran.

OpenMP consists of three parts:

  • compiler directives
  • runtime library routines
  • environment variables
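
A minimal sketch that touches all three parts (the file name and the thread count of 4 are arbitrary): a compiler directive opens the parallel region, runtime routines query the thread ID and team size, and the OMP_NUM_THREADS environment variable sets the team size.

// hello_omp.c -- compile with e.g.  gcc -fopenmp hello_omp.c
// and run with:                     export OMP_NUM_THREADS=4 && ./a.out
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel                    // compiler directive
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(),        // runtime library routines
               omp_get_num_threads());
    }
    return 0;
}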

OpenMP’s machine model

OpenMP: Shared-Memory Parallel Programming Model

All processors/cores access a shared main memory.

Parallelization in OpenMP employs multiple threads.

OpenMP’s memory model

  • All threads have access to the same, globally shared memory.
  • Data in private memory is only accessible by the thread owning the memory.
  • No other thread sees the changes in private memory.
  • Data Transfer is through shared memory and 100% transparent to the application.

OpenMP’s Execution model

  • OpenMP programs start with just one thread: the Master Thread. It is used as the Initial Thread.

  • Worker threads are spawned at a Parallel Region; together with the Master they form the team of threads.
  • In between Parallel Regions the Worker Threads are put to sleep. The OpenMP runtime takes care of all thread management work.

Concept: the fork-join model, the fundamental model behind OpenMP. It allows for incremental parallelization.

Parallel Region and Structured Blocks

The parallelism has to be expressed explicitly:

#pragma omp parallel
{
    ...
    structured block
    ...
}

structured block

  • exactly one entry point at the top
  • exactly one exit point at the bottom
  • branching in or out is not allowed
  • terminating the program is allowed (abort/exit)

Specifying the number of threads

  • Environment variable:

      export OMP_NUM_THREADS=...

  • via the num_threads clause:

    add the num_threads clause to the parallel construct (see the sketch below)
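
A short sketch of the num_threads clause; the value 4 is arbitrary and takes precedence over OMP_NUM_THREADS for this region.

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)   // request a team of 4 threads for this region
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}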

Worksharing

  • Loop construct
  • Sections/Section construct
  • Single Construct
  • Task Construct

If only the parallel construct is used, each thread executes the structured block.

Program speedup: Worksharing

OpenMP’s most common Worksharing construct: for

int i;

// Sequential code
for (i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
}

// OpenMP parallel region
#pragma omp parallel
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    if (id == Nthrds - 1) {
        iend = N;
    }

    for (i = istart; i < iend; ++i) {
        a[i] = b[i] + c[i];
    }
}

// OpenMP parallel region and 
// a worksharing for construct
#pragma omp parallel
#pragma omp for
for (i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
}

Distribution of loop iterations over all threads in a team.

Scheduling of the distribution can be influenced.

Loops often account for most of a program’s runtime.

(Figure: worksharing illustration)

Influencing the for-loop scheduling

for construct: OpenMP allows you to influence how the iterations are scheduled among the threads of the team via the schedule clause.

  • schedule(static [, chunk]): The iteration space is divided into blocks of chunk size; blocks are assigned to threads in a round-robin fashion. If chunk is not specified, there is one block per thread (#threads blocks).
  • schedule(dynamic [, chunk]): The iteration space is divided into blocks of chunk size (default: 1); blocks are handed to threads in the order in which the threads finish their previous blocks.
  • schedule(guided [, chunk]): Similar to dynamic, but the block size starts with an implementation-defined value and is then decreased exponentially down to chunk.
  • schedule(runtime): Schedule and chunk size are taken from the OMP_SCHEDULE environment variable (or set via the runtime library).
  • schedule(auto): The schedule is left up to the runtime to choose (it does not have to be any of the above).

When to use the schedule clause

  • static

    Pre-determined and predictable work per iteration, known to the programmer. Least work at runtime: the schedule is fixed before the loop runs.

  • dynamic

    Unpredictable, highly variable work per iteration. Most work at runtime: complex scheduling logic is used while the loop runs.

The default on most implementations is schedule(static).
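
A short sketch of the clause on a worksharing loop; N, a, b, and do_work() are hypothetical, and the chunk size of 4 is an arbitrary choice suited to iterations with uneven cost.

// Chunks of 4 iterations are handed to threads as they become free.
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; ++i) {
    a[i] = do_work(b[i]);   // per-iteration cost may vary strongly
}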

Critical Region

A Critical Region is executed by all threads, but by only one thread at a time (mutual exclusion).

int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; ++i) {
    #pragma omp critical
    { s = s + a[i]; }
}

The Barrier Construct

Threads wait until all the threads of the current Team have reached the barrier.

#pragma omp barrier

All worksharing constructs contain an implicit barrier at the end.
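
A minimal sketch, assuming hypothetical functions phase1() and phase2(), where phase 2 must not start before every thread has finished phase 1.

#pragma omp parallel
{
    phase1(omp_get_thread_num());   // all threads work on phase 1
    #pragma omp barrier             // wait until the whole team has finished phase 1
    phase2(omp_get_thread_num());   // phase 2 may now read phase-1 results
}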

The Single Construct

#pragma omp single [clause]
... structured block ...
  • The single construct specifies that the enclosed structured block is executed by only one thread of the team (it is up to the runtime which thread that is); see the sketch after this list.

  • Useful for:

    • I/O
    • Memory allocation and deallocation, etc.
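
A sketch, assuming a shared double *buf, a size N, and <stdlib.h> included; note that single, as a worksharing construct, has an implicit barrier at the end, so all threads see the allocation.

#pragma omp parallel
{
    #pragma omp single
    {
        buf = malloc(N * sizeof(double));   // done once, by whichever thread arrives
    }   // implicit barrier: the team waits here until buf is ready
    // ... all threads can now safely use buf ...
}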

The master Construct

#pragma omp master [clause]
... structured block ...
  • The master construct specifies that the enclosed structured block is executed only by the master thread of the team (see the sketch after this list).

  • Note: The master construct is not a worksharing construct and does not contain an implicit barrier at the end.
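
A sketch, assuming hypothetical read_input() and compute() functions and a shared params object; because master has no implicit barrier, an explicit barrier is needed before the other threads use the data.

#pragma omp parallel
{
    #pragma omp master
    {
        read_input(&params);    // I/O done by the master thread only
    }
    #pragma omp barrier         // required: master has no implicit barrier
    compute(&params);           // now every thread can use the input data
}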

The sections/section Construct

#pragma omp parallel
{
    #pragma omp sections
    {
    #pragma omp section
    x_calculation();
    #pragma omp section
    y_calculation();
    #pragma omp section
    z_calculation();
    }
}

False Sharing

Caches

CPU is fast

  • Order of 3.0 GHz

Caches are fast, but expensive

  • Small, order of MB

Memory is slow

  • Order of 0.3 GHz
  • Large, order of GB

Thus, good utilization of caches is crucial for good performance of HPC applications!

Data in Caches

When data is used, it is copied into caches.

The hardware always copies whole chunks, so-called cache lines, into the cache.

This is useful when:

  • the data is used frequently (temporal locality)
  • consecutive data on the same cache line is used (spatial locality)


False Sharing

False Sharing occurs when:

  • different threads use elements of the same cache line, and
  • at least one of the threads writes to the cache line. Each update causes the cache line to “slosh back and forth” between the threads.

As a result, the cache line is moved between the threads, although there is no real data dependency.

Note: False sharing is a performance problem, not a correctness issue.

False Sharing

double s_priv[nthreads];
#pragma omp parallel num_threads(nthreads)
{
    int t = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < 99; ++i) {
        s_priv[t] += a[i];
    }
} // end parallel

for (int i = 0; i < nthreads; ++i) {
    s += s_priv[i];
}

No performance benefit for more threads!

  • Reason: false sharing of s_priv
  • Solution: padding, so that only one variable per cache line is used.

False Sharing

/* False sharing avoided */
// The factor 8 ensures that different threads do not write to the same cache line
double s_priv[nthreads * 8];
#pragma omp parallel num_threads(nthreads)
{
    int t = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < 99; ++i) {
        s_priv[t * 8] += a[i];
    }
} // end parallel

for (int i = 0; i < nthreads; ++i) {
    s += s_priv[i * 8];
}


Race Condition

Threads communicate by sharing variables; unintended sharing of data causes race conditions.

Data Race: a typical OpenMP programming error, which occurs when:

  • two or more threads access the same memory location, and
  • at least one of the accesses is a write, and
  • the accesses are not protected by locks or critical regions, and
  • the accesses are not synchronized, e.g. by a barrier.

Non-deterministic occurrence: e.g. the sequence of execution of parallel loop iterations is non-deterministic and may change from run to run.

In many cases private clauses, barriers or critical regions are missing.

Data races are hard to find using a traditional debugger.

  • Use tools like Intel Inspector XE, ThreadSanitizer, or Archer.
/* Example of Calc Pi */
double f(double x) {
    return (4.0 / (1.0 + x * x));
}

double CalcPi(int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;

#pragma omp parallel for private(fX, i, fSum) // fSum is privatized here: the shared fSum is never updated, so CalcPi returns 0
    for (i = 0; i < n; i++) {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
    }
    return fH * fSum;
}
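
A sketch of a corrected version (the name CalcPiFixed is just for illustration): fX stays private, the loop variable of a parallel for is private automatically, and the partial sums are combined across threads with a reduction clause.

double CalcPiFixed(int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;

#pragma omp parallel for private(fX) reduction(+: fSum)
    for (i = 0; i < n; i++) {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);    // each thread accumulates into its own copy;
                          // the copies are summed at the end of the loop
    }
    return fH * fSum;
}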

atomic directive

The statement inside an atomic construct must have one of the following forms:

x binop= expr
x++
++x
x--
--x

x is an lvalue of scalar type and binop is a non-overloaded built-in operator.
#pragma omp parallel
{
    double tmp, B;
    B = DOIT();
    tmp = big_ugly(B);

#pragma omp atomic
    X += tmp;    // the update of the shared variable X is performed atomically
}

The barrier directive

All tasks created by any thread of the current Team are guaranteed to be completed at barrier exit.

taskwait directive

A stand-alone directive

#pragma omp taskwait

Waits for the completion of the child tasks of the current task; only direct children, not all descendant tasks. It includes an implicit task scheduling point (TSP).
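
A sketch of the classic use case, a task-parallel Fibonacci, where the parent task must wait for its two direct children before combining their results.

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait    // waits only for the two child tasks created above
    return x + y;
}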


taskgroup directive

Attached to a structured block; waits for the completion of the child tasks created in the block and all of their descendants; TSP at the end.

#pragma omp taskgroup [clause[[,] clause]...]
{structured-block}

where clause can only be: task_reduction(reduction-identifier : list-items)
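
A sketch, assuming a hypothetical process_tree() that recursively creates further tasks; unlike taskwait, the taskgroup also waits for those nested descendants.

#pragma omp taskgroup
{
    #pragma omp task
    process_tree(root);   // may itself spawn more tasks
}   // waits here for process_tree() and every task it created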


Task Synchronization explained

(Figure: task synchronization example)

Loops with Tasks

The taskloop Construct

A task-generating construct: it decomposes a loop into chunks and creates a task for each loop chunk.

#pragma omp taskloop [clause[[,] clause]...]
{structured-for-loops}

where clause is one of the taskloop clauses, e.g. grainsize, num_tasks, or nogroup (see the sketch below).
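
A minimal sketch, assuming arrays a, b, c of length N; grainsize(500) is an arbitrary value asking for roughly 500 iterations per generated task.

#pragma omp parallel
#pragma omp single    // one thread creates the loop tasks, the team executes them
{
    #pragma omp taskloop grainsize(500)
    for (long i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
    }
}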

Taskloop decomposition approaches


If none of the chunk-size clauses (grainsize, num_tasks) is present, the number of chunks and the number of iterations per chunk are implementation defined.

Additional considerations:

  • The order of the creation of the loop tasks is unspecified
  • taskloop creates an implicit taskgroup region; with nogroup, no implicit taskgroup region is created.

Task scheduling

Default: tasks are tied to the thread that first executes them, which is not necessarily the creator.

Scheduling Constraints:

  • Only the thread a task is tied to can execute it
  • A task can only be suspended at task scheduling points: task creation, task finish, taskwait, barrier, taskyield
  • If a task is not suspended in a barrier, the executing thread can only switch to a direct descendant of all tasks tied to the thread.

Tasks created with the untied clause are never tied:

  • They may resume at task scheduling points, possibly on a different thread.
  • But: More freedom to the implementation, e.g., load balancing.

Unsafe use of untied Tasks

Problem: because untied tasks may migrate between threads at any task scheduling point, thread-centric constructs can yield unexpected results.

Remember when using untied tasks:

  • Avoid threadprivate variables
  • Avoid any use of thread IDs (e.g., omp_get_thread_num())
  • Be careful with critical regions and locks

Simple Solution

  • Create a tied task region with #pragma omp task if(0), as sketched below.
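
A sketch of this trick: the if(0) clause makes the inner task undeferred and (by default) tied, so the thread-ID-dependent part cannot migrate even though the surrounding task is untied. use_thread_buffer() is a hypothetical thread-ID-based routine.

#pragma omp task untied
{
    // ... work that may migrate between threads at scheduling points ...
    #pragma omp task if(0)   // undeferred, tied task: executed immediately, right here
    {
        int id = omp_get_thread_num();   // safe: this inner task stays on one thread
        use_thread_buffer(id);
    }
    // ... more migratable work ...
}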

The taskyield Directive

The taskyield directive specifies that the current task can be suspended in favor of the execution of a different task.

  • Hint to the runtime for optimization and/or deadlock prevention.
    #pragma omp taskyield
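
The classic deadlock-avoidance sketch, assuming an initialized omp_lock_t named lock and a hypothetical critical_work() function: while the task cannot acquire the lock, it yields so the thread can execute other tasks (possibly the one holding the lock) instead of spinning.

#pragma omp task
{
    while (!omp_test_lock(&lock)) {
        #pragma omp taskyield    // let the thread run another task in the meantime
    }
    critical_work();             // region protected by the lock
    omp_unset_lock(&lock);
}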

Tasks and Dependencies

Task dependencies constrain the execution order and execution times of tasks.
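
A sketch using the depend clause, with hypothetical produce()/consume() functions and a shared variable x: the consumer task is not started before the producer task has finished writing x.

int x = 0;
#pragma omp parallel
#pragma omp single
{
    #pragma omp task depend(out: x)
    x = produce();        // writes x

    #pragma omp task depend(in: x)
    consume(x);           // reads x: runs only after the producer task completed
}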

NUMA

Non-Uniform Memory Architecture: every core can access all of main memory, but access to memory attached to a remote socket is slower than access to the socket's local memory.

Get info on the system topology

Before you design a strategy for thread binding, you should have a basic understanding of the system topology. Use one of the following options on the target machine:

Intel MPI’s cpuinfo tool

  • module switch openmpi intelmpi
  • cpuinfo
  • Delivers information about the number of sockets (= packages) and the mapping of processor IDs used by the operating system to CPU cores.

hwloc’s tool

  • lstopo (command line: hwloc-ls)
  • Displays a graphical representation of the system topology, separated into NUMA nodes, along with the mapping of processor IDs used by the operating system to CPU cores and additional information on caches.

Decide for Binding Strategy

Selecting the right binding strategy depends not only on the topology, but also on the characteristics of your application.

Putting threads far apart, i.e. on different sockets:

  • May improve the aggregated memory bandwidth available to your application
  • May improve the combined cache size available to your application
  • May decrease performance of synchronization constructs

Putting threads close together, i.e. on two adjacent cores that possibly share some caches:

  • May improve performance of synchronization constructs
  • May decrease the available memory bandwidth and cache size

OpenMP 4.0: Places + Binding Policies (1/2)

Define OpenMP Places

  • Set of OpenMP threads running on one or more processors
  • Can be defined by the user, e.g. OMP_PLACES=cores

Define a set of OpenMP Thread Affinity Policies

  • SPREAD: spread OpenMP threads evenly among the places
  • CLOSE: pack OpenMP threads near master thread
  • MASTER: collocate OpenMP threads with the master thread

Goals

  • Give the user a way to specify where to execute OpenMP threads, for:
    • locality between OpenMP threads
    • less false sharing
    • memory bandwidth

Abstract names for OMP_PLACES:

  • threads: each place corresponds to a single hardware thread on the target machine.
  • cores: each place corresponds to a single core (having one or more hardware threads) on the target machine.
  • sockets: each place corresponds to a single socket (consisting of one or more cores) on the target machine.
  • ll_caches (5.1): each place corresponds to a set of cores that share the last-level cache.
  • numa_domains (5.1): each place corresponds to a set of cores whose closest memory is the same memory, at a similar distance from the cores.

OpenMP 4.0: Places + Binding Policies (2/2)

(Figure: example of places and binding policies)


This post is licensed under CC BY 4.0 by the author.