
OpenMP

Intro

OpenMP is a thread-based parallel programming model for shared-memory architectures. It targets SMP shared-memory multiprocessor systems and multi-core processor architectures, and it supports C, C++, and Fortran.

OpenMP consists of three parts:

  • compiler directives
  • runtime library routines
  • environment variables
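
A minimal sketch that touches all three parts (the file name and the thread count of 4 are arbitrary): a compiler directive opens the parallel region, runtime routines query the thread ID and team size, and the OMP_NUM_THREADS environment variable sets the team size.

// hello_omp.c -- compile with e.g.  gcc -fopenmp hello_omp.c
// and run with:                     export OMP_NUM_THREADS=4 && ./a.out
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel                    // compiler directive
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(),        // runtime library routines
               omp_get_num_threads());
    }
    return 0;
}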

OpenMP’s machine model

OpenMP: Shared-Memory Parallel Programming Model

All processors/cores access a shared main memory.

Parallelization in OpenMP employs multiple threads.

OpenMP’s memory model

  • All threads have access to the same, globally shared memory.
  • Data in private memory is only accessible by the thread owning the memory.
  • No other thread sees the changes in private memory.
  • Data Transfer is through shared memory and 100% transparent to the application.

OpenMP’s Execution model

  • OpenMP programs start with just one thread: the Master Thread. It is used as the Initial Thread.

  • Worker threads are spawned at a Parallel Region; together with the Master they form the team of threads.
  • In between Parallel Regions the Worker Threads are put to sleep. The OpenMP runtime takes care of all thread management work.

Concept: the fork-join model, the fundamental model behind OpenMP. It allows for incremental parallelization.

Parallel Region and Structured Blocks

The parallelism has to be expressed explicitly:

#pragma omp parallel
{
    ...
    structured block
    ...
}

structured block

  • exactly one entry point at the top
  • exactly one exit point at the bottom
  • branching in or out is not allowed
  • terminating the program is allowed (abort/exit)

Specifying the number of threads

  • Environment variable:

      export OMP_NUM_THREADS=...

  • via the num_threads clause:

    add the num_threads clause to the parallel construct (see the sketch below)
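
A short sketch of the num_threads clause; the value 4 is arbitrary and takes precedence over OMP_NUM_THREADS for this region.

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)   // request a team of 4 threads for this region
    {
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}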

Worksharing

  • Loop construct
  • Sections/Section construct
  • Single Construct
  • Task Construct

If only the parallel construct is used, each thread executes the structured block.

Program speedup: Worksharing

OpenMP’s most common Worksharing construct: for

int i;

// Sequential code
for (i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
}

// OpenMP parallel region
#pragma omp parallel
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id+1) * N / Nthrds;
    if (id == Nthrds - 1) {
        iend = N;
    }

    for (i = istart; i < iend; ++i) {
        a[i] = b[i] + c[i];
    }
}

// OpenMP parallel region and 
// a worksharing for construct
#pragma omp parallel
#pragma omp for
for (i = 0; i < N; ++i) {
    a[i] = b[i] + c[i];
}

Distribution of loop iterations over all threads in a team.

Scheduling of the distribution can be influenced.

Loops often account for most of a program’s runtime.

(Figure: worksharing illustration)

Influencing the for-loop scheduling

for construct: OpenMP allows you to influence how the iterations are scheduled among the threads of the team via the schedule clause.

  • schedule(static [, chunk]): The iteration space is divided into blocks of chunk size; blocks are assigned to threads in a round-robin fashion. If chunk is not specified, there is one block per thread (#threads blocks).
  • schedule(dynamic [, chunk]): The iteration space is divided into blocks of chunk size (default: 1); blocks are handed to threads in the order in which the threads finish their previous blocks.
  • schedule(guided [, chunk]): Similar to dynamic, but the block size starts with an implementation-defined value and is then decreased exponentially down to chunk.
  • schedule(runtime): Schedule and chunk size are taken from the OMP_SCHEDULE environment variable (or set via the runtime library).
  • schedule(auto): The schedule is left up to the runtime to choose (it does not have to be any of the above).

When to use the schedule clause

  • static

    Pre-determined and predictable work per iteration, known to the programmer. Least work at runtime: the schedule is fixed before the loop runs.

  • dynamic

    Unpredictable, highly variable work per iteration. Most work at runtime: complex scheduling logic is used while the loop runs.

The default on most implementations is schedule(static).
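
A short sketch of the clause on a worksharing loop; N, a, b, and do_work() are hypothetical, and the chunk size of 4 is an arbitrary choice suited to iterations with uneven cost.

// Chunks of 4 iterations are handed to threads as they become free.
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; ++i) {
    a[i] = do_work(b[i]);   // per-iteration cost may vary strongly
}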

Critical Region

A Critical Region is executed by all threads, but by only one thread at a time (mutual exclusion).

int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; ++i) {
    #pragma omp critical
    { s = s + a[i]; }
}

The Barrier Construct

Threads wait until all the threads of the current Team have reached the barrier.

#pragma omp barrier

All worksharing constructs contain an implicit barrier at the end.
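
A minimal sketch, assuming hypothetical functions phase1() and phase2(), where phase 2 must not start before every thread has finished phase 1.

#pragma omp parallel
{
    phase1(omp_get_thread_num());   // all threads work on phase 1
    #pragma omp barrier             // wait until the whole team has finished phase 1
    phase2(omp_get_thread_num());   // phase 2 may now read phase-1 results
}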

The Single Construct

#pragma omp single [clause]
... structured block ...
  • The single construct specifies that the enclosed structured block is executed by only one thread of the team (it is up to the runtime which thread that is); see the sketch after this list.

  • Useful for:

    • I/O
    • Memory allocation and deallocation, etc.
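
A sketch, assuming a shared double *buf, a size N, and <stdlib.h> included; note that single, as a worksharing construct, has an implicit barrier at the end, so all threads see the allocation.

#pragma omp parallel
{
    #pragma omp single
    {
        buf = malloc(N * sizeof(double));   // done once, by whichever thread arrives
    }   // implicit barrier: the team waits here until buf is ready
    // ... all threads can now safely use buf ...
}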

The master Construct

#pragma omp master [clause]
... structured block ...
  • The master construct specifies that the enclosed structured block is executed only by the master thread of the team (see the sketch after this list).

  • Note: The master construct is not a worksharing construct and does not contain an implicit barrier at the end.
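
A sketch, assuming hypothetical read_input() and compute() functions and a shared params object; because master has no implicit barrier, an explicit barrier is needed before the other threads use the data.

#pragma omp parallel
{
    #pragma omp master
    {
        read_input(&params);    // I/O done by the master thread only
    }
    #pragma omp barrier         // required: master has no implicit barrier
    compute(&params);           // now every thread can use the input data
}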

The sections/section Construct

#pragma omp parallel
{
    #pragma omp sections
    {
    #pragma omp section
    x_calculation();
    #pragma omp section
    y_calculation();
    #pragma omp section
    z_calculation();
    }
}

False Sharing

Caches

CPU is fast

  • Order of 3.0 GHz

Caches are fast, but expensive

  • Small, order of MB

Memory is slow

  • Order of 0.3 GHz
  • Large, order of GB

Thus, good utilization of caches is crucial for good performance of HPC applications!

Data in Caches

When data is used, it is copied into caches.

The hardware always copies whole chunks, so-called cache lines, into the cache.

This is useful when:

  • the data is used frequently (temporal locality)
  • consecutive data on the same cache line is used (spatial locality)


False Sharing

False Sharing occurs when:

  • different threads use elements of the same cache line, and
  • at least one of the threads writes to the cache line. Each update causes the cache line to “slosh back and forth” between the threads.

As a result, the cache line is moved between the threads, although there is no real data dependency.

Note: False sharing is a performance problem, not a correctness issue.

False Sharing

double s_priv[nthreads];
#pragma omp parallel num_threads(nthreads)
{
    int t = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < 99; ++i) {
        s_priv[t] += a[i];
    }
} // end parallel

for (int i = 0; i < nthreads; ++i) {
    s += s_priv[i];
}

No performance benefit for more threads!

  • Reason: false sharing of s_priv
  • Solution: padding, so that only one variable per cache line is used.

False Sharing

/* False sharing avoided */
// The factor 8 ensures that different threads do not write to the same cache line
double s_priv[nthreads * 8];
#pragma omp parallel num_threads(nthreads)
{
    int t = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < 99; ++i) {
        s_priv[t * 8] += a[i];
    }
} // end parallel

for (int i = 0; i < nthreads; ++i) {
    s += s_priv[i * 8];
}


Race Condition

Threads communicate by sharing variables; unintended sharing of data causes race conditions.

Data Race: a typical OpenMP programming error, which occurs when:

  • two or more threads access the same memory location, and
  • at least one of the accesses is a write, and
  • the accesses are not protected by locks or critical regions, and
  • the accesses are not synchronized, e.g. by a barrier.

Non-deterministic occurrence: e.g. the sequence of execution of parallel loop iterations is non-deterministic and may change from run to run.

In many cases private clauses, barriers or critical regions are missing.

Data races are hard to find using a traditional debugger.

  • Use tools like Intel Inspector XE, ThreadSanitizer, or Archer.
/* Example of Calc Pi */
double f(double x) {
    return (4.0 / (1.0 + x * x));
}

double CalcPi(int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;

#pragma omp parallel for private(fX, i, fSum) // fSum is privatized here: the shared fSum is never updated, so CalcPi returns 0
    for (i = 0; i < n; i++) {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
    }
    return fH * fSum;
}
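
A sketch of a corrected version (the name CalcPiFixed is just for illustration): fX stays private, the loop variable of a parallel for is private automatically, and the partial sums are combined across threads with a reduction clause.

double CalcPiFixed(int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;

#pragma omp parallel for private(fX) reduction(+: fSum)
    for (i = 0; i < n; i++) {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);    // each thread accumulates into its own copy;
                          // the copies are summed at the end of the loop
    }
    return fH * fSum;
}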

atomic directive

The statement inside an atomic construct must have one of the following forms:

x binop= expr
x++
++x
x--
--x

x is an lvalue of scalar type and binop is a non-overloaded built-in operator.
#pragma omp parallel
{
    double tmp, B;
    B = DOIT();
    tmp = big_ugly(B);

#pragma omp atomic
    X += tmp;    // the update of the shared variable X is performed atomically
}

The barrier directive

All tasks created by any thread of the current Team are guaranteed to be completed at barrier exit.

taskwait directive

A stand-alone directive

#pragma omp taskwait

Waits for the completion of the child tasks of the current task; only direct children, not all descendant tasks. It includes an implicit task scheduling point (TSP).
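
A sketch of the classic use case, a task-parallel Fibonacci, where the parent task must wait for its two direct children before combining their results.

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait    // waits only for the two child tasks created above
    return x + y;
}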


taskgroup directive

Attached to a structured block; waits for the completion of the child tasks created in the block and all of their descendants; TSP at the end.

#pragma omp taskgroup [clause[[,] clause]...]
{structured-block}

where clause can only be: task_reduction(reduction-identifier : list-items)
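
A sketch, assuming a hypothetical process_tree() that recursively creates further tasks; unlike taskwait, the taskgroup also waits for those nested descendants.

#pragma omp taskgroup
{
    #pragma omp task
    process_tree(root);   // may itself spawn more tasks
}   // waits here for process_tree() and every task it created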


Task Synchronization explained

(Figure: task synchronization example)

Loops with Tasks

The taskloop Construct

A task-generating construct: it decomposes a loop into chunks and creates a task for each loop chunk.

#pragma omp taskloop [clause[[,] clause]...]
{structured-for-loops}

where clause is one of the taskloop clauses, e.g. grainsize, num_tasks, or nogroup (see the sketch below).
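
A minimal sketch, assuming arrays a, b, c of length N; grainsize(500) is an arbitrary value asking for roughly 500 iterations per generated task.

#pragma omp parallel
#pragma omp single    // one thread creates the loop tasks, the team executes them
{
    #pragma omp taskloop grainsize(500)
    for (long i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
    }
}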

Taskloop decomposition approaches


If none of the chunk-size clauses (grainsize, num_tasks) is present, the number of chunks and the number of iterations per chunk are implementation defined.

Additional considerations:

  • The order of the creation of the loop tasks is unspecified
  • taskloop creates an implicit taskgroup region; with nogroup, no implicit taskgroup region is created.

Task scheduling

Default: tasks are tied to the thread that first executes them, which is not necessarily the creator.

Scheduling Constraints:

  • Only the thread a task is tied to can execute it
  • A task can only be suspended at task scheduling points: task creation, task finish, taskwait, barrier, taskyield
  • If a task is not suspended in a barrier, the executing thread can only switch to a direct descendant of all tasks tied to the thread.

Tasks created with the untied clause are never tied:

  • They may resume at task scheduling points, possibly on a different thread.
  • But: More freedom to the implementation, e.g., load balancing.

Unsafe use of untied Tasks

Problem: because untied tasks may migrate between threads at any task scheduling point, thread-centric constructs can yield unexpected results.

Remember when using untied tasks:

  • Avoid threadprivate variables
  • Avoid any use of thread IDs (e.g., omp_get_thread_num())
  • Be careful with critical regions and locks

Simple Solution

  • Create a tied task region with #pragma omp task if(0), as sketched below.
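
A sketch of this trick: the if(0) clause makes the inner task undeferred and (by default) tied, so the thread-ID-dependent part cannot migrate even though the surrounding task is untied. use_thread_buffer() is a hypothetical thread-ID-based routine.

#pragma omp task untied
{
    // ... work that may migrate between threads at scheduling points ...
    #pragma omp task if(0)   // undeferred, tied task: executed immediately, right here
    {
        int id = omp_get_thread_num();   // safe: this inner task stays on one thread
        use_thread_buffer(id);
    }
    // ... more migratable work ...
}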

The taskyield Directive

The taskyield directive specifies that the current task can be suspended in favor of the execution of a different task.

  • Hint to the runtime for optimization and/or deadlock prevention.
    #pragma omp taskyield
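
The classic deadlock-avoidance sketch, assuming an initialized omp_lock_t named lock and a hypothetical critical_work() function: while the task cannot acquire the lock, it yields so the thread can execute other tasks (possibly the one holding the lock) instead of spinning.

#pragma omp task
{
    while (!omp_test_lock(&lock)) {
        #pragma omp taskyield    // let the thread run another task in the meantime
    }
    critical_work();             // region protected by the lock
    omp_unset_lock(&lock);
}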

Tasks and Dependencies

Task dependencies constrain the execution order and execution times of tasks.
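
A sketch using the depend clause, with hypothetical produce()/consume() functions and a shared variable x: the consumer task is not started before the producer task has finished writing x.

int x = 0;
#pragma omp parallel
#pragma omp single
{
    #pragma omp task depend(out: x)
    x = produce();        // writes x

    #pragma omp task depend(in: x)
    consume(x);           // reads x: runs only after the producer task completed
}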

NUMA

Non-Uniform Memory Architecture: every core can access all of main memory, but access to memory attached to a remote socket is slower than access to the socket's local memory.

Get info on the system topology

Before you design a strategy for thread binding, you should have a basic understanding of the system topology. Use one of the following options on the target machine:

Intel MPI’s cpuinfo tool

  • module switch openmpi intelmpi
  • cpuinfo
  • Delivers information about the number of sockets (= packages) and the mapping of processor IDs used by the operating system to CPU cores.

hwloc’s tool

  • lstopo (command line: hwloc-ls)
  • Displays a graphical representation of the system topology, separated into NUMA nodes, along with the mapping of processor IDs used by the operating system to CPU cores and additional information on caches.

Decide for Binding Strategy

Selecting the right binding strategy depends not only on the topology, but also on the characteristics of your application.

Putting threads far apart, i.e. on different sockets:

  • May improve the aggregated memory bandwidth available to your application
  • May improve the combined cache size available to your application
  • May decrease performance of synchronization constructs

Putting threads close together, i.e. on two adjacent cores that possibly share some caches:

  • May improve performance of synchronization constructs
  • May decrease the available memory bandwidth and cache size

OpenMP 4.0: Places + Binding Policies (1/2)

Define OpenMP Places

  • Set of OpenMP threads running on one or more processors
  • Can be defined by the user, e.g. OMP_PLACES=cores

Define a set of OpenMP Thread Affinity Policies

  • SPREAD: spread OpenMP threads evenly among the places
  • CLOSE: pack OpenMP threads near master thread
  • MASTER: collocate OpenMP threads with the master thread

Goals

  • Give the user a way to specify where to execute OpenMP threads, for:
    • locality between OpenMP threads
    • less false sharing
    • memory bandwidth

Abstract names for OMP_PLACES:

  • threads: each place corresponds to a single hardware thread on the target machine.
  • cores: each place corresponds to a single core (having one or more hardware threads) on the target machine.
  • sockets: each place corresponds to a single socket (consisting of one or more cores) on the target machine.
  • ll_caches (5.1): each place corresponds to a set of cores that share the last-level cache.
  • numa_domains (5.1): each place corresponds to a set of cores whose closest memory is the same memory, at a similar distance from the cores.

OpenMP 4.0: Places + Binding Policies (2/2)

(Figure: example of places and binding policies)


This post is licensed under CC BY 4.0 by the author.