Program with 2 Processes:
Process S: 6 secs (sequential)
Process P: 60 secs (parallelizable)
how long with 1 core
how long with 2 cores
how long with 4 cores
what max possible speedup?
which law?
1 core: S+P = 60 + 6 = 66 secs
2 cores: S stays sequential, P divided over 2 cores —> 60/2 = 30 secs
max(6,30) = 30 secs
4 cores: S stays sequential, P divided over 4 cores —> 60/4 = 15 secs
max(6,15) = 15 secs
infinitely many cores: P —> 0 secs
total: max(0,6) = 6 secs
Speedup = original/new runtime = 66/6 = 11
Amdahl’s law
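A minimal C sketch of the model used in the answer above (S and P run as separate processes, so T(N) = max(S, P/N)); all names are my own illustration:

#include <stdio.h>

/* Sequential process S (6 s) and parallelizable process P (60 s) run
   concurrently, so the runtime on N cores is max(S, P/N). */
static double runtime(double S, double P, double N) {
    double t_par = P / N;
    return t_par > S ? t_par : S;
}

int main(void) {
    double S = 6.0, P = 60.0, T1 = S + P;          /* 66 s on one core */
    printf("2 cores: %g s\n", runtime(S, P, 2));   /* 30 s */
    printf("4 cores: %g s\n", runtime(S, P, 4));   /* 15 s */
    /* N -> infinity: P/N -> 0, so T = S = 6 s and speedup = 66/6 = 11 */
    printf("max speedup: %g\n", T1 / S);
    return 0;
}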
int s = 0;
int A[] = {2, 5};
#pragma omp parallel for num_threads(2)
for (int i = 0; i < 2; ++i) {
    s = s + A[i] * A[i];
}
are s, A, i private or shared?
What is the problem during execution?
give 2 solutions
s = shared (declared before the loop)
A = shared
i = private (loop variable of the parallel for)
Race condition on s: both threads modify it without synchronization
Solutions (see sketch below):
#pragma omp critical around the update of s
reduction(+:s) clause on the parallel for
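A minimal corrected version using the reduction clause (the other fix would wrap the update of s in #pragma omp critical instead):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int s = 0;
    int A[] = {2, 5};
    /* reduction gives each thread a private copy of s and combines them at the end */
    #pragma omp parallel for num_threads(2) reduction(+:s)
    for (int i = 0; i < 2; ++i) {
        s = s + A[i] * A[i];
    }
    printf("s = %d\n", s); /* 4 + 25 = 29 */
    return 0;
}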
Thread 1 runs this loop:
for (i = 0; i < 100; ++i) {
    a = a * i;
}
Thread 2 has variants:
- same as thread 1
- same as thread 1, b instead of a
- same as thread 1, c instead of a
in which variants race condition?
in which variants True sharing?
in which variants False sharing?
give 2 solutions for False sharing
Race condition (both threads change the same variable at the same time)
—> variant 1 (both threads write a)
True sharing (both threads write the same variable)
—> variant 1
False sharing (threads write different variables that lie on the same cache line)
—> variant 2: a and b are near each other in memory, in the same cache line
Solutions (see sketch below):
Cache-line padding: insert unused space between a and b —> b lands on a new cache line
Align each variable to its own cache line (e.g., 64 bytes)
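A minimal sketch of both fixes, assuming a 64-byte cache line (the struct, the variable names, and the loop bodies are my own illustration):

#include <stdalign.h>
#include <stdio.h>

/* Fix 1: padding — unused bytes push b onto a different cache line than a */
struct padded {
    long a;
    char pad[64 - sizeof(long)];
    long b;
};

/* Fix 2: alignment — each variable starts on its own 64-byte cache line */
static alignas(64) long a = 1;
static alignas(64) long b = 1;

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        for (long i = 0; i < 100; ++i) a = a * i;   /* thread 1 writes a */
        #pragma omp section
        for (long i = 0; i < 100; ++i) b = b * i;   /* thread 2 writes b */
    }
    printf("%ld %ld\n", a, b);
    return 0;
}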
5-Point Stencil Operation
320x320 Matrix, 16 Processes
Mapping opportunities: 320x20, 80x80
320x20 (every process gets 320x20)
every process has 2 boundary faces
communication per block: 2*320 = 640
total for 16: 640*16=10240
80x80
every block has 4 boundary faces
communication per block: 4*80 = 320
total for 16: 16*320 = 5120
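A small sketch of the same arithmetic, assuming (as the answer above does) that every block communicates across all of its faces:

#include <stdio.h>

int main(void) {
    int nprocs = 16;
    int strip = 2 * 320;   /* 320x20 strips: 2 faces of length 320 */
    int tile  = 4 * 80;    /* 80x80 tiles: 4 faces of length 80 */
    printf("320x20: %d per block, %d total\n", strip, nprocs * strip);  /* 640, 10240 */
    printf("80x80:  %d per block, %d total\n", tile,  nprocs * tile);   /* 320, 5120  */
    return 0;
}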
OpenMP Program
1st loop: not parallelized
2nd loop: parallelized, iterates over the array
what is the policy called that decides, on a NUMA system, in which memory node a variable is placed?
how can this be improved?
First-touch policy: the thread that first touches the data decides in which NUMA node it is placed
when the first loop is sequential —> all values end up in one NUMA node
parallelize the first loop as well —> data gets distributed across the NUMA nodes (see sketch below)
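A minimal sketch of the fix, assuming the first loop initializes an array (the array name and size are my own illustration); with the same static schedule each thread first-touches the pages it later works on:

#include <stdlib.h>

#define N 10000000

int main(void) {
    double *x = malloc(N * sizeof *x);

    /* parallel first touch: each page lands on the NUMA node of the
       thread that initializes it */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        x[i] = 0.0;

    /* same static schedule —> each thread mostly accesses local memory */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        x[i] = x[i] + i;

    free(x);
    return 0;
}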
Where is temporal locality?
Where is spatial locality?
temporal locality
the same memory location is accessed repeatedly within a short time
—> B
spatial locality
accesses go to memory locations that are close to each other
—> A
Where is Amdahl's law?
Where is Gustafson's law?
Amdahl's law
increasing the number of cores gives no linear performance gain
—> N = 1000
Gustafson's law
problem size increases, scaling with more cores improves significantly
—> N = 4000
6D hypercube and 3D Torus
how many nodes
how many edges
Bisection bandwidth?
6D hypercube
2^6 = 64 nodes
(6*2^6)/2 = 192 edges
2^(d-1) = 2^(6-1) = 32 links
3D Torus
4*4*4 = 64 nodes
(6*64)/2 = 192 edges
2 * 16 = 32 links
How to make more efficient?
dynamic scheduling —> improves load balancing
use dynamic instead of static
parallelize the inner loop as well —> better load distribution
collapse(2) before the schedule clause (see sketch below)
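A minimal sketch combining both suggestions; the loop bounds and body are placeholders of my own:

#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N][N];

    /* collapse(2) merges both loops into one iteration space;
       schedule(dynamic) hands out chunks as threads become free,
       which improves load balancing for irregular iteration costs */
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = (double)i * j;

    printf("%f\n", a[N - 1][N - 1]);
    return 0;
}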
What Principle explains Observed Behavior?
Strong scaling: problem size remains constant, the work is distributed over more cores —> reduces time
Amdahl's law: as the number of cores increases, the speedup is limited by the sequential fraction
What value does P0 read for v at operation D if there is no cache coherence?
What value does P1 load into its cache at operation E if there is a coherence protocol?
Explain update-based and invalidation-based cache coherence protocols
P0 still holds the old value v=10
P1 loads v=33
Update-based
write-update protocol —> all caches receive v=33 immediately
Invalidation-based
old copies are invalidated —> P1 fetches v=33 when needed
What is the arithmetic intensity of the app?
FLOPs per memory access
What are sparse matrices and where do they occur?
matrices where most elements zero
used: Scientific Computing, Machine Learning
sequential part: 40% —> s = 0.4
parallel part: 60% —> p = 0.6
number of processors: N = 6
what is the speedup with 6 processors?
what is the efficiency with 6 processors?
what is the max speedup if the number of processors is infinite?
Speedup with 6 processors
Amdahl's law: 1/(s + p/N) = 1/(0.4 + 0.6/6) = 1/(0.4 + 0.1) = 1/0.5 = 2
Efficiency with 6 processors
speedup / #processors = 2/6 ≈ 0.33
max speedup
1/(s + 0) = 1/0.4 = 2.5 (see sketch below)
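A small C check of the same formulas (speedup, efficiency, and the limit for infinitely many processors):

#include <stdio.h>

int main(void) {
    double s = 0.4, p = 0.6;   /* sequential and parallel fractions */
    int N = 6;

    double speedup     = 1.0 / (s + p / N);   /* Amdahl's law: 2.0   */
    double efficiency  = speedup / N;         /* 2/6 ≈ 0.33          */
    double max_speedup = 1.0 / s;             /* N -> infinity: 2.5  */

    printf("speedup=%.2f efficiency=%.2f max=%.2f\n",
           speedup, efficiency, max_speedup);
    return 0;
}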
What kind of scaling is shown in each plot?
Why do the lines deviate in the second one?
first: Weak Scaling
second: Strong Scaling
why do they deviate? —> overhead of communication + synchronization dominates
—> reduces efficiency
2 OpenMP worksharing constructs
2 approaches for mutual exclusion in OpenMP
OpenMP worksharing constructs
#pragma omp for - parallelizes loops
#pragma omp sections - runs different code blocks in parallel
mutual exclusion (see sketch below)
#pragma omp atomic
#pragma omp critical
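A minimal sketch showing both worksharing constructs and both mutual-exclusion approaches in one program (the work done inside is just a placeholder):

#include <stdio.h>

int main(void) {
    int sum = 0, count = 0;

    #pragma omp parallel
    {
        /* worksharing: distribute loop iterations among the threads */
        #pragma omp for
        for (int i = 0; i < 100; ++i) {
            #pragma omp atomic          /* mutual exclusion for a single update */
            sum += i;
        }

        /* worksharing: run independent code blocks in parallel */
        #pragma omp sections
        {
            #pragma omp section
            {
                #pragma omp critical    /* mutual exclusion for a whole block */
                { count += 1; }
            }
            #pragma omp section
            {
                #pragma omp critical
                { count += 2; }
            }
        }
    }
    printf("sum=%d count=%d\n", sum, count);
    return 0;
}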
OpenMP clauses
clauses for the 'for' directive?
name 2 possible clauses
explanation of clauses
when which clause?
clauses for the 'for' directive control how the loop iterations are handled between threads
schedule(dynamic,2)
divides the iterations into chunks of 2; when a thread finishes a chunk, it gets a new one
good when iterations have unequal runtime
reduction(+:sum)
every thread has a private copy; at the end all copies are combined
good for parallel accumulation (see sketch below)
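A minimal sketch using both clauses on one loop (the array contents are placeholders):

#include <stdio.h>

#define N 1000

int main(void) {
    double data[N], sum = 0.0;
    for (int i = 0; i < N; ++i) data[i] = i;

    /* chunks of 2 iterations are handed out dynamically;
       each thread accumulates into a private sum, combined at the end */
    #pragma omp parallel for schedule(dynamic, 2) reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += data[i];

    printf("sum = %f\n", sum);   /* 0 + 1 + ... + 999 = 499500 */
    return 0;
}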
#include <stdio.h>
#include <omp.h>

void f(), g0(), g1(), g2(), h(); // assumed to be defined elsewhere

int main() {
    // Dummy variables to track dependencies
    int a = 0, b = 0, c = 0, d = 0, e = 0;

    #pragma omp parallel
    {
        #pragma omp single
        {
            // Task f → produces "a"
            #pragma omp task depend(out: a)
            f();

            // Task g1 → waits for "a", produces "b" and "c"
            #pragma omp task depend(in: a) depend(out: b, c)
            g1();

            // Task g0 → depends on "b", produces "d"
            #pragma omp task depend(in: b) depend(out: d)
            g0();

            // Task g2 → depends on "c", produces "e"
            #pragma omp task depend(in: c) depend(out: e)
            g2();

            // Task h → depends on "d" and "e"
            // (runs last; g0 and g2 may run concurrently after g1)
            #pragma omp task depend(in: d, e)
            h();
        }
    }
    return 0;
}
examples for MPI functions
two-sided
one-sided
collective
two-sided: MPI_Send(), MPI_Recv()
one-sided: MPI_Put(), MPI_Get()
collective: MPI_Bcast(), MPI_Reduce(), MPI_Gather()
4 MPI communication modes for two-sided point-to-point communications
standard MPI_Send()
synchronous MPI_Ssend()
buffered MPI_Bsend()
ready MPI_Rsend()
what are
MPI communicators
MPI tags
MPI window objects
group of processes that can communicate
integer labels used to identify messages
memory region that other processes can access directly, w/o matching Recv()
What is used in 2-sided, 1-sided and collective comm?
Communicators
Tags
Window Objects
None
All
2-sided: Communicators, Tags
1-sided: Communicators, Window Objects
collective: Communicators
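A minimal one-sided sketch showing where the communicator and the window object appear (run with at least two ranks; the buffer and the transferred value are placeholders):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* communicator */

    int buf = 0;
    MPI_Win win;                              /* window object: exposes buf */
    MPI_Win_create(&buf, sizeof buf, sizeof buf,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        /* write into rank 1's window without a matching Recv() */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1) printf("rank 1 got %d\n", buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}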
// Process 0
Send(1);
Recv(1);
// Process 1
Send(0);
Recv(0);
No deadlock, deadlock guaranteed, or deadlock possible?
Deadlock guaranteed!!
both processes block in Send, each waiting for the other to post its Receive
no deadlock!
one side is ready to receive, so the other's send can complete
Recv(2);
Send(2);
// Process 2
deadlock guaranteed!!
each process waits for another one to receive, forming a deadlock cycle
no deadlock
all start with Recv and block, but no Send is ever issued
the program hangs, but it is not a deadlock
What is the hierarchical structure of GPU hardware, CUDA, and OpenMP?
GPU
Device —> SMs —> Warps —> Threads
CUDA
Stream —> Grid —> Thread Block —> CUDA Threads
OpenMP
Master Thread —> Parallel Threads —> Worksharing constructs
MPI, OpenMP, PGAS, NUMA
MPI = distributed-memory system
each MPI process has its own memory space
well suited for using all resources across machines
OpenMP = shared-memory system
all threads access the same memory
PGAS = Partitioned Global Address Space
combines a global address space with partitioning
NUMA = Non-Uniform Memory Access
memory is physically distributed
OpenMP and PGAS are sensitive to memory locality
7D hypercube
diameter?
shortest route (0,0,1,1,0,1,0) to (0,0,1,1,1,0,1)
with an optimal broadcast, what are the minimum steps required to send data from (0, . . .,0) to all other nodes?
combine to 8D hypercube
how many additional nodes
how many additional links
how is the existing Gray-code scheme extended?
diameter: 7
shortest route: the addresses differ in 3 bit positions (bits 5, 6, 7) —> 3 hops minimum
min steps: 7
additional nodes:
7D: 2^7 = 128
8D: 2^8 = 256 —> additional: 128
additional links: 128 (each new node connects to one existing node)
add new leading bit: Prepend 0 to old, 1 to new nodes
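A tiny check of the route length: hypercube routing needs exactly one hop per differing address bit, so the Hamming distance gives the minimum number of steps (binary literals and __builtin_popcount are GCC/Clang extensions):

#include <stdio.h>

int main(void) {
    unsigned src = 0b0011010;   /* (0,0,1,1,0,1,0) */
    unsigned dst = 0b0011101;   /* (0,0,1,1,1,0,1) */
    int hops = __builtin_popcount(src ^ dst);
    printf("minimum hops: %d\n", hops);   /* 3 */
    return 0;
}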
Blocking and non-Blocking communication
Blocking
does not return until the operation has completed / the buffer is safe to reuse
MPI_Send(), MPI_Recv(), …
can easily deadlock
Non-Blocking
returns immediately
MPI_Isend(), MPI_Irecv()
must call MPI_Wait() later
—> helps avoid deadlock (see sketch below)
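A minimal non-blocking exchange sketch for exactly two ranks (buffers and tag are placeholders): both ranks post Irecv and Isend first and then wait, so the symmetric exchange cannot deadlock.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int other = 1 - rank;              /* assumes exactly 2 ranks */
    int sendbuf = rank, recvbuf = -1;
    MPI_Request reqs[2];

    MPI_Irecv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %d\n", rank, recvbuf);
    MPI_Finalize();
    return 0;
}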
What is computational intensity?
what is high/low intensity?
number of floating-point operations (FLOPs) per byte of memory accessed
—> tells us how much work a program does per memory access
high intensity —> compute-bound —> GPU friendly
low intensity —> memory-bound —> limited by memory bandwidth (see example below)
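A small worked example, assuming double precision: an AXPY-style update does 2 FLOPs per iteration but moves about 24 bytes, so its intensity is roughly 0.08 FLOP/byte, i.e., memory-bound.

#include <stdio.h>

/* y[i] = a * x[i] + y[i]: 2 FLOPs per iteration, ~24 bytes moved
   (load x[i], load y[i], store y[i]) —> 2/24 ≈ 0.083 FLOP/byte —> memory-bound */
static void axpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    double x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    axpy(4, 2.0, x, y);
    printf("%f\n", y[3]);   /* 8.0 */
    return 0;
}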
HPC main approaches
Shared memory
OpenMP, pthreads, ..
threads run on multiple cores of same machine
data is shared implicitly
distributed memory
MPI
processes can run on different machines/nodes
must explicitly send/recv data
is this safe? how to fix it?
int x = 0;
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
    x += i;
}
not safe —> race condition on x; fix with a reduction:
#pragma omp parallel for reduction(+:x)
undeferred and deferred tasks in OpenMP
Undeferred task
the encountering task is suspended and the new task is executed immediately
(by some thread)
Deferred task
the encountering task continues; execution of the new task may be deferred
(picked up later by any thread of the team; see sketch below)
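A minimal sketch, assuming the usual way to force an undeferred task via an if(0) clause (the print statements are placeholders):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    #pragma omp single
    {
        /* deferred: may be queued and executed later by any thread of the team */
        #pragma omp task
        printf("deferred task, thread %d\n", omp_get_thread_num());

        /* if(0) makes the task undeferred: the encountering task is suspended
           until this task has completed */
        #pragma omp task if(0)
        printf("undeferred task, thread %d\n", omp_get_thread_num());

        printf("single region continues after the undeferred task finished\n");
    }
    return 0;
}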