Program with 2 Processes:
Process S: 6 secs (sequential)
Process P: 60 secs (parallelizable)
how long with 1 core
how long with 2 cores
how long with 4 cores
what max possible speedup?
which law?
1 core: S+P = 60 + 6 = 66 secs
2 cores: S stays sequential, P divided over 2 cores —> 60/2 = 30 secs
max(6,30) = 30 secs
4 cores: S stays sequential, P divided over 4 cores —> 60/4 = 15 secs
max(6,15) = 15 secs
infinitely many cores: P —> 0 secs
total: max(0,6) = 6 secs
Speedup = original/new runtime = 66/6 = 11
Amdahl’s law
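A minimal C sketch of the model used in the answer above (S and P run as separate processes, so T(N) = max(S, P/N)); all names are my own illustration:

#include <stdio.h>

/* Sequential process S (6 s) and parallelizable process P (60 s) run
   concurrently, so the runtime on N cores is max(S, P/N). */
static double runtime(double S, double P, double N) {
    double t_par = P / N;
    return t_par > S ? t_par : S;
}

int main(void) {
    double S = 6.0, P = 60.0, T1 = S + P;          /* 66 s on one core */
    printf("2 cores: %g s\n", runtime(S, P, 2));   /* 30 s */
    printf("4 cores: %g s\n", runtime(S, P, 4));   /* 15 s */
    /* N -> infinity: P/N -> 0, so T = S = 6 s and speedup = 66/6 = 11 */
    printf("max speedup: %g\n", T1 / S);
    return 0;
}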
int s = 0;
int A[] = {2, 5};
#pragma omp parallel for num_threads(2)
for (int i = 0; i < 2; ++i) {
    s = s + A[i] * A[i];
}
are s, A, i private or shared?
What is the problem during execution?
give 2 solutions
s = shared (declared before the loop)
A = shared
i = private (loop variable of the parallel for)
Race condition on s: both threads modify it without synchronization
Solutions (see sketch below):
#pragma omp critical around the update of s
reduction(+:s) clause on the parallel for
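A minimal corrected version using the reduction clause (the other fix would wrap the update of s in #pragma omp critical instead):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int s = 0;
    int A[] = {2, 5};
    /* reduction gives each thread a private copy of s and combines them at the end */
    #pragma omp parallel for num_threads(2) reduction(+:s)
    for (int i = 0; i < 2; ++i) {
        s = s + A[i] * A[i];
    }
    printf("s = %d\n", s); /* 4 + 25 = 29 */
    return 0;
}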
Thread 1 runs this loop:
for (i = 0; i < 100; ++i) {
    a = a * i;
}
Thread 2 has variants:
- same as thread 1
- same as thread 1, b instead of a
- same as thread 1, c instead of a
in which variants race condition?
in which variants True sharing?
in which variants False sharing?
give 2 solutions for False sharing
Race condition (both threads change the same variable at the same time)
—> variant 1 (both threads write a)
True sharing (both threads write the same variable)
—> variant 1
False sharing (threads write different variables that lie on the same cache line)
—> variant 2: a and b are near each other in memory, in the same cache line
Solutions (see sketch below):
Cache-line padding: insert unused space between a and b —> b lands on a new cache line
Align each variable to its own cache line (e.g., 64 bytes)
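A minimal sketch of both fixes, assuming a 64-byte cache line (the struct, the variable names, and the loop bodies are my own illustration):

#include <stdalign.h>
#include <stdio.h>

/* Fix 1: padding — unused bytes push b onto a different cache line than a */
struct padded {
    long a;
    char pad[64 - sizeof(long)];
    long b;
};

/* Fix 2: alignment — each variable starts on its own 64-byte cache line */
static alignas(64) long a = 1;
static alignas(64) long b = 1;

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        for (long i = 0; i < 100; ++i) a = a * i;   /* thread 1 writes a */
        #pragma omp section
        for (long i = 0; i < 100; ++i) b = b * i;   /* thread 2 writes b */
    }
    printf("%ld %ld\n", a, b);
    return 0;
}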
5-Point Stencil Operation
320x320 Matrix, 16 Processes
Mapping opportunities: 320x20, 80x80
320x20 (every process gets 320x20)
every process has 2 boundary faces
communication per block: 2*320 = 640
total for 16: 640*16=10240
80x80
every block has 4 boundary faces
communication per block: 4*80 = 320
total for 16: 16*320 = 5120
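A small sketch of the same arithmetic, assuming (as the answer above does) that every block communicates across all of its faces:

#include <stdio.h>

int main(void) {
    int nprocs = 16;
    int strip = 2 * 320;   /* 320x20 strips: 2 faces of length 320 */
    int tile  = 4 * 80;    /* 80x80 tiles: 4 faces of length 80 */
    printf("320x20: %d per block, %d total\n", strip, nprocs * strip);  /* 640, 10240 */
    printf("80x80:  %d per block, %d total\n", tile,  nprocs * tile);   /* 320, 5120  */
    return 0;
}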
OpenMP Program
1st loop: not parallelized
2nd loop: parallelized, iterates over the array
what is the policy called that decides, on a NUMA system, in which memory node a variable is placed?
how can this be improved?
First-touch policy: the thread that first touches the data decides in which NUMA node it is placed
when the first loop is sequential —> all values end up in one NUMA node
parallelize the first loop as well —> data gets distributed across the NUMA nodes (see sketch below)
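A minimal sketch of the fix, assuming the first loop initializes an array (the array name and size are my own illustration); with the same static schedule each thread first-touches the pages it later works on:

#include <stdlib.h>

#define N 10000000

int main(void) {
    double *x = malloc(N * sizeof *x);

    /* parallel first touch: each page lands on the NUMA node of the
       thread that initializes it */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        x[i] = 0.0;

    /* same static schedule —> each thread mostly accesses local memory */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        x[i] = x[i] + i;

    free(x);
    return 0;
}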
Where is temporal locality?
Where is spatial locality?
temporal locality
the same memory location is accessed repeatedly within a short time
—> B
spatial locality
accesses go to memory locations that are close to each other
—> A
Where is Amdahl's law?
Where is Gustafson's law?
Amdahl's law
increasing the number of cores gives no linear performance gain
—> N = 1000
Gustafson's law
problem size increases, scaling with more cores improves significantly
—> N = 4000
6D hypercube and 3D Torus
how many nodes
how many edges
Bisection bandwidth?
6D hypercube
2^6 = 64 nodes
(6*2^6)/2 = 192 edges
2^(d-1) = 2^(6-1) = 32 links
3D Torus
4*4*4 = 64 nodes
(6*64)/2 = 192 edges
2 * 16 = 32 links
How to make more efficient?
dynamic scheduling —> improves load balancing
use dynamic instead of static
parallelize the inner loop as well —> better load distribution
collapse(2) before the schedule clause (see sketch below)
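A minimal sketch combining both suggestions; the loop bounds and body are placeholders of my own:

#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N][N];

    /* collapse(2) merges both loops into one iteration space;
       schedule(dynamic) hands out chunks as threads become free,
       which improves load balancing for irregular iteration costs */
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i][j] = (double)i * j;

    printf("%f\n", a[N - 1][N - 1]);
    return 0;
}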
What Principle explains Observed Behavior?
Strong scaling: problem size remains constant, the work is distributed over more cores —> reduces time
Amdahl's law: as the number of cores increases, the speedup is limited by the sequential fraction
What value does P0 read for v at operation D if there is no cache coherence?
What value does P1 load into its cache at operation E if there is a coherence protocol?
Explain update-based and invalidation-based cache coherence protocols
P0 still holds the old value v=10
P1 loads v=33
Update-based
write-update protocol —> all caches receive v=33 immediately
Invalidation-based
old copies are invalidated —> P1 fetches v=33 when needed
What is the arithmetic intensity of the app?
FLOPs per memory access
What are sparse matrices and where do they occur?
matrices where most elements zero
used: Scientific Computing, Machine Learning
sequential part: 40% —> s = 0.4
parallel part: 60% —> p = 0.6
number of processors: N = 6
what is the speedup with 6 processors?
what is the efficiency with 6 processors?
what is the max speedup if the number of processors is infinite?
Speedup with 6 processors
Amdahl's law: 1/(s + p/N) = 1/(0.4 + 0.6/6) = 1/(0.4 + 0.1) = 1/0.5 = 2
Efficiency with 6 processors
speedup / #processors = 2/6 ≈ 0.33
max speedup
1/(s + 0) = 1/0.4 = 2.5 (see sketch below)
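A small C check of the same formulas (speedup, efficiency, and the limit for infinitely many processors):

#include <stdio.h>

int main(void) {
    double s = 0.4, p = 0.6;   /* sequential and parallel fractions */
    int N = 6;

    double speedup     = 1.0 / (s + p / N);   /* Amdahl's law: 2.0   */
    double efficiency  = speedup / N;         /* 2/6 ≈ 0.33          */
    double max_speedup = 1.0 / s;             /* N -> infinity: 2.5  */

    printf("speedup=%.2f efficiency=%.2f max=%.2f\n",
           speedup, efficiency, max_speedup);
    return 0;
}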
What kind of scaling is shown in each plot?
Why do the lines deviate in the second one?
first: Weak Scaling
second: Strong Scaling
why do they deviate? —> overhead of communication + synchronization dominates
—> reduces efficiency
2 OpenMP worksharing constructs
2 approaches for mutual exclusion in OpenMP
OpenMP worksharing constructs
#pragma omp for - parallelizes loops
#pragma omp sections - runs different code blocks in parallel
mutual exclusion (see sketch below)
#pragma omp atomic
#pragma omp critical
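A minimal sketch showing both worksharing constructs and both mutual-exclusion approaches in one program (the work done inside is just a placeholder):

#include <stdio.h>

int main(void) {
    int sum = 0, count = 0;

    #pragma omp parallel
    {
        /* worksharing: distribute loop iterations among the threads */
        #pragma omp for
        for (int i = 0; i < 100; ++i) {
            #pragma omp atomic          /* mutual exclusion for a single update */
            sum += i;
        }

        /* worksharing: run independent code blocks in parallel */
        #pragma omp sections
        {
            #pragma omp section
            {
                #pragma omp critical    /* mutual exclusion for a whole block */
                { count += 1; }
            }
            #pragma omp section
            {
                #pragma omp critical
                { count += 2; }
            }
        }
    }
    printf("sum=%d count=%d\n", sum, count);
    return 0;
}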
OpenMP clauses
clauses for the 'for' directive?
name 2 possible clauses
explanation of clauses
when which clause?
clauses for the 'for' directive control how the loop iterations are handled between threads
schedule(dynamic,2)
divides the iterations into chunks of 2; when a thread finishes a chunk, it gets a new one
good when iterations have unequal runtime
reduction(+:sum)
every thread has a private copy; at the end all copies are combined
good for parallel accumulation (see sketch below)
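A minimal sketch using both clauses on one loop (the array contents are placeholders):

#include <stdio.h>

#define N 1000

int main(void) {
    double data[N], sum = 0.0;
    for (int i = 0; i < N; ++i) data[i] = i;

    /* chunks of 2 iterations are handed out dynamically;
       each thread accumulates into a private sum, combined at the end */
    #pragma omp parallel for schedule(dynamic, 2) reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum += data[i];

    printf("sum = %f\n", sum);   /* 0 + 1 + ... + 999 = 499500 */
    return 0;
}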
#include <stdio.h>
#include <omp.h>

void f(), g0(), g1(), g2(), h(); // assumed to be defined elsewhere

int main() {
    // Dummy variables to track dependencies
    int a = 0, b = 0, c = 0, d = 0, e = 0;

    #pragma omp parallel
    {
        #pragma omp single
        {
            // Task f → produces "a"
            #pragma omp task depend(out: a)
            f();

            // Task g1 → waits for "a", produces "b" and "c"
            #pragma omp task depend(in: a) depend(out: b, c)
            g1();

            // Task g0 → depends on "b", produces "d"
            #pragma omp task depend(in: b) depend(out: d)
            g0();

            // Task g2 → depends on "c", produces "e"
            #pragma omp task depend(in: c) depend(out: e)
            g2();

            // Task h → depends on "d" and "e"
            // (runs last; g0 and g2 may run concurrently after g1)
            #pragma omp task depend(in: d, e)
            h();
        }
    }
    return 0;
}
examples for MPI functions
two-sided
one-sided
collective
two-sided: MPI_Send(), MPI_Recv()
one-sided: MPI_Put(), MPI_Get()
collective: MPI_Bcast(), MPI_Reduce(), MPI_Gather()
4 MPI communication modes for two-sided point-to-point communications
standard MPI_Send()
synchronous MPI_Ssend()
buffered MPI_Bsend()
ready MPI_Rsend()
what are
MPI communicators
MPI tags
MPI window objects
group of processes that can communicate
integer labels used to identify messages
memory region that other processes can access directly, w/o matching Recv()
What is used in 2-sided, 1-sided and collective comm?
Communicators
Tags
Window Objects
None
All
2-sided: Communicators, Tags
1-sided: Communicators, Window Objects
collective: Communicators
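A minimal one-sided sketch showing where the communicator and the window object appear (run with at least two ranks; the buffer and the transferred value are placeholders):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* communicator */

    int buf = 0;
    MPI_Win win;                              /* window object: exposes buf */
    MPI_Win_create(&buf, sizeof buf, sizeof buf,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        /* write into rank 1's window without a matching Recv() */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1) printf("rank 1 got %d\n", buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}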
// Process 0
Send(1);
Recv(1);
// Process 1
Send(0);
Recv(0);
No deadlock, deadlock guaranteed, or deadlock possible?
Deadlock guaranteed!!
both processes block in Send, each waiting for the other to post its Receive
no deadlock!
one side is ready to receive, so the other's send can complete
Recv(2);
Send(2);
// Process 2
deadlock guaranteed!!
each process waits for another one to receive, forming a deadlock cycle
no deadlock
all start with Recv and block, but no Send is ever issued
the program hangs, but it is not a deadlock
What is the hierarchical structure of GPU hardware, CUDA, and OpenMP?
GPU
Device —> SMs —> Warps —> Threads
CUDA
Stream —> Grid —> Thread Block —> CUDA Threads
OpenMP
Master Thread —> Parallel Threads —> Worksharing constructs
MPI, OpenMP, PGAS, NUMA
MPI = distributed-memory system
each MPI process has its own memory space
well suited for using all resources across machines
OpenMP = shared-memory system
all threads access the same memory
PGAS = Partitioned Global Address Space
combines a global address space with partitioning
NUMA = Non-Uniform Memory Access
memory is physically distributed
OpenMP and PGAS are sensitive to memory locality
7D hypercube
diameter?
shortest route (0,0,1,1,0,1,0) to (0,0,1,1,1,0,1)
with an optimal broadcast, what are the minimum steps required to send data from (0, . . .,0) to all other nodes?
combine to 8D hypercube
how many additional nodes
how many additional links
how is the existing Gray-code scheme extended?
diameter: 7
shortest route: the addresses differ in 3 bit positions (bits 5, 6, 7) —> 3 hops minimum
min steps: 7
additional nodes:
7D: 2^7 = 128
8D: 2^8 = 256 —> additional: 128
additional links: 128 (each new node connects to one existing node)
add new leading bit: Prepend 0 to old, 1 to new nodes
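A tiny check of the route length: hypercube routing needs exactly one hop per differing address bit, so the Hamming distance gives the minimum number of steps (binary literals and __builtin_popcount are GCC/Clang extensions):

#include <stdio.h>

int main(void) {
    unsigned src = 0b0011010;   /* (0,0,1,1,0,1,0) */
    unsigned dst = 0b0011101;   /* (0,0,1,1,1,0,1) */
    int hops = __builtin_popcount(src ^ dst);
    printf("minimum hops: %d\n", hops);   /* 3 */
    return 0;
}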
Blocking and non-Blocking communication
Blocking
does not return until the operation has completed / the buffer is safe to reuse
MPI_Send(), MPI_Recv(), …
can easily deadlock
Non-Blocking
returns immediately
MPI_Isend(), MPI_Irecv()
must call MPI_Wait() later
—> helps avoid deadlock (see sketch below)
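A minimal non-blocking exchange sketch for exactly two ranks (buffers and tag are placeholders): both ranks post Irecv and Isend first and then wait, so the symmetric exchange cannot deadlock.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int other = 1 - rank;              /* assumes exactly 2 ranks */
    int sendbuf = rank, recvbuf = -1;
    MPI_Request reqs[2];

    MPI_Irecv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %d\n", rank, recvbuf);
    MPI_Finalize();
    return 0;
}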
What is computational intensity?
what is high/low intensity?
number of floating-point operations (FLOPs) per byte of memory accessed
—> tells us how much work a program does per memory access
high intensity —> compute-bound —> GPU friendly
low intensity —> memory-bound —> limited by memory bandwidth (see example below)
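A small worked example, assuming double precision: an AXPY-style update does 2 FLOPs per iteration but moves about 24 bytes, so its intensity is roughly 0.08 FLOP/byte, i.e., memory-bound.

#include <stdio.h>

/* y[i] = a * x[i] + y[i]: 2 FLOPs per iteration, ~24 bytes moved
   (load x[i], load y[i], store y[i]) —> 2/24 ≈ 0.083 FLOP/byte —> memory-bound */
static void axpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    double x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    axpy(4, 2.0, x, y);
    printf("%f\n", y[3]);   /* 8.0 */
    return 0;
}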
HPC main approaches
Shared memory
OpenMP, pthreads, ..
threads run on multiple cores of same machine
data is shared implicitly
distributed memory
MPI
processes can run on different machines/nodes
must explicitly send/recv data
is this safe? how to fix it?
int x = 0;
#pragma omp parallel for
for (int i = 0; i < 4; i++) {
    x += i;
}
not safe —> race condition on x; fix with a reduction:
#pragma omp parallel for reduction(+:x)
undeferred and deferred tasks in OpenMP
Undeferred task
the encountering task is suspended and the new task is executed immediately
(by some thread)
Deferred task
the encountering task continues; execution of the new task may be deferred
(picked up later by any thread of the team; see sketch below)
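A minimal sketch, assuming the usual way to force an undeferred task via an if(0) clause (the print statements are placeholders):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    #pragma omp single
    {
        /* deferred: may be queued and executed later by any thread of the team */
        #pragma omp task
        printf("deferred task, thread %d\n", omp_get_thread_num());

        /* if(0) makes the task undeferred: the encountering task is suspended
           until this task has completed */
        #pragma omp task if(0)
        printf("undeferred task, thread %d\n", omp_get_thread_num());

        printf("single region continues after the undeferred task finished\n");
    }
    return 0;
}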