CUDA?
Compute Unified Device Architecture
What are the CUDA function qualifiers?
__global__ = executed on GPU, invoked from host (CPU), cannot be called from device (GPU)
__device__ = executed on GPU, called from other GPU functions, cannot be called from host (CPU)
__host__ = only executed by CPU, called from host (the default if no qualifier is given)
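The three qualifiers can be illustrated in one small CUDA sketch (assumed example names, not from the source; needs nvcc and a GPU to run):

```cuda
__device__ float square(float x) {          // GPU-only helper, callable from kernels
    return x * x;
}

__global__ void squareAll(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) out[i] = square(in[i]);      // __global__ may call __device__
}

__host__ void launch(float *d_out, const float *d_in, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    squareAll<<<blocks, threads>>>(d_out, d_in, n);  // invoked from the host
}
```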
What is the CUDA execution hierarchy?
Stream: list of Grids that execute in-order
Grid: consists of up to 2^32 Thread Blocks
Thread Block: consists of up to 1024 CUDA threads
CUDA Thread: scalar execution context (individual worker)
CUDA Thread vs CPU Thread
a CUDA thread is not really a thread: it is a single iteration in the iteration space (grid) of a vectorizable loop
What is PGAS? Name some languages and libraries.
Partitioned Global Address Space
-> shared data is divided into local and remote parts
languages: Chapel, Coarray Fortran, …
libraries: Global Arrays (GA), GASPI, MPI-3.0 RMA, …
What is UPC and what are the basic forms of barriers?
Unified Parallel C: an extension to C implementing the PGAS model
basic forms of barriers:
Barrier: block until all other threads arrive (upc_barrier)
Split-phase barrier: upc_notify, upc_wait
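Both barrier forms can be sketched in a few lines of UPC (a hedged example, needs a UPC compiler such as Berkeley UPC; the array name is assumed):

```c
#include <upc.h>
#include <stdio.h>

shared int data[THREADS];        /* one element per UPC thread */

int main(void) {
    data[MYTHREAD] = MYTHREAD;

    upc_barrier;                 /* blocking: wait until all threads arrive */

    /* Split-phase: signal arrival, overlap independent local work, then wait. */
    upc_notify;
    printf("thread %d doing independent work\n", MYTHREAD);
    upc_wait;

    return 0;
}
```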
UPC Pointer?
P1-P4 classify a pointer by where the pointer itself resides (private or shared) and where it points (local or shared memory):
int *P1 -> private pointer to local memory
shared int *P2 -> private pointer to shared space
int *shared P3 -> shared pointer to local memory (not recommended)
shared int *shared P4 -> shared pointer to shared space
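The four flavours can be written down directly as UPC declarations (a sketch; needs a UPC compiler, the variable names are assumed):

```c
#include <upc.h>

shared int A[THREADS];       /* shared array, one element per thread */
int x;                       /* private local variable */

int *P1 = &x;                /* private pointer to local memory */
shared int *P2 = A;          /* private pointer to shared space */
int *shared P3;              /* shared pointer to local memory (not recommended) */
shared int *shared P4 = A;   /* shared pointer to shared space */

int main(void) { return 0; }
```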
What is DASH and what are some implementations of its runtime (DART)?
a C++ template library implementing the PGAS model
implementations:
DART-SHMEM: shared memory based
DART-CUDA: supports GPUs, based on DART-SHMEM
DART-GASPI: initial implementation, using GASPI
DART-MPI: MPI-3 RMA based ‘workhorse’ implementation
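A minimal sketch of DASH usage on top of one of these DART backends (API names such as dash::init, dash::Array, and dash::myid are assumed from the dash-project library and not verified against a specific version):

```cpp
#include <libdash.h>

int main(int argc, char *argv[]) {
    dash::init(&argc, &argv);         // initialize the DART runtime

    dash::Array<int> arr(100);        // array partitioned across all units
    if (dash::myid() == 0)
        arr[99] = 42;                 // global access, may be a remote write

    arr.barrier();                    // wait until the write is visible
    dash::finalize();
    return 0;
}
```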