Title: GPU Computing
Author: All Makhtabekov
Course: Probability and Random Processes
Institution: Queen's University


Section 3: GPU Computing

Classes of Parallelism:
- there are two kinds of parallelism in applications:
  1) task-level parallelism, where independent tasks of work (and their instructions) are created that can operate independently and in parallel
  2) data-level parallelism, where there are many data items that can be operated on at the same time
- Flynn's taxonomy still works at the coarse level for classifying how this application parallelism is exploited by hardware

Task-level multiprocessor architectures:

a) Tightly coupled shared-memory multiprocessors based on a multicore chip (symmetric multiprocessors)
- uses the shared-memory MIMD model
- the shared memory has uniform memory latency
- multiple threads/processes work together to solve a problem, or multiple independent jobs/processes can be run
- the grain size is the amount of computation assigned to each thread

b) Distributed shared-memory (DSM) multiprocessors
- the memory is distributed among processors, still using the shared-memory MIMD model
- memory access latency is not uniform
- the processors are connected via direct (switched) or indirect (multi-hop) interconnection networks

c) Message-passing multiprocessors (clusters and MPPs)
- built from loosely coupled multiprocessors
- problems that need higher performance can be handled by a cluster: a set of independent servers/PCs connected over a fast network, functioning as a single large multiprocessor
- each machine has its own address space, so memory is distributed
- parallelism is expressed with message passing; no memory is shared
- hybrid paradigms are the preferred technology (MPI-OpenMP, MPI-PGAS, MPI-CUDA, MPI-OpenCL)

Challenges in Task-level Parallel Processing:
- scalability: we would like performance on shared-memory or message-passing multiprocessors to keep increasing as the number of processors increases; achieving this is a key challenge in developing efficient parallel programs
- other factors that affect performance are: partitioning, communication cost, synchronization cost, scheduling, load balancing, and designing efficient algorithms

- Amdahl's Law states that even a small fraction of sequential operations in a program can significantly limit the achievable speedup; Gustafson's Law alleviates this problem by considering larger workloads as the system is scaled up
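As a compact statement of the two laws (the symbols f for the parallelizable fraction and N for the number of processors are my own shorthand, not from the notes):

\[
S_{\text{Amdahl}}(N) = \frac{1}{(1 - f) + f/N},
\qquad
S_{\text{Gustafson}}(N) = (1 - f) + fN
\]

As N grows, Amdahl's speedup is bounded by 1/(1-f), while Gustafson's scaled speedup keeps growing because the parallel part of the workload grows with the machine.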

Data-Level Parallelism:
- SIMD architectures can exploit significant data-level parallelism (DLP) for matrix-oriented scientific computing and media-oriented image and sound processing
- SIMD is more energy efficient than MIMD because it only needs to fetch one instruction per data operation
- SIMD allows the programmer to continue thinking sequentially
- there are 3 key approaches to data-level parallelism: vector architectures, SIMD extensions, and Graphics Processing Units (GPUs)

a) Vector Architectures:
- a single instruction operates on vectors of data, which results in dozens of register-register operations on independent data elements
- sets of data elements are read into vector registers, operations are performed on those registers, and the results are dispersed back into memory
- the registers are controlled by the compiler and are used to hide memory latency and leverage memory bandwidth
- each register holds a 64-element vector, 64 bits per element
- the register file has 16 read ports and 8 write ports

- a vector unit can have multiple lanes, producing more than one element per clock cycle by using multiple functional units to improve the performance of a single vector add instruction, C = A + B
- with a single add pipeline, one addition completes per cycle
- in comparison, a 4-lane unit has four add pipelines and can complete 4 additions per cycle; the elements within a single vector add instruction are interleaved across the 4 pipelines

SIMD Extensions for Multimedia:
- SIMD multimedia extensions started with the observation that many media applications, such as audio and early graphics systems, operate on narrower data types than the 32-bit processors were optimized for
- by partitioning a 256-bit adder, the processor can perform simultaneous operations on short vectors of 32 8-bit operands, 16 16-bit operands, 8 32-bit operands, or 4 64-bit operands (see the sketch below)
- multimedia extensions fix the number of data operands in the opcode, which meant many instructions had to be added in extensions of the x86 architecture
- however, multimedia SIMD doesn't offer the more sophisticated addressing modes of vector architectures, namely strided accesses and gather-scatter accesses
- multimedia SIMD also doesn't offer the mask registers that support conditional execution of elements in vector architectures
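A minimal host-side sketch of the idea, assuming an x86 CPU with AVX2 and a compiler flag such as -mavx2 (the function name is illustrative):

#include <immintrin.h>
#include <stdint.h>

/* One 256-bit add instruction processes 8 32-bit operands at once
   (the same adder could be partitioned into 32 8-bit or 16 16-bit adds). */
void add8_int32(const int32_t *a, const int32_t *b, int32_t *c)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);  /* load 8 x 32-bit ints */
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vc = _mm256_add_epi32(va, vb);                /* 8 additions, one instruction */
    _mm256_storeu_si256((__m256i *)c, vc);                /* store 8 results */
}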

NVIDIA GPU ARCHITECTURES AND FEATURES:

GPU:
- given the hardware already invested to do graphics well, NVIDIA asked how it could be supplemented to improve the performance of a wider range of applications
- the basic idea is a heterogeneous execution model where the CPU is the host and the GPU is the device
- there's a C-like programming language (CUDA) for the GPU
- every form of GPU parallelism is expressed as CUDA threads
- this programming model is Single Instruction Multiple Thread (SIMT)

Similarities to vector machines:
- works well on data-level parallel problems
- has scatter-gather transfers, mask registers and large register files

Differences from vector machines:
- there's no scalar processor
- uses multithreading to hide memory latency
- has many functional units, as opposed to a few deeply pipelined units as in a vector processor

Design Philosophy:
- CPUs follow a latency-oriented design (large caches and sophisticated control to make individual threads fast), while GPUs follow a throughput-oriented design (many simple, heavily multithreaded cores to maximize total throughput)

GPU Architectures:
- a CUDA-capable GPU contains an array of highly threaded streaming multiprocessors (SMs)
- global memory functions as system memory for computation; it also serves as the frame buffer memory, holding video images and texture information for 3D rendering

NVIDIA Kepler new features:
- Dynamic Parallelism: allows a CUDA kernel to create new grids of threads by invoking new kernels. Before this was available, launching a new kernel required the host to terminate the current kernel on the GPU in order to invoke a new one. Applications using recursion or irregular loop structures benefit (see the sketch after these Kepler features)
- Hyper-Q: enables multiple CPU cores to launch work on a single GPU simultaneously, thereby increasing GPU utilization and reducing CPU idle time

- GPUDirect RDMA: a capability that enables GPUs within a node, or GPUs in different nodes, to exchange data directly without needing to go through CPU/system memory. This allows devices such as SSDs, NICs and InfiniBand adapters to directly access memory on multiple GPUs, thereby decreasing the latency of MPI send and receive messages to/from GPU memory
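A minimal sketch of the dynamic parallelism feature described above, with illustrative kernel names; it needs a device of compute capability 3.5 or newer and relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true (plus -lcudadevrt on older toolkits):

#include <cstdio>

__global__ void childKernel(int parent)
{
    // runs in a grid launched from the device, not from the host
    printf("child thread %d launched by parent thread %d\n", threadIdx.x, parent);
}

__global__ void parentKernel()
{
    // each parent thread launches a small child grid directly on the GPU,
    // without handing control back to the host
    childKernel<<<1, 4>>>(threadIdx.x);
}

int main()
{
    parentKernel<<<1, 2>>>();
    cudaDeviceSynchronize();   // waits for the parent grid and all of its children
    return 0;
}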

NVIDIA Pascal new features:
- NVLink: a high-speed, high-bandwidth interconnect for maximum application scalability. GPU-to-GPU data transfers reach up to 40 GB/s of bidirectional bandwidth, more than the roughly 32 GB/s of PCIe Gen3 x16

- HBM2: a fast, high-capacity, efficient stacked memory architecture. It offers 3x the memory bandwidth of the Maxwell GPU, allowing the P100 to tackle much larger working sets of data at higher bandwidth, improving efficiency and computational throughput while reducing the frequency of transfers from system memory
- Unified Memory: provides a single, unified virtual address space for CPU and GPU memory, simplifying GPU programming and the porting of applications to GPUs. Programmers no longer need to manage data sharing between two different virtual memory systems (see the sketch below)
- Compute Preemption: allows compute tasks to be preempted at instruction-level granularity rather than thread-block granularity, preventing long-running applications from monopolizing the system. Programmers no longer need to modify their long-running applications to co-exist well with other GPU applications
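A minimal Unified Memory sketch (the array size and kernel are illustrative): one pointer from cudaMallocManaged is dereferenced by both the CPU and the GPU, with no explicit cudaMemcpy:

#include <cstdio>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   // one pointer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;    // host writes directly, no cudaMemcpy
    scale<<<(n + 255) / 256, 256>>>(x, n);      // GPU reads/writes the same pointer
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", x[0]);                // host reads the result directly
    cudaFree(x);
    return 0;
}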

NVIDIA Volta new features:

a) an entirely new streaming multiprocessor (SM) architecture, including:
- 6 GPU processing clusters (GPCs), each having 14 SMs
- optimization for deep learning: up to 672 new Tensor Cores deliver up to 12x higher peak TFLOPS for training and 6x higher peak TFLOPS for inference
- 50% more energy efficient than Pascal
- independent parallel integer and floating-point data paths
- a new independent thread scheduling capability that enables finer-grain synchronization and cooperation between parallel threads
- a new combined L1 data cache and shared memory unit that significantly improves performance while also simplifying programming
- the Tensor Cores accelerate matrix-matrix multiplication (GEMM) operations, which are the core of neural network training and inferencing and are used to multiply large matrices of input data and weights in the connected layers of the network
- each Tensor Core operates on 4x4 matrices and performs the operation D = A x B + C, where D, A, B, C are 4x4 matrices; the matrix multiply inputs A and B are FP16 matrices, while the accumulation matrices C and D are either FP16 or FP32

- the CUDA API exposes specialized matrix load, matrix multiply-and-accumulate, and matrix store operations to use Tensor Cores efficiently (a short sketch of this API appears after the list of Volta features below); the cuBLAS and cuDNN libraries have been updated to use Tensor Cores for deep learning applications and frameworks such as Caffe2 and MXNet

b) Second-generation NVLink: the V100 provides higher link speeds (25 GB/s), more links per GPU (4 to 6), CPU mastering, cache coherence and greater scalability

c) Faster HBM2: the 16 GB HBM2 memory subsystem delivers 900 GB/s peak memory bandwidth. The combination of a new generation of HBM2 memory from Samsung and a new-generation memory controller in Volta provides much higher delivered memory bandwidth

d) Multi-Process Service (MPS): provides hardware acceleration of critical components of the CUDA MPS server, enabling improved performance and better quality of service for multiple compute applications sharing the GPU, and triples the maximum number of MPS clients to 48

e) Cooperative Groups and new Cooperative Launch APIs: a programming model introduced in CUDA 9 for organizing groups of communicating threads and expressing the granularity at which they communicate, helping programmers write richer, more efficient parallel decompositions. Basic Cooperative Groups functionality is supported on all NVIDIA GPUs since Kepler; Pascal and Volta include support for new cooperative launch APIs that support synchronization amongst CUDA thread blocks or groups of threads, and Volta adds support for new synchronization patterns

f) Copy Engine Enhancements: previously, the copy engines required both source and destination memory regions to be pinned. The Volta copy engines can generate page faults for addresses that aren't mapped into the page tables; the memory subsystem then services the page faults, mapping the addresses into the page table, after which the copy engine can perform the transfer. This is important for large multi-GPU/multi-CPU systems, because pinning memory for many copy-engine operations between multiple processors can reduce available memory

g) Volta-Optimized Software: new versions of deep learning frameworks (Caffe2, MXNet, CNTK, TensorFlow) deliver faster training times and higher multi-node training performance. Volta-optimized versions of GPU-accelerated libraries such as cuDNN, cuBLAS, and TensorRT leverage the new features of the architecture to deliver better performance for both deep learning inference and HPC applications, in CUDA Toolkit version 9.0
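Returning to the Tensor Core API mentioned under a): a minimal sketch of the warp-level WMMA interface (nvcuda::wmma, header mma.h), assuming Volta or newer (-arch=sm_70) and a 16x16x16 tile; the API exposes warp-level tiles even though each Tensor Core works on 4x4 blocks internally. One warp cooperatively computes D = A x B + C here:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Launch with exactly one warp, e.g. wmmaTile<<<1, 32>>>(dA, dB, dC);
__global__ void wmmaTile(const half *A, const half *B, float *C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);           // load a 16x16 FP16 tile of A
    wmma::load_matrix_sync(b_frag, B, 16);           // load a 16x16 FP16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c_frag = A*B + c_frag on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);  // write the FP32 result
}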

Introduction to Data Parallelism and CUDA:
- CPUs are better for sequential applications as well as task-level parallel applications
- a hybrid CPU/GPU execution model was enabled by NVIDIA's CUDA (Compute Unified Device Architecture) programming model
- this demand is also reflected in programming models such as OpenCL, OpenACC and C++ AMP
- CUDA kernels can be used to speed up the time-consuming parts of the code

Some of the challenges of scalability:
- time complexity
- data locality
- minimal contention for shared resources
- load balance
- high memory bandwidth
- performance tuning

The CUDA API introduces:
- a more generic parallel programming model based on C, with a hierarchy of parallel threads, barrier synchronization and atomic operations to dispatch and manage highly parallel computing work
- a hierarchical thread organization
- explicit GPU memory management
- an interface for launching parallel execution
- a mapping from thread indices to data indices

Data-Level Parallelism: Vector Addition
- the objective is to learn the key concepts involved in writing parallel programs for heterogeneous computing systems; a sketch of the vector addition example follows below
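A minimal sketch of the vector addition kernel (the A/B/C and vecAddKernel names follow the convention used later in these notes); the host-side memory setup appears under "CUDA Memory Model and Management" below:

// Device code: each thread computes one element of the output vector.
__global__ void vecAddKernel(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n)                                      // guard: the grid may have extra threads
        C[i] = A[i] + B[i];
}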

Execution of a CUDA C Program:
- each CUDA source file can contain a mixture of host and device code
- when a kernel function (device code) is launched, it's executed by a large number of threads on the device, the GPU
- the execution of each thread is sequential, but the threads run in parallel
- a thread consists of the program code, its own program counter, and the values of its variables and data structures
- grid: all the threads generated by one kernel launch

- the compilation process of a CUDA program: NVCC separates the source into host code, which is compiled by the host's standard C/C++ compiler, and device code, which is compiled into PTX/binary for the GPU

CUDA Memory Model and Management:
- host code can transfer data to/from the per-grid global memory on the device
- device code can read and write per-thread registers, and can read and write the global memory shared by all threads in the grid

- cudaMalloc(): allocates an object in the device global memory; takes 2 parameters: the address of a pointer to the allocated object, and the size of the allocated object in bytes
- cudaFree(): frees an object from device global memory; takes a pointer to the freed object as its parameter

- cudaMemcpy(): performs a memory data transfer; takes 4 parameters: pointer to destination, pointer to source, number of bytes to copy, and the type of transfer (host to host, host to device, device to host, device to device)
- it can't be used to copy between different GPUs in multi-GPU systems
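Putting the three calls together, a sketch of the host-side wrapper for the vecAddKernel sketched earlier (the 256-thread block size is an assumption; any reasonable block size works):

#include <cstddef>

void vecAdd(const float *h_A, const float *h_B, float *h_C, int n)
{
    size_t size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **)&d_A, size);                       // allocate device global memory
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);    // copy inputs host -> device
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);  // enough blocks to cover n

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);    // copy the result device -> host

    cudaFree(d_A);                                         // release device memory
    cudaFree(d_B);
    cudaFree(d_C);
}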

CUDA Kernel Functions and Threading:
- a CUDA kernel function specifies the code to be executed by all threads during a parallel phase
- the CUDA runtime system creates a grid of threads organized in a two-level hierarchy: each grid is organized as an array of blocks, and all blocks of a grid have the same number of threads (up to 1024)
- the number of threads in each block is specified at kernel launch and is available in the blockDim variable
- each thread in a block has a unique threadIdx value
- each thread has a unique global ID obtained by combining its threadIdx and blockIdx; a data element with index i can be addressed by: i = blockIdx.x * blockDim.x + threadIdx.x
- the .x implies that there are also .y and .z components

A CUDA kernel is executed by a grid (array) of threads:
- all threads in a grid run the same code, known as SPMD (single program, multiple data programming model)
- each thread has an ID that it uses to compute memory addresses and make control decisions

The thread array is divided into multiple blocks:
- threads within a block cooperate via shared memory, atomic operations and barrier synchronization
- threads in different blocks cannot cooperate; however, Cooperative Groups on Pascal and Volta GPUs enable synchronization of thread groups that span an entire kernel launch running on one or even multiple GPUs (see the sketch below)
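A minimal Cooperative Groups sketch (CUDA 9+, header cooperative_groups.h); the block-level group below works on any supported GPU, while the commented grid-wide sync additionally requires the cooperative launch API on Pascal/Volta:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void cgSketch(float *data)
{
    cg::thread_block block = cg::this_thread_block();  // the group of threads in this block

    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
    block.sync();                                      // barrier across the block,
                                                       // equivalent to __syncthreads()

    // Grid-wide synchronization is possible when the kernel is launched with
    // cudaLaunchCooperativeKernel():
    //   cg::grid_group grid = cg::this_grid();
    //   grid.sync();
}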

blockIdx and threadIdx:
- each thread uses these indices to determine which part of the data it has to work on
- each variable can be 1D, 2D or 3D
- this is helpful for memory addressing when processing multidimensional data

CUDA Function Declaration:
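The qualifier table itself is not reproduced in these notes; a brief runnable sketch of the three declaration qualifiers (the function names are illustrative):

__host__ __device__ float square(float x)  // compiled for both the host and the device
{
    return x * x;
}

__device__ float deviceOnly(float x)       // runs on the device, callable only from device code
{
    return square(x) + 1.0f;
}

__global__ void kernel(float *out)         // runs on the device, launched from the host
{                                          // with <<<grid, block>>>; must return void
    out[threadIdx.x] = deviceOnly((float)threadIdx.x);
}

int main()                                 // ordinary host code (__host__ is the default)
{
    float *d_out;
    cudaMalloc((void **)&d_out, 32 * sizeof(float));
    kernel<<<1, 32>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}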

Data-Parallel Execution Model

The CUDA thread organization:
- fine-grained, data-parallel threads are the fundamental means of parallel execution
- the threads launched by a kernel are organized in a 2-level hierarchy: a grid consists of 1 or more blocks, and each block consists of one or more threads
- all threads in a block share the same block index, accessible as the blockIdx variable in a kernel; each thread also has a thread index, accessible as the threadIdx variable
- when a thread executes a kernel function, references to these 2 variables return the coordinates of the thread
- the execution configuration parameters in a kernel launch statement specify the dimensions of the grid and the dimensions of each block in number of threads
- in general, a grid is a 3D array of blocks and each block is a 3D array of threads
- each configuration parameter has type dim3, a C struct with 3 unsigned integer fields: x, y and z; for 1D/2D grids and blocks, the unused dimensions are set to 1

Ex) suppose we launch the vecAddKernel() kernel in the host code to generate a 1D grid that consists of 128 blocks, each of which consists of 32 threads:

dim3 dimGrid(128, 1, 1);   //128 blocks
dim3 dimBlock(32, 1, 1);   //32 threads per block
vecAddKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, n);

- for 1D grids and blocks, the CUDA C compiler allows arithmetic expressions instead of dim3 values; it assumes that the y and z dimensions are 1:

vecAddKernel<<<ceil(n/32.0), 32>>>(d_A, d_B, d_C, n);

Ex)
dim3 dimGrid(2, 2, 1);    //4 blocks
dim3 dimBlock(4, 2, 2);   //16 threads per block

- each block is labeled with (blockIdx.y, blockIdx.x)
- each thread is labeled with (threadIdx.z, threadIdx.y, threadIdx.x)

Mapping Threads to Multidimensional Data:

Suppose we want to process a picture with a 2D grid:
- we have the choice of 1D, 2D or 3D thread organization, based on the nature of the data
- it's often convenient to use a 2D grid that consists of 2D blocks to process the pixels of a picture
- assume that 16x16 blocks are used, with 16 threads in the x direction and 16 in the y direction
- for a 76x62 picture, we will need 5 blocks in the x direction and 4 blocks in the y direction

- the threads that cover the pixels form the shaded area; the pixel processed by thread (0, 0) of block (1, 0) is located by the formulas: Row = blockIdx.y * blockDim.y + threadIdx.y, Col = blockIdx.x * blockDim.x + threadIdx.x
- we have 4 extra threads in the x direction and 2 extra threads in the y direction
- the picture-processing kernel function will have if statements, similar to vecAddKernel, to test whether the thread indices threadIdx.x and threadIdx.y fall within the valid range of pixels

- assume that the host code uses an int, n, to track the number of pixels in the x direction and another int, m, to track the number of pixels in the y direction
- also assume that the input picture data has been copied to the device memory and can be accessed through a pointer variable, d_Pin, with the output picture accessed through a pointer variable, d_Pout
- host code like the sketch below can be used to launch a 2D kernel, colourToGreyscaleConversion, to process the picture
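A sketch of that host code under the assumptions above (16x16 blocks, width n, height m); the kernel's parameter order here is my assumption, and its body is sketched further below:

__global__ void colourToGreyscaleConversion(unsigned char *Pout, unsigned char *Pin,
                                            int width, int height);  // sketched below

void launchGreyscale(unsigned char *d_Pout, unsigned char *d_Pin, int n, int m)
{
    // enough 16x16 blocks to cover every pixel; for the 76x62 example this
    // gives a 5x4 grid of blocks, i.e. 80x64 threads in total
    dim3 dimGrid((n + 15) / 16, (m + 15) / 16, 1);
    dim3 dimBlock(16, 16, 1);
    colourToGreyscaleConversion<<<dimGrid, dimBlock>>>(d_Pout, d_Pin, n, m);
}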

- if each colour pixel is described by a triple (R, G, B) of intensities for red, green, and blue, we use the luminosity method to convert each colour pixel to its greyscale counterpart
- the luminosity method forms a weighted average to account for human perception; we're more sensitive to green than to other colours, so green is weighted most heavily:

L = 0.21R + 0.72G + 0.07B

- we must understand how C statements access elements of dynamically allocated multidimensional arrays: in C, all multidimensional arrays are linearized because of the use of a flat memory space; in the 2D case, C uses a row-major layout
- the 1D index of the element at row Row and column Col of a 2D array of width Width is therefore: index = Row * Width + Col
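A sketch of the colourToGreyscaleConversion kernel built from the pieces above (2D thread indices, the row-major index, and the luminosity formula); the assumption that the input stores 3 bytes (R, G, B) per pixel is mine:

__global__ void colourToGreyscaleConversion(unsigned char *Pout, unsigned char *Pin,
                                            int width, int height)
{
    int Col = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column (x direction)
    int Row = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row (y direction)

    if (Col < width && Row < height) {                 // extra threads skip the body
        int greyOffset = Row * width + Col;            // row-major 1D index of the pixel
        int rgbOffset  = greyOffset * 3;               // assumed: 3 bytes (R, G, B) per pixel
        unsigned char r = Pin[rgbOffset];
        unsigned char g = Pin[rgbOffset + 1];
        unsigned char b = Pin[rgbOffset + 2];
        Pout[greyOffset] = (unsigned char)(0.21f * r + 0.72f * g + 0.07f * b);  // luminosity
    }
}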

- the execution of the colourToGreyscaleConversion() kernel over our 76x62 picture example generates 80x64 threads, which fall into 4 different areas:

Area 1: covers the threads whose Row and Col values are both within range; they process pixels normally
Area 2: the Row values of these threads are within range, but their Col values exceed the n value (76); the 4 extra threads per row in this area do not process any pixels
Area 3: the Col values of these threads are within range, but the 2 extra threads in each column fail the if statement and do not process any pixels
Area 4: both the Row and Col values of these threads exceed the valid ranges, so none of them process any pixels

