Solutions Assignment #3 PDF

Title Solutions Assignment #3
Course high performance computing architecture
Institution Memorial University of Newfoundland
Pages 15
File Size 195.9 KB
File Type PDF
Total Downloads 13
Total Views 136

Summary

assignment solutions 3...


Description

ENGR9861 High Performance Computer Architecture Assignment #3 Solution :: RV & AO Jul. 31, 2019

Question 0

The following is a program written in RV64V. Carefully go through the code and

concisely describe what is being accomplished here. Please do not write an instruction-by-instruction explanation. Remember that the vector length register (vl) is defaulted to the maximum vector length, that is, the number of elements in each vector register. In this problem mvl is 64. Calculate the arithmetic intensity. (8 marks) vsetdcfg

6 SP FP

# six 32-bit FP vector registers

add

x1, x0, x0

# initialize x1 to 0

add

x2, x0, 1200

# loop count n

x3, x2

# vl = x3 = min(mvl, n)

vld

v1, a_re+x1

# load a_re (memory address is a_re + contents of x1)

vld

v3, b_re+x1

# load b_re

vmul

v5, v1,v3

# a_re*b_re

vld

v2, a_im+x1 # load a_im

vld

v4, b_im+x1 # load b_im

vmul

v6, v2, v4

# a_im*b_im

vsub

v5, v5, v6

# a_re*b_re – a_im*b_im

vst

v5, c_re+x1

# store c_re

vmul

v5, v1, v4

# a_re*b_im

vmul

v6, v2, v3

# a_im*b_re

vadd

v5, v5, v6

# a_re*b_im + a_im*b_re

vst

v5, c_im+x1 # store c_im

slli

x4, x3, 2

# x4 = vl * 4 (in bytes)

add

x1, x1, x4

# update the pointer to memory addresses

sub

x2, x2, x3

# update remaining count: n -= vl

bgt

x2, x0, loop

# repeat if n>0

loop: setvl

vdisable

(a) The code peforms the following: for (i=0; i0

loop: setvl

vdisable

Question 4

A vector processor consists of 32 vector registers, each of which holds 64 elements;

each element is 64 bits wide. The processor contains two vector load units, one vector store unit, one vector FP add/sub unit, one vector FP multiply unit, one vector FP divide unit, one integer unit and one logical unit. All functional units can work with DPFP numbers, are fully pipelined, and employ two-lane architecture. Each vector as well as scalar register is connected to each functional unit using a crossbar network. Determine the number of convoys, chaining required in each convoy, chime and computation time (in clock cycles) for the DAXPY (i.e. double-precision Y = aX+Y) problem where the operand vectors have 2000 elements each. (12 marks) The RV64V code for DAXPY vector related computation section vld vmul vld vadd vst

v0 , x5 v1, v0, f0 v2, x6 v3, v1, v2 v3, x6

#load vector X, assuming starting address of X is in x5 #vector-scalar multiply, assuming scalar in f0 #load vector Y, assuming starting address of Y is in x6 #vector-vector addition #store result

Laying out the above instructions of question 4 into convoy yields: Convoy 0: Convoy 1:

vld vld

vmul vadd vst

Table 4.1; Vector architecture for one of the architecture lane showing start of pipeline for each data Clk

load 0

0

v0(i)

1

v0(ii)

v0(i)*f0= v1(i)





……………..

7

v0(viii)

v2(i)

v0(vii)*f0=v1(vii)

8

v0(ix)

v2(ii)

v0(viii)*f0=v1(viii)

v1(i) + v2(i) = v3(i)







…………….

…………..

12

v0(xiii)

v2(vi)

v0(xii)*f0=v1(xii)

v1(v) + v2(v)=v3(v)

v3(i)







……………….

………….



32

end of load

v2(xxvi)

v0(xxxii)*f0=v1(xxxii)

v1(xxv) + v2( xxv)=v3(xxv)

v3(xxi)







………………

……………..



38

v2(xxxii)

last cycle of mul task

v1(xxxi) + v2(xxxi)=v3(x xxi)

v3(xxvii)

39

end of load

end of mul task

v1(xxxii) + v2(xxxii)=v3(xxxii)

v3(xxviii)

………………………….

……………………………….

……….

42

last cycle of add task

v3(xxxi)

43

end of add task

v3(xxxii)

44

load 1

mul

add

store

end of store

It takes 44 clock cycles to complete the vectorized section of the loop in the code. The number of times a loop will be called = n / vector length = 2000/64 = 31.25; this gives the number of times the loop will be called as 32 because the fraction 0.25 should be approximated to 1. The total number of clock cycle = 44 * 32 = 1408 cycles.

Question 5

Problem 4.12 in the textbook. (12 marks)

a)

Reads 40 bytes and writes 4 bytes for every 8 FLOPs, thus 8/44 FLOP/byte.

b)

This code performs indirect references through the Ca and Cb arrays, as they are indexed using the contents of the IDx array, which can only be performed at runtime. While this complicates SIMD implementation, it is still possible to perform the type of indexing using gather-type load instructions. The innermost loop (iterates on z) can be vectorized: the values for Ex, dH1, dH2, Ca, and Cb could be operated on as SIMD registers or vectors. Thus, this code is amenable to SIMD and vector execution.

c)

Having an arithmetic intensity of 0.18, if the processor has a peak floating-point throughput > (30 GB/s)x(0.18 FLOPs/byte)=5.4 GFLOPs/s, then this code is likely to be memorybound, unless the working set fits well within the processor’s cache.

d)

The single precision arithmetic intensity corresponding to the edge of the roof is 85/4=21.25 FLOPs/byte.

Question 6

Problem 4.13 in the textbook. (10 marks)

a)

1.5 GHz * 0.80 * 0.85 * 0.70 * 10 cores * 32/4=57.12 GFLOPs/s

b)

Option 1: 1.5 GHz * 0.80 * 0.85 * 0.70 *10 cores * 32/2=114.24 GFLOPs/s (speedup=114.24/57.12=2) Option 2: 1.5 GHz * 0.80 * 0.85 * 0.70 * 15 cores * 32/4 = 85.68 GFLOPs/s (speedup=85.68/57.12=1.5) Option 3: 1.5 GHz * 0.80 * 0.95 * 0.70 * 10 cores * 32/4=63.84 GFLOPs/s (speedup=63.84/57.12=1.11) Option 1 is the best.

Question 7

Problem 4.14 in the textbook. (10 marks)

a)

A loop-carried dependency exists as GCD(2,4)=2 divides 5-4 = 1.

b)

Output dependencies S1 and S3 caused through A[i] Anti-dependencies S4 and S3 have an anti-dependency through C[i]

Re-written code for (i=0;i...


Similar Free PDFs