Title | Solutions Assignment #3 |
---|---|
Course | high performance computing architecture |
Institution | Memorial University of Newfoundland |
Pages | 15 |
File Size | 195.9 KB |
File Type | |
Total Downloads | 13 |
Total Views | 136 |
assignment solutions 3...
ENGR9861 High Performance Computer Architecture Assignment #3 Solution :: RV & AO Jul. 31, 2019
Question 0
The following is a program written in RV64V. Carefully go through the code and
concisely describe what is being accomplished here. Please do not write an instruction-by-instruction explanation. Remember that the vector length register (vl) is defaulted to the maximum vector length, that is, the number of elements in each vector register. In this problem mvl is 64. Calculate the arithmetic intensity. (8 marks) vsetdcfg
6 SP FP
# six 32-bit FP vector registers
add
x1, x0, x0
# initialize x1 to 0
add
x2, x0, 1200
# loop count n
x3, x2
# vl = x3 = min(mvl, n)
vld
v1, a_re+x1
# load a_re (memory address is a_re + contents of x1)
vld
v3, b_re+x1
# load b_re
vmul
v5, v1,v3
# a_re*b_re
vld
v2, a_im+x1 # load a_im
vld
v4, b_im+x1 # load b_im
vmul
v6, v2, v4
# a_im*b_im
vsub
v5, v5, v6
# a_re*b_re – a_im*b_im
vst
v5, c_re+x1
# store c_re
vmul
v5, v1, v4
# a_re*b_im
vmul
v6, v2, v3
# a_im*b_re
vadd
v5, v5, v6
# a_re*b_im + a_im*b_re
vst
v5, c_im+x1 # store c_im
slli
x4, x3, 2
# x4 = vl * 4 (in bytes)
add
x1, x1, x4
# update the pointer to memory addresses
sub
x2, x2, x3
# update remaining count: n -= vl
bgt
x2, x0, loop
# repeat if n>0
loop: setvl
vdisable
(a) The code peforms the following: for (i=0; i0
loop: setvl
vdisable
Question 4
A vector processor consists of 32 vector registers, each of which holds 64 elements;
each element is 64 bits wide. The processor contains two vector load units, one vector store unit, one vector FP add/sub unit, one vector FP multiply unit, one vector FP divide unit, one integer unit and one logical unit. All functional units can work with DPFP numbers, are fully pipelined, and employ two-lane architecture. Each vector as well as scalar register is connected to each functional unit using a crossbar network. Determine the number of convoys, chaining required in each convoy, chime and computation time (in clock cycles) for the DAXPY (i.e. double-precision Y = aX+Y) problem where the operand vectors have 2000 elements each. (12 marks) The RV64V code for DAXPY vector related computation section vld vmul vld vadd vst
v0 , x5 v1, v0, f0 v2, x6 v3, v1, v2 v3, x6
#load vector X, assuming starting address of X is in x5 #vector-scalar multiply, assuming scalar in f0 #load vector Y, assuming starting address of Y is in x6 #vector-vector addition #store result
Laying out the above instructions of question 4 into convoy yields: Convoy 0: Convoy 1:
vld vld
vmul vadd vst
Table 4.1; Vector architecture for one of the architecture lane showing start of pipeline for each data Clk
load 0
0
v0(i)
1
v0(ii)
v0(i)*f0= v1(i)
…
…
……………..
7
v0(viii)
v2(i)
v0(vii)*f0=v1(vii)
8
v0(ix)
v2(ii)
v0(viii)*f0=v1(viii)
v1(i) + v2(i) = v3(i)
…
…
…
…………….
…………..
12
v0(xiii)
v2(vi)
v0(xii)*f0=v1(xii)
v1(v) + v2(v)=v3(v)
v3(i)
…
…
…
……………….
………….
…
32
end of load
v2(xxvi)
v0(xxxii)*f0=v1(xxxii)
v1(xxv) + v2( xxv)=v3(xxv)
v3(xxi)
…
…
…
………………
……………..
…
38
v2(xxxii)
last cycle of mul task
v1(xxxi) + v2(xxxi)=v3(x xxi)
v3(xxvii)
39
end of load
end of mul task
v1(xxxii) + v2(xxxii)=v3(xxxii)
v3(xxviii)
………………………….
……………………………….
……….
42
last cycle of add task
v3(xxxi)
43
end of add task
v3(xxxii)
44
load 1
mul
add
store
end of store
It takes 44 clock cycles to complete the vectorized section of the loop in the code. The number of times a loop will be called = n / vector length = 2000/64 = 31.25; this gives the number of times the loop will be called as 32 because the fraction 0.25 should be approximated to 1. The total number of clock cycle = 44 * 32 = 1408 cycles.
Question 5
Problem 4.12 in the textbook. (12 marks)
a)
Reads 40 bytes and writes 4 bytes for every 8 FLOPs, thus 8/44 FLOP/byte.
b)
This code performs indirect references through the Ca and Cb arrays, as they are indexed using the contents of the IDx array, which can only be performed at runtime. While this complicates SIMD implementation, it is still possible to perform the type of indexing using gather-type load instructions. The innermost loop (iterates on z) can be vectorized: the values for Ex, dH1, dH2, Ca, and Cb could be operated on as SIMD registers or vectors. Thus, this code is amenable to SIMD and vector execution.
c)
Having an arithmetic intensity of 0.18, if the processor has a peak floating-point throughput > (30 GB/s)x(0.18 FLOPs/byte)=5.4 GFLOPs/s, then this code is likely to be memorybound, unless the working set fits well within the processor’s cache.
d)
The single precision arithmetic intensity corresponding to the edge of the roof is 85/4=21.25 FLOPs/byte.
Question 6
Problem 4.13 in the textbook. (10 marks)
a)
1.5 GHz * 0.80 * 0.85 * 0.70 * 10 cores * 32/4=57.12 GFLOPs/s
b)
Option 1: 1.5 GHz * 0.80 * 0.85 * 0.70 *10 cores * 32/2=114.24 GFLOPs/s (speedup=114.24/57.12=2) Option 2: 1.5 GHz * 0.80 * 0.85 * 0.70 * 15 cores * 32/4 = 85.68 GFLOPs/s (speedup=85.68/57.12=1.5) Option 3: 1.5 GHz * 0.80 * 0.95 * 0.70 * 10 cores * 32/4=63.84 GFLOPs/s (speedup=63.84/57.12=1.11) Option 1 is the best.
Question 7
Problem 4.14 in the textbook. (10 marks)
a)
A loop-carried dependency exists as GCD(2,4)=2 divides 5-4 = 1.
b)
Output dependencies S1 and S3 caused through A[i] Anti-dependencies S4 and S3 have an anti-dependency through C[i]
Re-written code for (i=0;i...