Homework 1 solutions PDF

Title	Homework 1 solutions
Course	Computer Architecture
Institution	Oklahoma State University
Pages	9
File Size	196.5 KB
File Type	PDF
Total Downloads	7
Total Views	143

Preview

CLICK TO PREVIEW PDF

Summary

Dr. Stine...

Description

OSU ECEN 4243 Computer Architecture, Spring 2018 HW 1: Introduction to Computer Architecture Principles Solutions Instructor: James E. Stine, Jr. TA: Rachana Erra Assigned: Monday, 1/22, 2018 Due Monday 2/19, 2018 (midnight) Handin: http://online.okstate.edu

Please complete the following assigned problems from our textbook Computer Organization and Design: ARM Edition by John Hennessy and David Patterson [1]. Please make sure your hw is submitted as a PDF using the D2L DropBox; there are scanners available in the Edmon Low library, if needed. Throughout this homework and other homeworks, I will use engineering units where appropriate. • Problem 1.6: This problem uses the basic equation for execution and CPI. P1 has a 2.5 GHz clock (400ps cycle time) and P2 has a 3.0 GHz clock (333ps cycle time). Class A B C D Total

Instructions 105 2 × 105 5 × 105 2 × 105 1 × 106

Frequency 10% 20% 50% 20% 100%

Part (a)

t = IC · CP I · Tc tP 1 = ((105 · 1) + (2 · 105 · 2) + (5 · 105 · 3) + (2 · 105 · 3)) · (400 · 10−12 ) = 1.04 msec tP 2

=

((105 · 2) + (2 · 105 · 2) + (5 · 105 · 2) + (2 · 105 · 2)) · (333 · 10−12 ) = 666.0 µsec

To find the CPI, alegebraically solve for CPI CP I CP IP 1

= t ∗ 1/(IC · Tc ) = 1.04 msec · 1/(106 · 400 ps) = 2.6

CP IP 2

= 666 µsec · 1/(106 · 333 ps) = 2.0

Part (b)

cycles

=

IC · CP I

cyclesP 1

=

((105 · 1) + (2 · 105 · 2) + (5 · 105 · 3) + (2 · 105 · 3)) = 2.6 × 106

cyclesP 2

=

((105 · 2) + (2 · 105 · 2) + (5 · 105 · 2) + (2 · 105 · 2)) = 2.0 × 106

Many ways of finding which one is faster, but since both operate at different frequencies, its easier to compare execution times. From part (a), Program P2 is faster! And, its speedup is 1.56 or 35.96% faster! LATEX

1/9

• Problem 1.9 : If you have not had me before, you know that good engineering tools are helpful. I typically use MATLAB very often for much of my work, because it helps engineers organize their data well. And, I have the MATLAB code I wrote for the answers below in Figure 1. – (1.9.1) # p 1 2 4 8

Arith Inst. CPI 2.56E9 1 1.83E9 1 914.29E6 1 457.14E6 1

L/S Inst. CPI 1.28E9 12 914.29E6 12 457.14E6 12 228.57E6 12

B Inst. CPI 256E6 5 256E6 5 256E6 5 256E6 5

Cycles 19.20E9 14.08E9 7.68E9 4.48E9

Execution Freq [GHz] Time [sec] 2.0 9.60 2.0 7.04 2.0 3.84 2.0 2.24

Speedup 1.00 1.36 2.50 4.29

B Inst. CPI 256E6 5 256E6 5 256E6 5 256E6 5

Cycles 21.76E9 15.91E9 8.59E9 4.94E9

Execution Freq [GHz] Time [sec] 2.0 10.88 2.0 7.04 2.0 3.84 2.0 2.24

Speedup 1.00 1.55 2.83 4.86

– (1.9.2) Change CPI for arithmetic to 2 # p 1 2 4 8

Arith Inst. CPI 2.56E9 2 1.83E9 2 914.29E6 2 457.14E6 2

L/S Inst. CPI 1.28E9 12 914.29E6 12 457.14E6 12 228.57E6 12

– (1.9.3) This problem asks what the CPI would be for the Load/Store instruction if it was comparable to the same equations for p = 4. This just involves some simple alegebra: (ICarith · CP Iarith + ICL/S · CP IL/S + ICB · CP IB )p=4 − (ICarith · CP Iarith + ICB · CP IB )p=1 =3 (ICL/S )p=1

LATEX

2/9

• Problem 1.13: This is an important computer architecture point is that you cannot just improve one aspect of an instruction unless it is the most common case and it occurs very frequently. Assuming a program takes 250 sec to execute with 70 sec spent executing Floating-Point (FP) instrucitons, 85 sec executing Load/Store (L/S) instructions and 40 sec executing branch instructions. – (1.13.1) By how much is the total time reduced if the time for FP operations is reduced by 20%. told = 55int + 70f p + 85l/s + 40b = 250 tnew = 55 + 70 · 0.80 + 85 + 40 = 236 Speedup = 250/236 = 1.0593 (5.60%) – (1.13.2) By how much is the time for INT operations reduced if the total time is reduced by 20%? tnew = 250 · 0.80 = 200 tint = 200 − (70f p + 85l/s + 40b ) · 0.80 = 44 Speedup = 55/44 = 1.2500 (20%) – (1.13.3) Can the total time can be reduced by 20% by reducing only the time for branch instructions? tnew = 55int + 70f p + 85l/s + xb = 210 + xb tnew = 210 + xb = (250 · 0.8) tnew = 10 = −xb Obviously, you cannot have a negative time for branches, therefore, the answer is NO.

LATEX

3/9

1 2 3 4 5

% % ECEN 4243 % COD HW 1 % Prob 1.9 %

6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44

clear; clc format long eng % IC A = 2.56E9; L = 1.28E9; B = 256E6; % CPI A CPI = 1; L CPI = 12; B CPI = 5; % Frequency [GHz] freq = 2E9; n = (A(1)* A CPI + L(1)* L CPI + B(1)* B CPI); t = n(1) * (1/freq); speedup = 1.00; % # processors p = [1 2 4 8]; % New cycle time/time for different processors (1.9.1) for i = 2:4 A = [A (A(1) / (p(i) * 0.7))]; L = [L (L(1) / (p(i) * 0.7))]; B = [B B(1)]; n = [n ((A(i)* A CPI + L(i)* L CPI + B(i)* B CPI))]; t = [t (n(i) * (1/2E9))]; speedup = [speedup t(1)/t(i)]; end % Change CPI arithmetic x2 (1.9.2) A2 CPI = 2; n2 = (A(1)*A2 CPI + L(1)* L CPI + B(1)* B CPI); t2 = n2(1) * (1/freq); speedup2 = 1.0; for i = 2:4 n2 = [n2 ((A(i)*A2 CPI + L(i)* L CPI + B(i)* B CPI))]; t2 = [t2 (n(i) * (1/2E9))]; speedup2 = [speedup2 t2(1)/t2(i)]; end % (1.9.3) ((A(3)* A CPI + L(3)* L CPI + B(3)* B CPI) - (A(1)* A CPI + B(1)* B CPI))/L(1)

Figure 1: MATLAB code for Problem 1.9

LATEX

4/9

• Problem 1.14 : Assume a program requires the execution of 50 × 106 FP instructions, 110 × 106 INT instructions, 80 × 106 L/S instructions, and 16 × 106 branch instructions. The CPI for each type of instruction is 1, 1, 4, and 2, respectively. Assume that the processor has a 2 GHz clock rate. – (1.14.1) By how much must we improve the CPI of FP instructions if we want the program to run two times faster? This actually took me some to figure out what they wanted here. I interpreted this as trying to reduce only FP instructions while keeping the others static. cycles

=

CP IF P · ICF P + CP IIN T · ICI N T + CP IL/S · ICL/S + CP IB · ICB = 512 × 106

tcpu

=

cycles · (500 × 10−12 ) = 256 × 10−3

Now the fun part : to have the number of clock cycles by improving the CPI of FP instructions: cycles CP IF P,impr · ICF P + CP IIN T · CP II N T · ICIN T + CP IL/S · ICL/S + CP IB · ICB = 2   1 cycles − (CP IIN T · ICI N T + CP IL/S · IC L/S + CP IB · ICB · CP IF P,impr = ICF P 2 Just like previously, I like using MATLAB for computations and plus it puts things in engineering units, which I think all engineers should use because it keeps a common language. Figure 2 shows the MATLAB code. I did not show the other MATLAB code for the other parts, because it is extremely similar. This equation produces CP IF P,impr = −4.12 which is impossible, so this is not possible. – (1.14.2) By how much must we improve the CPI of L/S instructions if we want the program to run two times faster? Same idea, but using Load/Store. cycles CP IF P · ICF P + CP IIN T · CP II N T · ICIN T + CP IL/S,impr · ICL/S + CP IB · ICB = 2   cycles 1 CP IL/S,impr = − (CP IF P · ICF P + CP IIN T · CP II N T + +CP IB · ICB · ICL/S 2 This is realizable as opposed to (1.14.1) and gives CP IL/S,impr = 0.80 – (1.14.3) By how much is the execution time of the program improved if the CPI of INT and FP instructions is reduced by 40% and the CPI of L/S and Branch is reduced by 30%? Using the same equations and reducing CPI for INT, FP, L/S and Branch results in: cycles

=

0.60 · CP IF P · ICF P + 0.60 · CP IIN T · CP II N T · ICIN T + 0.70 · CP IL/S · ICL/S + 0.70 · CP IB · ICB = 342.4 × 106

tcpu

=

Speedup =

LATEX

cycles · (500 × 10−12 ) = 171.2 × 10−3 0.256/0.1712 = 1.4953 (33.125%)

5/9

% % ECEN 4243 % COD HW 1 % Prob 1.14.1 % clear;clc format long eng

1 2 3 4 5 6 7 8

IC fp = 50E6; IC int = 110E6; IC ls = 80E6; IC br = 16E6;

9 10 11 12 13

CPI CPI CPI CPI

14 15 16 17

fp = 1; int = 1; ls = 4; br = 2;

18

freq1 = 2E9; % original t1 = (IC fp * CPI fp + IC int * CPI int + IC ls * CPI ls + ... IC br * CPI br) * (1/freq1) cycles1 = (IC fp * CPI fp + IC int * CPI int + IC ls * CPI ls + ... IC br * CPI br) % run 2x faster freq2 = 4E9; CPI fp2 = ((cycles1/2) - (IC int * CPI int + IC ls * CPI ls + ... IC br * CPI br)) * (1/IC fp)

19 20 21 22 23 24 25 26 27 28

Figure 2: MATLAB code for Problem 1.14.1 • Problem 2.1 : For the following C statement, write the corresponding LEGv8 assembly code. Assume that the C variables f, g, and h, have already been placed in registers X0, X1, and X2 respectively. Use a minimal number of LEGv8 assembly instructions. f = g + (h - 5);

For the class and labs discussion, we mainly deal with the ARM nomenclature (called Arm4 architecture or sometimes arm7tdmi - we have been calling it LEGv8 in class). For the COD book they discuss a more recent version of the ARM architecture. There is very little difference between what we discuss in class and the COD book except for changes in register names and instruction mnemonics. addi add

f, h, -5 @ (note, no subi) f, f, g

• Problem 2.2 : Write a single C statement that corresponds to the two LEGv8 assembly instructions below. ADD f, g, h ADD f, i, f

This should easily be seen as: f

LATEX

=

g+h+i

6/9

• Problem 2.3 : For the following C statement, write the corresponding LEGv8 assembly code. Assume that the variables f, g, h, i, and j are assigned to registers X0, X1, X2, X3, and X4, respectively. Assume that the base address of the arrays A and B are in registers X6 and X7, respectively. B[8] = A[i-j];

This one is subjective and here is one solution I wrote: SUB LSL ADD LDUR STUR

X9, X3, X4 X9, X9,#3 X11, X6, X9 X10, [X11, #0] X10, [X7, #64]

@ @ @ @ @

compute i-j multx8 to conv word offset to byte offset compute &A[i-j] load A[i-j] store in B[8]

• Problem 2.10 : For each LEGv8 instruction in Exercise 2.9, show the value of the opcode (Op), source register (Rn), and target register (Rd or Rt) fields. For the I-type instructions, show the value of the immediate field, and for the R-type instructions, show the value of the second source register (Rm). Instruction ADDI X9, X6, #8 ADD X10, X6, XZR STUR X10, [X9, #0] LDUR X9, [X9, #0] ADD X0, X9, X10

LATEX

Type I-type R-type D-type D-type R-type

opcode 580/0x244 1112/0x458 1984/0x7c0 1986/0x7c2 1112/0x458

rm – 31 – – 10

rn 6 6 9 9 9

rd/rt 9 10 10 9 0

imm/addr 8 – 0 0 –

7/9

• Problem 2.18 : Assume the following register contents: X10 = 0x00000000AAAAAAAA, X11 = 0x1234567812345678 – (2.18.1) For the register values shown above, what is the value of X12 for the following sequence of instructions? LSL X12, X10, #4 ORR X12, X12, X11 This should look like our instructions from class except the 64-bit data. The answer should be: 0x1234567ababefef8 – (2.18.2) For the register values shown above, what is the value of X12 for the following sequence of instructions? LSL X12, X11, #4 Again, another easy one: the answer should be 0x2345678123456780 – (2.18.3) For the register values shown above, what is the value of X12 for the following sequence of instructions? LSR X12, X10, #3 ANDI X12, X12, 0xFEF Again, another easy one: the answer should be: 0x0000000000000545 • Problem 2.22 : Assume X0 holds the value 00000000000101000. What is the value of X1 after the following instructions? cmp x0, #0 b.ge else b done else: orri x1, xzr, #2 done:

Another easy one by interpreting branches correct. The answer should be: X1 = 2. • Problem 3.9 : Assume 151 and 214 are signed 8-bit decimal integers stored in twos complement format. Calculate 151+214 using saturating arithmetic. The result should be written in decimal. Show your work. Two’s complement arithmetic is hard to understand initially, but we have seen in class that its power is that it works easily for addition and subtraction. For this problem, just compute the answer according to the two step sequence discusssed in class and learned in earlier classes. Although calculators are extremely useful, they really, in my opinion, hamper the process in learning this material. So, I highly advise understanding how to convert to/from two’s complement and its associcated arithmetic by hand on paper. Binary numbers include underscores to help read the number more easily. An important aspect is that I have to tell you it is a two’s complement number otherwise it could be read as an unsigned number.

LATEX

8/9

Value 1 2 1+2

Decimal 151 = −105 214 = −42 −147 = +109

Binary 1001_0111 1101_0110 0110_1101

Hex 97 D9 6D

As you can see from the table above, these numbers 151 and 214 would be interpreted as negative numbers −105 and −42, respectively. Therefore, if you added the numbers together (i.e., −105 − 42 would result in −147) you would get an overflow. As explained in class, one of the side effects of two’s complement number is that it is asymetrical. For 8-bit two’s complement numbers, the range is [−128, +127] and this result would obviously overflow as mentioned previously. Therefore, the answer if using saturating arithmetic would saturate it to the maximum negative number which is −128. • Problem 3.14 : Calculate the time necessary to perform a multiply using the approach given in Figures 3.3 and 3.4 if an integer is 8 bits wide and each step of the operation takes four time units. Assume that in step 1a an addition is always performedeither the multiplicand will be added, or a zero will be. Also assume that the registers have already been initialized (you are just counting how long it takes to do the multiplication loop itself). If this is being done in hardware, the shifts of the multiplicand and multiplier can be done simultaneously. If this is being done in software, they will have to be done one after the other. Solve for each case. Although addition and subtraction work quite well in two’s complement arithmetic, traditional multiplication and division are quite difficult to implement. This problem hopefuly illustrates this. For hardware, it takes one cycle to do the add, one cycle to do the shift, and one cycle to decide if we are done. So the loop takes (3 × A) cycles, with each cycle being B time units long. For a so ware implementation, it takes one cycle to decide what to add, one cycle to do the add, one cycle to do each shi , and one cycle to decide if we are done. So the loop takes (5 × A) cycles, with each cycle being B time units long.

hw

: (3 × 8) × 4 = 96

sw

: (5 × 8) × 4 = 160

References [1] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware Software Interface ARM Edition. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1st ed., 2016.

LATEX

9/9...