Tut 10 floating point solutions PDF

Title	Tut 10 floating point solutions
Author	Ali Salimy
Course	Digital and Programmable Systems 2
Institution	Glasgow Caledonian University
Pages	2
File Size	172.2 KB
File Type	PDF

Summary

floating point tutorial answers...

Description

M3H623544 Digital and Programmable Systems 2

Tutorial 10 – Fixed and Floating Point – Solutions

1. Explain the meaning of [U]Qp.q notation.

Q indicates a "fixed point" representation. Then:
- q is the number of bits used to encode the fractional part. This is the most important number, since it configures the full representation. The resolution is $r = 2^{-q}$, and the encoded value $A$ representing the number $a_u$ is $A = \mathrm{int}(a_u \times 2^q)$.
- U indicates that the representation is unsigned; for a signed representation the U is dropped.
- p is the number of bits remaining to encode the integer part. If the total number of bits is n, then $p = n - q$ (for unsigned representation) or $p = n - 1 - q$ (for signed representation, using 2's complement).

2. Represent 2.38 in UQ2.6 and calculate the representation error. Could it be represented in UQ1.7?

UQ2.6: q = 6, so $A = \mathrm{int}(2^6 \times 2.38) = 152 = 10011000_2$ = 0x98. The represented number is actually $a_u = 152 / 2^6 = 2.375$, so the error is $2.38 - 2.375 = 0.005$.
UQ1.7: q = 7, so $A = \mathrm{int}(2^7 \times 2.38) = 304 > 255$. It cannot be represented in 8 bits.

3. What value is 0xA3 representing in UQ.8? And in Q.7?

0xA3 $= 163 = 10100011_2$. In UQ.8 no sign is considered, so it is a positive number. The encoded number is $a_u = 163 / 2^8 = 0.63671875$.
In Q.7 we use 2's complement, so 0xA3 $= 10100011_2$ (2's complement) $= -93$. The encoded number is then $a_u = -93 / 2^7 = -0.7265625$.

4. Represent the number π using 16-bit fixed-point, choosing the format that gives the smallest error. Calculate the error.

The number to encode is 3.14…, which has an integer part of 3. To encode this integer part at least 2 bits are needed, so $p \geq 2$. To leave as many bits as possible for q, we choose p = 2 and an unsigned representation, so q = 16 - 2 = 14. We'll use UQ2.14.
The encoded value is then $A = \mathrm{int}(2^{14} \times \pi) = 51471$ = 0xC90F. The encoded number is actually $a_u = 51471 / 2^{14} = 3.14154052734375$. The error is then ≈ 5.2e-5.

5. What is the difference in range and precision between single and double precision floating point values using IEEE 754?

Single precision: 32 bits (1 sign + 8 exponent (bias 127) + 23 mantissa). Range = $\pm 1.111\ldots11_2 \times 2^{127}$. Precision (minimum non-zero normalized value) is $1.0 \times 2^{-126}$.
Double precision: 64 bits (1 sign + 11 exponent (bias 1023) + 52 mantissa). Range = $\pm 1.111\ldots11_2 \times 2^{1023}$. Precision (minimum non-zero normalized value) is $1.0 \times 2^{-1022}$.
The exponent fields 0 and all ones are reserved, so the normalized exponents run from -126 to +127 (single) and from -1022 to +1023 (double). In terms of significant digits, the 23-bit and 52-bit mantissas give about 7 and 15 to 16 decimal digits respectively.

6. What is the value 1 10000001 10101000000000000000000 in decimal, if it is defined in IEEE 754 single precision?

Sign bit = 1, so the value is negative.
Exponent = $10000001_2 = 129$. Bias is 127, so e = 129 - 127 = 2.
Mantissa (with the hidden leading 1 restored) = (1.)10101
Number = $-1.10101_2 \times 2^2 = -110.101_2 = -6.625_{10}$

G Morison / M Mata
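The fixed-point answers to questions 2 to 4 can be checked with a few lines of Python. This is only a sketch: the helper names `to_fixed` and `from_fixed` are hypothetical (they do not come from the tutorial), and the encoder truncates with `int()` as the worked answers do, rather than rounding to nearest.

```python
import math

def to_fixed(value, q, total_bits=8, signed=False):
    """Encode value as A = int(value * 2**q); return None if A does not fit.

    Hypothetical helper: truncates like the worked answers (no rounding).
    """
    a = int(value * (1 << q))
    if signed:
        lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << total_bits) - 1
    return a if lo <= a <= hi else None

def from_fixed(raw, q, total_bits=8, signed=False):
    """Decode a raw bit pattern (given as a non-negative integer)."""
    if signed and raw >= 1 << (total_bits - 1):
        raw -= 1 << total_bits          # undo 2's complement
    return raw / (1 << q)

print(to_fixed(2.38, 6))                    # 152 (0x98), UQ2.6
print(to_fixed(2.38, 7))                    # None: 304 does not fit in 8 bits
print(from_fixed(0xA3, 8))                  # 0.63671875 (UQ.8)
print(from_fixed(0xA3, 7, signed=True))     # -0.7265625 (Q.7)

a_pi = to_fixed(math.pi, 14, total_bits=16) # UQ2.14
print(hex(a_pi))                            # 0xc90f
print(math.pi - from_fixed(a_pi, 14, total_bits=16))  # representation error
```

Replacing `int(...)` with `round(...)` would give round-to-nearest encoding, which can halve the worst-case error but no longer matches the truncated values above.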

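The field-by-field decoding used in question 6 can be sketched in Python as follows. The function name `decode_single` is an assumption made for illustration, and the sketch handles normalized numbers only; `struct` from the standard library is used purely as a cross-check.

```python
import struct

def decode_single(bits):
    """Rebuild a normalized IEEE 754 single-precision value from its fields.

    Hypothetical helper for illustration; zero, subnormals, Inf and NaN
    are deliberately not handled.
    """
    sign = bits >> 31                  # 1 bit
    exp_field = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    frac = bits & 0x7FFFFF             # 23 stored mantissa bits
    if exp_field in (0, 0xFF):
        raise ValueError("reserved exponent field, not a normalized number")
    mantissa = 1 + frac / (1 << 23)    # restore the hidden leading 1
    return (-1) ** sign * mantissa * 2.0 ** (exp_field - 127)

# Question 6: sign | exponent | mantissa
bits = int("1" "10000001" "10101000000000000000000", 2)
value = decode_single(bits)
reference = struct.unpack(">f", bits.to_bytes(4, "big"))[0]
print(hex(bits), value, reference)     # 0xc0d40000 -6.625 -6.625

# Largest finite and smallest normalized single-precision patterns
print(decode_single(0x7F7FFFFF))       # (2 - 2**-23) * 2**127
print(decode_single(0x00800000))       # 2**-126
```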


7. Convert -35.75 to its binary and hexadecimal representation in IEEE floating point format.

$-35.75_{10} = -100011.11_2 = -1.0001111_2 \times 2^5$, so sign bit = 1, e = 5, normalized mantissa = 0001111.
e = 5 and bias = 127, so Exp = 127 + 5 = 132 = $10000100_2$.
The floating point representation is then 1 10000100 00011110000000000000000 = 0xC20F0000.

8. Convert the hexadecimal IEEE format floating point number 0x40200000 to decimal.

0x40200000 = 0100 0000 0010 0000 0000 0000 0000 0000 = 0 10000000 01000…0
So the sign bit is 0 (positive); E = $10000000_2 = 128$, so e = 128 - 127 = 1; mantissa with the hidden leading 1 restored = (1.)0100…0.
The number is then $+1.01_2 \times 2^1 = +10.1_2 = +2.5_{10}$.

9. Write the number π = 3.1415 in double precision floating point (the binary representation is infinite, so calculate only 8 binary digits for the fractional part).

$\pi = 3.1415_{10} \approx 11.00100100_2 = 1.1001001_2 \times 2^1$
sign bit = 0
e = 1, with bias 1023, so E = 1023 + 1 = 1024 = $10000000000_2$ (11 bits)
normalized mantissa = 100100100…0 (52 bits)
The floating point representation is then 0 10000000000 100100100…0000000 = 0x4009200000000000.

10. Explain what NaN and Inf are within the IEEE 754 floating point standard.

These are reserved combinations of mantissa and exponent. Instead of encoding a normalized number, they represent the result of an expression that can't be evaluated as a number (Not a Number), or the result of a division by 0 (Infinity). Infinity is represented with the maximum value for E ($2^n - 1$, where n is the number of exponent bits, i.e. all ones) and mantissa = 0. NaN is also represented with the maximum value for E ($2^n - 1$), but with a non-zero mantissa.
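The conversions in questions 7 to 9 can be cross-checked with Python's standard `struct` module, which packs floats in IEEE 754 format. The wrapper names below are hypothetical. For question 9 the input is the truncated value $11.001001_2 = 3.140625$, not 3.1415 itself, because the worked answer keeps only 8 fractional binary digits.

```python
import struct

def float_to_hex32(x):
    """Big-endian single-precision bit pattern as a hex string (hypothetical helper)."""
    return struct.pack(">f", x).hex().upper()

def float_to_hex64(x):
    """Big-endian double-precision bit pattern as a hex string (hypothetical helper)."""
    return struct.pack(">d", x).hex().upper()

def hex32_to_float(h):
    """Interpret 8 hex digits as a single-precision float (hypothetical helper)."""
    return struct.unpack(">f", bytes.fromhex(h))[0]

print(float_to_hex32(-35.75))       # C20F0000          (question 7)
print(hex32_to_float("40200000"))   # 2.5               (question 8)

truncated_pi = 0b11001001 / 2**6    # 11.001001 in binary = 3.140625
print(float_to_hex64(truncated_pi)) # 4009200000000000  (question 9)
```

Packing 3.1415 directly would fill in the lower mantissa bits as well, so it would not reproduce the truncated pattern from the worked answer.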

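The reserved encodings described in question 10 can be inspected in single precision with a short Python sketch. The helper name `fields32` is an assumption for illustration. Python raises `ZeroDivisionError` for a float division by zero instead of returning infinity, so the example produces Inf through overflow and NaN through Inf - Inf.

```python
import math
import struct

def fields32(x):
    """Return (sign, exponent field, mantissa field) of x as single precision.

    Hypothetical helper for illustration.
    """
    bits = int.from_bytes(struct.pack(">f", x), "big")
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

overflow = 1e308 * 10            # too large for a double: becomes inf
not_a_number = overflow - overflow   # inf - inf is undefined: becomes nan

inf_fields = fields32(overflow)
nan_fields = fields32(not_a_number)

print(inf_fields)                # exponent field all ones (255), mantissa 0
print(nan_fields[1], nan_fields[2] != 0)   # 255 True: all ones, mantissa non-zero
print(not_a_number == not_a_number)        # False: NaN never compares equal
```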
