
MA32005: FUNDAMENTALS OF SCIENTIFIC COMPUTING: LECTURE NOTES

1. Introduction and Finite Precision Arithmetic

1.1. Introduction: Why Scientific Computing? You have already encountered several kinds of problems that may not admit a closed-form solution. Examples:

• Integration: $\int \frac{\cos(x)}{\sqrt{x^2 + 1}}\,dx$;

• ODEs (Ordinary Differential Equations): $x'(t) = 1 - \frac{1}{2}\left(\sin\left(\frac{\pi x(t)}{2}\right) + x^2(t)\right)$ with $x(0) = \sqrt{2}$;

• Roots of polynomials: the roots of a general polynomial of degree 5 or higher have no closed-form formulas, e.g., $x^5 - x - 1$. Even for polynomials of degree 3 or 4 the closed-form formulas are too complicated to remember.

Moreover, even expressions that we consider to be in "closed form" need to be evaluated numerically. Examples: $e^5$, $\sin\left(\frac{4}{3}\right)$, $\pi$, $\det(A)$ for a known large matrix $A$.

Because of this, it is essential to:

(1) Develop algorithms for efficient and reliable computations;

(2) Be able to analyse the behaviour of those algorithms (e.g., prove that they are indeed efficient and reliable).

1.2. A striking example: Computing π. As strange as it looks at first glance, mathematically equivalent algorithms do not always behave the same computationally.

Example (Archimedes' sequence): It is known that the Archimedes sequence

(1.1)  $y_1 = 2, \qquad y_{n+1} = 2^n \sqrt{2\left(1 - \sqrt{1 - (y_n/2^n)^2}\right)}, \qquad n = 1, 2, \ldots,$

converges to $\pi$: $\lim_{n \to \infty} y_n = \pi$. However, implementing this recursive formula numerically gives, for large $n$, an answer that has nothing to do with $\pi$! (cf. Figure 1). Archimedes' sequence (1.1) can be rewritten equivalently as

(1.2)  $z_1 = 2, \qquad z_{n+1} = \frac{2 z_n}{\sqrt{2\left(1 + \sqrt{1 - (z_n/2^n)^2}\right)}}, \qquad n = 1, 2, \ldots$

In particular, $y_n = z_n$, $n = 1, 2, \ldots$. Implementing the mathematically equivalent formula (1.2) numerically gives, for large $n$, an answer that is very close to $\pi$! (cf. Figure 2).


Figure 1. Approximations of $\pi$ using the unstable formula (1.1) and the corresponding error.

Figure 2. Approximations of $\pi$ using the stabilised version (1.2) of Archimedes' formula and the corresponding error.
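Before analysing what goes wrong, it is instructive to try both recursions on a machine. Below is a minimal Python sketch (not part of the notes; the function names are ours) of formulas (1.1) and (1.2):

```python
# A minimal sketch (not part of the notes) comparing the two mathematically
# equivalent recursions in double precision.
import math

def archimedes_unstable(n_steps):
    """Formula (1.1): suffers cancellation in 1 - sqrt(1 - s**2)."""
    y = 2.0
    for n in range(1, n_steps + 1):
        s = y / 2.0**n
        y = 2.0**n * math.sqrt(2.0 * (1.0 - math.sqrt(1.0 - s * s)))
    return y

def archimedes_stable(n_steps):
    """Formula (1.2): the cancellation has been rationalised away."""
    z = 2.0
    for n in range(1, n_steps + 1):
        s = z / 2.0**n
        z = 2.0 * z / math.sqrt(2.0 * (1.0 + math.sqrt(1.0 - s * s)))
    return z

for n in (10, 20, 30):
    print(n, archimedes_unstable(n), archimedes_stable(n))
# The unstable column drifts away from pi and eventually collapses to 0,
# while the stable column settles at math.pi to full precision.
```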

Why is that? To understand what went wrong in the first formula, we need to understand how numbers are treated in the computer. (See also Problem 1 on Assignment 1.)

1.3. Number Systems. The number system we are used to is the decimal system; in this system the number 10 is called the base. In fact, every decimal number can be written using the digits 0, 1, 2, . . . , 9.

Example: 1234.5678 is shorthand notation for
$1 \times 10^3 + 2 \times 10^2 + 3 \times 10^1 + 4 \times 10^0 + 5 \times 10^{-1} + 6 \times 10^{-2} + 7 \times 10^{-3} + 8 \times 10^{-4}.$

In fact, we can take any other positive number $N > 1$ and create other number systems with base $N$:


System name      Base (N > 1)    Digits used (0, 1, . . . , N − 1)
binary           N = 2           0, 1
octal            N = 8           0, 1, 2, 3, 4, 5, 6, 7
decimal          N = 10          0, 1, 2, 3, 4, 5, 6, 7, 8, 9
hexadecimal      N = 16          0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f

Note: Computers use the binary system.

Example: Expand the shorthand notation of the following numbers:
(i) $(1011.011)_2 = 1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3}$.
(ii) $(2f.9b)_{16} = 2 \times 16^1 + 15 \times 16^0 + 9 \times 16^{-1} + 11 \times 16^{-2}$.

Example: Convert from one system to another:
(i) $(1011.011)_2$ to the decimal system:
$(1011.011)_2 = 1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 + 0 \times 2^{-1} + 1 \times 2^{-2} + 1 \times 2^{-3} = 8 + 2 + 1 + \frac{1}{4} + \frac{1}{8} = 11.375 = (11.375)_{10}.$

(ii) 156 to the binary system: To write the integer 156 in the binary system we proceed as follows. First we divide 156 by the base 2 and write 156 as the quotient (78 in this case) multiplied by the divisor (2 in this case) plus the remainder (0 in this case). Next we take the quotient obtained (78), divide it by the base 2, and write it as the new quotient (39) multiplied by the divisor (2) plus the remainder (0). We repeat the procedure until the quotient becomes 0. Then we write the remainders starting from the last division and going up to the first; what we get is the number in the new (binary, in this case) number system. In particular we have:

156 = 78 × 2 + 0
 78 = 39 × 2 + 0
 39 = 19 × 2 + 1
 19 =  9 × 2 + 1
  9 =  4 × 2 + 1
  4 =  2 × 2 + 0
  2 =  1 × 2 + 0
  1 =  0 × 2 + 1.

Thus: $(156)_{10} = (10011100)_2$.
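The repeated-division procedure is easy to mechanise. Here is a minimal Python sketch (the helper name int_to_base is ours, not from the notes):

```python
# A minimal sketch (helper name ours, not from the notes) of the
# repeated-division procedure for a non-negative integer.
def int_to_base(x, base=2):
    if x == 0:
        return "0"
    digits = "0123456789abcdef"    # enough for bases up to 16
    out = []
    while x > 0:
        x, r = divmod(x, base)     # x becomes the quotient, r the remainder
        out.append(digits[r])
    return "".join(reversed(out))  # remainders read from last division to first

print(int_to_base(156, 2))   # '10011100'
print(int_to_base(47, 16))   # '2f'
```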

(iii) 17.125 to the binary system: First, we write 17.125 as the sum of its integer and decimal parts: 17.125 = 17 + 0.125. Then we write 17 and 0.125 in the binary system and add the results we found to get the representation of 17.125. For 17 we proceed as in the previous example:

17 = 8 × 2 + 1
 8 = 4 × 2 + 0
 4 = 2 × 2 + 0
 2 = 1 × 2 + 0
 1 = 0 × 2 + 1.

Thus: $17 = (10001)_2$.


To write the decimal part (0.125 in this case) in the binary system we proceed as follows: We multiply the decimal part (0.125 in this case) by the base 2 and write the result as an integer part plus a decimal part. We take the new decimal part (0.25 in this case), multiply it again by the base 2, and write the result as an integer part plus a decimal part. We repeat the procedure until the decimal part becomes 0. Then we write the integer parts starting from the first multiplication and going down to the last; what we get is the number in the new (binary, in this case) number system. In particular we have:

0.125 × 2 = 0 + 0.25
0.25  × 2 = 0 + 0.5
0.5   × 2 = 1 + 0.

Thus: $0.125 = (0.001)_2$.

Note: $0.125 = \frac{1}{2^3}$, so the procedure for writing the decimal part in the binary system could have been avoided. Therefore, $17.125 = (10001.001)_2$.

(iv) $(1.8)_{10}$ to the binary system: Proceeding as in the previous example, we get:
$(1.8)_{10} = (1.1100\,1100\cdots)_2$, where the pattern 1100 repeats.

We notice that in this case we get a non-terminating (infinite) binary representation, although the representation in the decimal system is finite.
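The repeated-multiplication procedure can also be sketched in Python (the helper name frac_to_base is ours, not from the notes); a cap on the number of digits is needed precisely because of non-terminating expansions such as that of 1.8:

```python
# A minimal sketch (helper name ours, not from the notes) of the
# repeated-multiplication procedure for the fractional part. The digit cap
# is needed because, as with 0.8, the expansion may never terminate.
def frac_to_base(x, base=2, max_digits=20):
    assert 0.0 <= x < 1.0
    out = []
    while x != 0.0 and len(out) < max_digits:
        x *= base
        digit = int(x)            # the integer part is the next digit
        out.append(str(digit))
        x -= digit                # keep only the new fractional part
    return "0." + "".join(out)

print(frac_to_base(0.125))   # '0.001' (terminating)
print(frac_to_base(0.8))     # '0.11001100110011001100': the pattern 1100 repeats
```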

In general, in the number system with base $N > 1$, any positive real number $x$ has a representation of the form:

(1.3)  $x = a_m N^m + a_{m-1} N^{m-1} + \cdots + a_1 N^1 + a_0 N^0 + a_{-1} N^{-1} + a_{-2} N^{-2} + \cdots = \sum_{i=0}^{\infty} a_{m-i} N^{m-i},$

where $0 \le a_{m-i} < N$ for all $i \ge 1$ and $0 < a_m < N$.

Question: Is the above representation of $x$ unique? Prove it or give a counter-example.

The long (possibly infinite) representation of the number $x$ is not feasible in computations; in a computer we can only deal with numbers that have a terminating representation, i.e., of the following form:

$x = \sum_{i=0}^{n-1} a_{m-i} N^{m-i},$

where $0 \le a_{m-i} < N$ and $0 < a_m < N$. The numbers $a_{m-i}$, $i = 0, 1, \ldots, n-1$, are called significant digits, and the number $x$ is said to have $n$ significant digits.
⋆ If $N = 10$, then $a_{m-i}$, $i = 0, 1, \ldots, n-1$, are called significant decimal digits.

1.4. Chopping and Rounding. If a number $x$ has more significant digits than we can actually handle or wish to have, then we need to replace $x$ by an approximate number $x^*$ which contains a smaller number of digits, say $n$. There are two ways of obtaining $x^*$ from $x$: chopping and rounding. In chopping we retain only the first $n$ significant digits of $x$. In rounding, we follow these rules:

(i) Retain the first $n$ significant digits;


(ii) If the (n + 1)th significant digit is less than 5, we leave the nth significant digit unchanged; if the (n + 1)th significant digit is greater or equal to 5, add 1 to the nth significant digit. Example: x⇤ n x chopping rounding 3 123.456 123 123 4 123.456 123.4 123.5 5 123.456 123.45 123.46 ? Note: Chopping and rounding are applicable to any number system.

1.5. How are numbers stored in the computer? Normalised scientific notation. In the decimal system, any real number with a finite number of digits can be expressed in normalised scientific notation. This means that the decimal point is shifted and an appropriate power of 10 is supplied so that exactly one decimal digit, which must be non-zero, is before the decimal point. Examples:
(i) $732.5051 = 7.325051 \times 10^2$.
(ii) $0.005612 = 5.612 \times 10^{-3}$.

Generalisation to the number system with base $N$ gives the following normalised real number:

$\pm (a_1.a_2 a_3 \ldots a_m) \times N^e$, with $a_1 \neq 0$,

where the number $a_1.a_2 a_3 \ldots a_m$ is written with respect to the number system in base $N$. This is the normalised scientific notation in base $N$. The digits $a_1, a_2, a_3, \ldots, a_m$ are the significant digits, $e$ is called the exponent and $N$ is the base. The part $a_2 a_3 \ldots a_m$ is called the mantissa (or the fractional part).

Comment: In the literature you may also meet the form $(0.a_1 a_2 a_3 \ldots a_m) \times N^e$ with $a_1 \neq 0$ for the definition of the normalised scientific notation. This is a convention; the idea of the notation is that the number multiplying $N^e$ is of order 1.

Computer number representation. Computers use the binary system to store numbers. A bit is one digit in the binary system. Most modern computers use number representations from the IEEE 754-2008 standard arithmetic. Here we will discuss double precision (64-bit) numbers. A double precision 64-bit IEEE number $x$ is stored in the computer in the following form:

(1.4)  $x = (-1)^s \times (1.f)_2 \times 2^{e-1023},$

where $1.f$ is in binary form and 1023 is also written in binary form, i.e., $1023 = (1111111111)_2$. In (1.4),
• 1 bit is used for the sign $s$; $s = 0$ or $1$ (0 for $+$, 1 for $-$);
• 11 bits are used for the non-negative exponent $e$;
• 52 bits are used for the mantissa $f$.

Definition 1.1. All numbers of the above form (1.4) are called floating point numbers (or machine numbers).


• What is the largest floating point number, realmax? To answer this question, we first note that the largest value of the exponent $e$ is $(\underbrace{11111111111}_{11\ \text{bits}})_2 = 2047$, while the largest value of the mantissa $f$ is $(\underbrace{11\cdots1}_{52\ \text{bits}})_2$. Therefore,

realmax $= (1.\underbrace{11\cdots1}_{52\ \text{bits}})_2 \times 2^{2047-1023} = \left(1 + \frac{1}{2} + \frac{1}{2^2} + \frac{1}{2^3} + \cdots + \frac{1}{2^{52}}\right) \times 2^{1024} = (2 - 2^{-52})\,2^{1024} \approx 10^{308}.$

• What is the smallest positive floating point number, realmin? Proceeding similarly as before, we get

realmin $= (1.\underbrace{00\cdots0}_{52\ \text{bits}})_2 \times 2^{0-1023} \approx 10^{-308},$

since the smallest value for the mantissa $f$ is $(\underbrace{00\cdots0}_{52\ \text{bits}})_2$ and the smallest value for the exponent $e$ is $(\underbrace{00000000000}_{11\ \text{bits}})_2$.

Thus there is a finite range for the machine numbers $x$, namely, realmin $\le |x| \le$ realmax.

If the absolute value of a number is:
• larger than realmax, overflow occurs and the computer usually returns inf;
• smaller than realmin, underflow occurs and the computer usually returns zero.
⋆ NaN is used to describe things like inf/inf, inf − inf, etc. in the machine.

Machine epsilon, eps. Let us go back to the floating point representation (1.4) and consider only positive floating point numbers (i.e., $s = 0$) with exponent $e = 1023$ (so that $e - 1023 = 0$). Then the floating point number will be of the form $x = (1.f)_2$. The smallest and largest floating point numbers in this case are $x = 1$ and

$x = (1.\underbrace{11\cdots1}_{52\ \text{bits}})_2 = 2 - 2^{-52},$

respectively, i.e., the largest floating point number in this case is very close to 2. Consider the interval $[1, 2)$.
• How many different floating point numbers exist in the interval $[1, 2)$? Since each of the 52 bits in the mantissa $f$ can be 0 or 1, there exist $2^{52}$ different floating point numbers in $[1, 2)$.
• What is the distance between two successive floating point numbers in $[1, 2)$? The $2^{52}$ numbers are equidistributed in the interval $[1, 2)$ of length 1. Thus the distance between two successive floating point numbers in $[1, 2)$ is $2^{-52}$.

Definition 1.2. The number $2^{-52}$, which is the distance between two successive floating point numbers in $[1, 2)$, is called the machine epsilon and is denoted by eps.


Note: eps is the smallest floating point number for which $1 + \text{eps}$ is a floating point number not equal to 1.

Similarly, if we take positive floating point numbers with exponent $e = 1024$, i.e., of the form $x = (1.f)_2 \times 2$,

then the smallest and largest floating point numbers in this case are 2 and $(2 - 2^{-52}) \times 2$, respectively. There are again $2^{52}$ different floating point numbers equidistributed in the interval $[2, 4)$, and now the distance between two successive numbers is $\text{eps} \times 2 = 2^{-52} \times 2 = 2^{-51} > \text{eps}$. We can repeat the procedure for any positive floating point number, and in particular for any exponent $e$, $0 \le e \le 2047$; the distance between two successive numbers in the interval $[2^{e-1023}, 2^{e-1022})$ is $\text{eps} \times 2^{e-1023} = 2^{-52} \times 2^{e-1023}$, which is smaller than eps if $e - 1023 < 0$ and larger than eps if $e - 1023 > 0$. A sketch showing the form of the distribution of machine numbers with a 3-bit mantissa in $[\frac{1}{8}, 8)$ is displayed in Figure 3.

Figure 3. Sketch of the form of the distribution of machine numbers with a 3-bit mantissa in $[\frac{1}{8}, 8)$.
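The constants discussed above can be inspected directly; the following minimal Python sketch (not part of the notes) uses only the standard library:

```python
# A minimal sketch (not part of the notes) inspecting these constants with
# the standard library.
import math
import sys

print(sys.float_info.max)       # realmax, about 1.8e308
print(sys.float_info.min)       # smallest normalised positive double, about 2.2e-308
print(sys.float_info.epsilon)   # eps = 2**-52, about 2.22e-16

# The spacing between successive floats doubles from one binary interval to the next:
for x in (1.0, 2.0, 4.0):
    print(x, math.ulp(x))       # 2**-52, 2**-51, 2**-50 (Python 3.9+)

# NB: the exact values of realmax and realmin differ slightly from the
# simplified derivation above because IEEE 754 reserves the extreme
# exponents e = 0 and e = 2047 for subnormals, zero, inf and NaN.
```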

Further, when a number is stored in the computer, it is rounded to the closest machine number. If this number happens to lie in the interval $[1, 2)$, then the maximum absolute error that can occur is $\frac{\text{eps}}{2}$. Similarly, we can find the maximum absolute error that occurs when a number is stored in the computer for every binary interval of the form $[2^{e-1023}, 2^{e-1022})$. Thus we make errors when we store numbers in the computer. This is an example of round-off errors. Round-off errors occur because arithmetic performed in a machine can only use a finite number of digits.

1.6. Absolute and relative errors. Let $x^*$ be an approximation to $x$. Then
• $|x - x^*|$ is called the absolute error for $x$;
• $\frac{|x - x^*|}{|x|}$, with $x \neq 0$, is called the relative error for $x$.

Let $fl(x)$ be the floating point representation of $x$. Then the absolute and relative errors for $x$ are given by
$|x - fl(x)|$ and $\frac{|x - fl(x)|}{|x|},$
respectively.

Question: Is an absolute error of a number $x$ large or small?
Answer: It depends. If the number is less than or equal to 1, then an absolute error of 10 is huge;


however, if the number is $10^{300}$, then an absolute error of 10 is negligible. Note that in the latter case the relative error is $10^{-299}$. Thus, the relative error in the approximation of a number gives us a sense of the quality of the approximation.

Theorem 1.1. Let $x \in [\text{realmin}, \text{realmax}]$ and let $fl(x)$ be the floating point representation of $x$ obtained by rounding. Then

(1.5)  $\frac{|x - fl(x)|}{|x|} \le \frac{\text{eps}}{2}.$

Proof. We give the proof in the interval $[1, 2)$; for the general case, cf. Problem 3i of Tutorial 1. Let $x \in [1, 2)$. Then $|x - fl(x)| \le \frac{\text{eps}}{2}$. Also, since $x \in [1, 2)$, we have $1 \le |x| < 2$, so $\frac{1}{2} < \frac{1}{|x|} \le 1$. Thus

$\frac{|x - fl(x)|}{|x|} \le \frac{\text{eps}}{2} \times 1 = \frac{\text{eps}}{2}. \qquad \square$

Note: $\frac{|x - fl(x)|}{|x|} \le \frac{\text{eps}}{2}$ is equivalent to $fl(x) = x(1 + \varepsilon)$ for some $\varepsilon \in [-\frac{\text{eps}}{2}, \frac{\text{eps}}{2}]$. The representation of $fl(x)$ as $fl(x) = x(1 + \varepsilon)$ for some $\varepsilon \in [-\frac{\text{eps}}{2}, \frac{\text{eps}}{2}]$ is often more convenient when we want to work with floating point numbers.
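The bound (1.5) can be checked empirically. Here is a minimal Python sketch (not from the notes), using exact rational arithmetic for the reference value:

```python
# A minimal sketch (not from the notes) checking the bound (1.5) empirically,
# using exact rational arithmetic for the reference value.
from fractions import Fraction
import sys

eps = sys.float_info.epsilon

x_exact = Fraction(1, 3)     # the real number x = 1/3, exactly
x_stored = Fraction(1 / 3)   # fl(x): the nearest double, converted exactly

rel_err = abs(x_exact - x_stored) / x_exact
print(float(rel_err) <= eps / 2)   # True
```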

1.7. Computations in floating point arithmetic. We illustrate this with an example.

Example: Let $x, y \in [\text{realmin}, \text{realmax}]$. What is the worst-case relative error of addition on the computer?

Solution: Let $fl(x)$ and $fl(y)$ be the floating point representations of $x$ and $y$, respectively. Then

(1.6)  $fl(x) = x(1 + \varepsilon_1)$ for some $\varepsilon_1 \in [-\frac{\text{eps}}{2}, \frac{\text{eps}}{2}]$,
       $fl(y) = y(1 + \varepsilon_2)$ for some $\varepsilon_2 \in [-\frac{\text{eps}}{2}, \frac{\text{eps}}{2}]$.

Then $fl(fl(x) + fl(y))$ is the floating point representation of $x + y$. Therefore

$fl(fl(x) + fl(y)) = (fl(x) + fl(y))(1 + \varepsilon_3)$ for some $\varepsilon_3 \in [-\frac{\text{eps}}{2}, \frac{\text{eps}}{2}]$.

Using (1.6) and doing some basic computations on the above relation we get:

$fl(fl(x) + fl(y)) = (x(1 + \varepsilon_1) + y(1 + \varepsilon_2))(1 + \varepsilon_3) = x + \varepsilon_1 x + y + \varepsilon_2 y + \varepsilon_3 x + \varepsilon_3 \varepsilon_1 x + \varepsilon_3 y + \varepsilon_3 \varepsilon_2 y = (x + y) + (\varepsilon_1 + \varepsilon_3 + \varepsilon_3 \varepsilon_1)x + (\varepsilon_2 + \varepsilon_3 + \varepsilon_3 \varepsilon_2)y.$

Thus we can estimate the relative error as follows:

$\frac{|fl(fl(x) + fl(y)) - (x + y)|}{|x + y|} = \frac{|(\varepsilon_1 + \varepsilon_3 + \varepsilon_3 \varepsilon_1)x + (\varepsilon_2 + \varepsilon_3 + \varepsilon_3 \varepsilon_2)y|}{|x + y|} \le \frac{x|\varepsilon_1 + \varepsilon_3 + \varepsilon_3 \varepsilon_1| + y|\varepsilon_2 + \varepsilon_3 + \varepsilon_3 \varepsilon_2|}{x + y} \le \frac{x(|\varepsilon_1| + |\varepsilon_3| + |\varepsilon_3|\,|\varepsilon_1|) + y(|\varepsilon_2| + |\varepsilon_3| + |\varepsilon_3|\,|\varepsilon_2|)}{x + y},$


where in the last two inequalities we have used the triangle inequality and the fact that both $x$ and $y$ are positive. Therefore

$\frac{|fl(fl(x) + fl(y)) - (x + y)|}{|x + y|} \le \frac{x\left(\frac{\text{eps}}{2} + \frac{\text{eps}}{2} + \frac{\text{eps}^2}{4}\right) + y\left(\frac{\text{eps}}{2} + \frac{\text{eps}}{2} + \frac{\text{eps}^2}{4}\right)}{x + y} = \frac{(x + y)\left(\text{eps} + \frac{\text{eps}^2}{4}\right)}{x + y},$

from which we conclude that

$\frac{|fl(fl(x) + fl(y)) - (x + y)|}{|x + y|} \le \text{eps} + \frac{\text{eps}^2}{4}.$

Note: We have seen that we make errors not only in storing numbers in the computer, but also in basic arithmetic calculations (addition, subtraction,¹ multiplication, division). Because of this, basic number properties (e.g. the distributive and associative properties) may not hold exactly.

¹Subtraction can go really wrong! See the discussion below.

Example: Let $x = 1001$, $y = -1$ and $z = \frac{1}{1000}$. The distributive property gives

(1.7)  $z(x + y) = zx + zy.$

Set $w = z(x + y)$, $w_1 = zx$ and $w_2 = zy$. Assume that the arithmetic calculations for the computation of $w$, $w_1$ and $w_2$ are performed exactly, then we store $w$, $w_1$, $w_2$ and perform the addition $w_1 + w_2$ in the computer. Do we expect (1.7) to hold exactly?

Solution: Since $w = z(x + y) = 1$ and 1 is a machine number, we have $fl(w) = fl(1) = 1$. On the other hand, since $w_1 = zx = 1.001$ and $w_2 = zy = -0.001$, we get

$fl(fl(w_1) + fl(w_2)) = fl(1.001(1 + \varepsilon_1) - 0.001(1 + \varepsilon_2)) = (1.001 + \varepsilon_1 \times 1.001 - 0.001 - \varepsilon_2 \times 0.001) \times (1 + \varepsilon_3)$

for some $\varepsilon_1, \varepsilon_2, \varepsilon_3 \in [-\frac{\text{eps}}{2}, \frac{\text{eps}}{2}]$.

Therefore,

$fl(fl(w_1) + fl(w_2)) = 1 + \varepsilon_3 + \varepsilon_1 \times 1.001 - \varepsilon_2 \times 0.001 + \varepsilon_3(\varepsilon_1 \times 1.001 - \varepsilon_2 \times 0.001),$

which in general is not equal to 1, because $\varepsilon_i \neq 0$ for $i = 1, 2, 3$.

Catastrophic cancellation. Computations in the computer involving the subtraction of two nearly equal numbers can result in a considerable loss of accuracy due to cancellation.

Example: Let $x, y \in [\text{realmin}, \text{realmax}]$. What is the worst-case relative error of subtraction on the computer?

Solution: You can show that (cf. Problem 3 in Tutorial 1)

$\frac{|fl(fl(x) - fl(y)) - (x - y)|}{|x - y|} \le \left(\text{eps} + \frac{\text{eps}^2}{4}\right)\frac{|x + y|}{|x - y|}.$

Thus the subtraction of numbers that are very close to each other has the potential to destroy the numerical approximation and must be avoided. When we come across $x - y$ for $x \approx y$, we should always look for ways around it.
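Failures like the distributive example above are easy to observe in practice. A minimal Python sketch (not from the notes):

```python
# A minimal sketch (not from the notes): machine arithmetic need not respect
# the distributive or associative laws exactly.
x, y, z = 1001.0, -1.0, 1.0 / 1000.0
print(z * (x + y) == z * x + z * y)   # may well be False: extra roundings on the right

# A classic associativity failure in IEEE double precision:
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6
```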


For example, if we have two numbers $x, y$ with $x > 1$, $y > 1$ and $x \approx y$, we could use

$x - y = \frac{x^2 - y^2}{x + y} = \frac{x^3 - y^3}{x^2 + xy + y^2} = \cdots = \frac{x^n - y^n}{x^{n-1} + x^{n-2}y + \cdots + xy^{n-2} + y^{n-1}}.$

We expect that the above different ways of writing $x - y$ will produce stable formulas. E.g., if $x = 1.1$ and $y = 1.05$ then $x - y = 0.05$, while $x^{100} \approx 1.3781 \times 10^4$ and $y^{100} \approx 1.315 \times 10^2$, so $x^{100}$ and $y^{100}$ are no longer close to each other. A catastrophic cancellation occurred in the first example of the chapter, Archimedes' sequence (1.1). The stable formula (1.2) was produced by writing $x - y = \frac{x^2 - y^2}{x + y}$.
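A minimal Python sketch (not from the notes) of cancellation and of the rewriting trick that cures it, using the same identity $1 - \sqrt{1 - s^2} = \frac{s^2}{1 + \sqrt{1 - s^2}}$ that turns (1.1) into (1.2):

```python
# A minimal sketch (not from the notes) of catastrophic cancellation. For
# tiny s, evaluating 1 - sqrt(1 - s**2) directly cancels all significant
# digits; the rationalised form s**2 / (1 + sqrt(1 - s**2)) -- the same
# trick that turns (1.1) into (1.2) -- remains accurate.
import math

s = 1e-9
direct = 1.0 - math.sqrt(1.0 - s * s)             # 0.0: everything cancelled
rewritten = s * s / (1.0 + math.sqrt(1.0 - s * s))
print(direct, rewritten)                          # 0.0 vs about 5e-19 (the true value)
```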

1.8. Numerical Stability. A numerical algorithm in which the effect of the numerical errors of the input data on the final result is negligible is called stable.

Example. Consider the recursive formula

(1.8)  $I_n = \frac{1}{n} - a I_{n-1}, \qquad n = 1, 2, \ldots,$

with

$I_0 = \ln\left(\frac{a+1}{a}\right),$

for the computation of the integrals

$I_n = \int_0^1 \frac{x^n}{a + x}\,dx, \qquad n = 0, 1, \ldots,$

with $a > 1$. Let $I_0^*\,(= fl(I_0))$ be the machine number for $I_0$ and assume that this is the only computational error we make when using formula (1.8). What is the effect of this error on the computation of $I_n$ for $n \ge 1$?

Solution: Let $e_n = I_n - I_n^*$, $n = 0, 1, \ldots$, where $e_n$ is the error committed due to the non-exact evaluation of $I_0$. Since this is the only computational error, the computed values satisfy $I_n^* = \frac{1}{n} - a I_{n-1}^*$, and hence

$e_n = I_n - I_n^* = -a\,(I_{n-1} - I_{n-1}^*) = -a\,e_{n-1} = \cdots = (-a)^n e_0.$

Since $a > 1$, the initial error $e_0$ is amplified by a factor of $a$ at every step, so for large $n$ the computed $I_n^*$ has nothing to do with $I_n$: the recursion (1.8) is unstable.
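A minimal Python sketch (not from the notes) of this instability, for, say, $a = 5$:

```python
# A minimal sketch (not from the notes) of the instability of (1.8) for a = 5.
import math

a = 5.0
I = math.log((a + 1.0) / a)    # I_0, stored with a small rounding error e_0
for n in range(1, 31):
    I = 1.0 / n - a * I        # forward recursion (1.8)
    if n % 5 == 0:
        print(n, I)
# Every true I_n lies in (0, 1) and decreases slowly, but e_0 is multiplied
# by -a at each step; by n = 30 it has grown by a factor of a**30 ~ 9e20,
# so the printed values bear no relation to the integrals.
```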


Similar Free PDFs