
Advanced Mathematics for Engineers

Wolfgang Ertel
translated by Elias Drotleff and Richard Cubek

October 1, 2012

Preface
Since 2008 this mathematics lecture has been offered for the master courses computer science, mechatronics and electrical engineering. After a repetition of basic linear algebra, computer algebra and calculus, we will treat numerical calculus, statistics and function approximation, which are the most important basic mathematics topics for engineers. We also provide an introduction to Computer Algebra. Mathematica, Matlab and Octave are powerful tools for the exercises. Even though we favour the open source tool Octave, the student is free to choose any one of the three.

We are looking forward to interesting semesters with many motivated and eager students who want to climb up the steep, high and fascinating mountain of engineering mathematics together with us. I assure you that we will do our best to guide you through the sometimes wild, rough and challenging nature of mathematics. I also assure you that all your efforts and your endurance in working on the exercises during nights and weekends will pay off as good marks and, most importantly, as a lot of fun.

Even though we repeat some undergraduate linear algebra and calculus, the failure rate in the exams is very high, in particular among the foreign students. As a consequence, we strongly recommend all our students to repeat undergraduate linear algebra: operations on matrices such as the solution of linear systems, singularity of matrices, inversion, eigenvalue problems, and row, column and null spaces. You should also bring decent knowledge of one-dimensional and multidimensional calculus, e.g. differentiation and integration in one and many variables, convergence of sequences and series, and finding extrema of multivariate functions under constraints. Basic statistics is also required. To summarize: if you are not able to solve problems (not only know the terms) in these fields, you have very little chance of successfully finishing this course.

History of this Course


The first version of this script was created in the winter semester 95/96. I included only Numerics in this lecture, although I initially also wanted to cover Discrete Mathematics, which is very important for computer scientists. In a lecture of three semester hours per week, both topics can only be covered superficially. Therefore I decided, like my colleagues, to focus on Numerics; only then is it possible to impart profound knowledge. From Numerical Calculus, besides the basics, systems of linear equations, various interpolation methods, function approximation, and the solution of nonlinear equations will be presented. An excursion into applied research follows, where, e.g. in the field of benchmarking of microprocessors, mathematics (functional equations) directly influences the practice of computer scientists.

In summer 1998 a chapter about Statistics was added, because of the weak coverage at our university until then. In the winter semester 1999/2000, the layout and structure were improved and some mistakes were removed. In the context of the changes to the curriculum of Applied Computer Science in the summer semester 2002, statistics was shifted, because of its general relevance for all students, into the lecture Mathematics 2. Instead of statistics, contents specifically relevant for computer scientists should be included. The generation and verification of random numbers is such an important topic, and it is finally covered now. Since summer 2008 this lecture is only offered to Master (Computer Science) students. Therefore the chapter about random numbers was extended. Maybe other contents will be included in the lecture. For some topics original literature will be handed out; students then have to prepare the material by themselves.

For the winter semester 2010/11 the lecture has been completely revised and restructured, and some important sections have been added, such as radial basis functions, Gaussian processes, and statistics and probability. These changes became necessary with the step from Diploma to Master. I want to thank Markus Schneider and Haitham Bou Ammar, who helped me improve the lecture. From the winter semester 2010/11 on, the precourse is integrated into the lecture in order to give the students more time to work on the exercises. Thus, the volume of the lecture grows from 6 SWS to 8 SWS and we will now split it into two lectures of 4 SWS each. In the winter semester 2012/13 we go back to a one-semester schedule with 6 hours per week for computer science and mechatronics students. Electrical engineering students will only attend four hours, covering chapters one to six.

Wolfgang Ertel

Contents
1 Linear Algebra
  1.1 Video Lectures
  1.2 Exercises

2 Computer Algebra
  2.1 Symbol Processing on the Computer
  2.2 Short Introduction to Mathematica
  2.3 Gnuplot, a professional Plotting Software
  2.4 Short Introduction to MATLAB
  2.5 Short Introduction to GNU Octave
  2.6 Exercises

3 Calculus – Selected Topics
  3.1 Sequences and Convergence
  3.2 Series
  3.3 Continuity
  3.4 Taylor Series
  3.5 Differential Calculus in many Variables
  3.6 Exercises

4 Statistics and Probability Basics
  4.1 Recording Measurements in Samples
  4.2 Statistical Parameters
  4.3 Multidimensional Samples
  4.4 Probability Theory
  4.5 Discrete Distributions
  4.6 Continuous Distributions
  4.7 Exercises

5 Numerical Mathematics Fundamentals
  5.1 Arithmetics on the Computer
  5.2 Numerics of Linear Systems of Equations
  5.3 Roots of Nonlinear Equations
  5.4 Exercises

6 Function Approximation
  6.1 Polynomial Interpolation
  6.2 Spline Interpolation
  6.3 Method of Least Squares and Pseudoinverse
  6.4 Exercises

7 Statistics and Probability
  7.1 Random Numbers
  7.2 Calculation of Means - An Application for Functional Equations
  7.3 Exercises
  7.4 Principal Component Analysis (PCA)
  7.5 Estimators
  7.6 Gaussian Distributions
  7.7 Maximum Likelihood
  7.8 Linear Regression
  7.9 Exercises

8 Function Approximation
  8.1 Linear Regression – Summary
  8.2 Radial Basis Function Networks
  8.3 Clustering
  8.4 Singular Value Decomposition and the Pseudo-Inverse
  8.5 Exercises

9 Numerical Integration and Solution of Ordinary Differential Equations
  9.1 Numerical Integration
  9.2 Numerical Differentiation
  9.3 Numerical Solution of Ordinary Differential Equations
  9.4 Linear Differential Equations with Constant Coefficients
  9.5 Exercises

Bibliography

Chapter 1 Linear Algebra


1.1 Video Lectures

We use the excellent video lectures from G. Strang, the author of [1], available from: http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010. In particular we show the following lectures:

Lec 1  The geometry of linear equations (lecture 01)
Lec 2  Transposes, Permutations, Spaces Rn (lecture 05)
Lec 3  Column Space and Nullspace (lecture 06)
Lec 4  Solving Ax = 0: Pivot Variables, Special Solutions (lecture 07)
Lec 5  Independence, Basis, and Dimension (lecture 09)
Lec 6  The Four Fundamental Subspaces (lecture 10)
Lec 7  Orthogonal Vectors and Subspaces (lecture 14)
Lec 8  Properties of Determinants (lecture 18)
Lec 9  Determinant Formulas and Cofactors (lecture 19)
Lec 10 Cramer's rule, inverse matrix, and volume (lecture 20)
Lec 11 Eigenvalues and Eigenvectors (lecture 21)
Lec 12 Symmetric Matrices and Positive Definiteness (lecture 25)
Lec 13 Linear Transformations and Their Matrices (lecture 30)

1.2 Exercises

Exercise 1.1 Solve the nonsingular triangular system
u + v + w = b1    (1.1)
    v + w = b2    (1.2)
        w = b3    (1.3)

Show that your solution gives a combination of the columns that equals the column on the right.
Exercise 1.2 Explain why the system
u + v + w = 2      (1.4)
u + 2v + 3w = 1    (1.5)
    v + 2w = 0     (1.6)


is singular, by finding a combination of the three equations that adds up to 0 = 1. What value should replace the last zero on the right side to allow the equations to have solutions, and what is one of the solutions?

Inverses and Transposes


Exercise 1.3 Which properties of a matrix A are preserved by its inverse (assuming A⁻¹ exists)?
(1) A is triangular
(2) A is symmetric
(3) A is tridiagonal
(4) all entries are whole numbers
(5) all entries are fractions (including whole numbers like 3/1)

Exercise 1.4
a) How many entries can be chosen independently in a symmetric matrix of order n?
b) How many entries can be chosen independently in a skew-symmetric matrix of order n?

Permutations and Elimination


Exercise 1.5
a) Find a square 3 × 3 matrix P that, multiplied from the left to any 3 × m matrix A, exchanges rows 1 and 2.
b) Find a square n × n matrix P that, multiplied from the left to any n × m matrix A, exchanges rows i and j.
Exercise 1.6 A permutation is a bijective mapping from a finite set onto itself. Applied to vectors of length n, a permutation arbitrarily changes the order of the vector components. The word ANGSTBUDE is a permutation of BUNDESTAG. An example of a permutation on vectors of length 5 can be described by (3, 2, 1, 5, 4). This means component 3 moves to position 1, component 2 stays where it was, component 1 moves to position 3, component 5 moves to position 4 and component 4 moves to position 5.
a) Give a 5 × 5 matrix P that implements this permutation.
b) How can we get from a permutation matrix to its inverse?
Exercise 1.7
a) Find a 3 × 3 matrix E that, multiplied from the left to any 3 × m matrix A, adds 5 times row 2 to row 1.
b) Describe an n × n matrix E that, multiplied from the left to any n × m matrix A, adds k times row i to row j.
c) Based on the above answers, prove that the elimination process of a matrix can be realized by successive multiplication with matrices from the left.


Column Spaces and NullSpaces


Exercise 1.8 Which of the following subsets of R³ are actually subspaces?
a) The plane of vectors with first component b1 = 0.
b) The plane of vectors b with b1 = 1.
c) The vectors b with b1·b2 = 0 (this is the union of two subspaces, the plane b1 = 0 and the plane b2 = 0).
d) The solitary vector b = (0, 0, 0).
e) All combinations of two given vectors x = (1, 1, 0) and y = (2, 0, 1).
f) The vectors (b1, b2, b3) that satisfy b3 − b2 + 3b1 = 0.
Exercise 1.9 Let P be the plane in 3-space with equation x + 2y + z = 6. What is the equation of the plane P0 through the origin parallel to P? Are P and P0 subspaces of R³?
Exercise 1.10 Which descriptions are correct? The solutions x of
Ax = [1 1 1; 1 0 2] [x1; x2; x3] = [0; 0]    (1.7)
form a plane, line, point, subspace, nullspace of A, column space of A.

Ax = 0 and Pivot Variables


Exercise 1.11 For the matrix
A = [0 1 4 0; 0 2 8 0]    (1.8)
determine the echelon form U, the basic variables, the free variables, and the general solution to Ax = 0. Then apply elimination to Ax = b, with components b1 and b2 on the right side; find the conditions for Ax = b to be consistent (that is, to have a solution) and find the general solution in the same form as Equation (3). What is the rank of A?
Exercise 1.12 Write the general solution to
[1 2 2; 2 4 5] [u; v; w] = [1; 4]    (1.9)
as the sum of a particular solution to Ax = b and the general solution to Ax = 0, as in (3).
Exercise 1.13 Find the value of c which makes it possible to solve
u + v + 2w = 2      (1.10)
2u + 3v − w = 5     (1.11)
3u + 4v + w = c     (1.12)


Solving Ax = b
Exercise 1.14 Is it true that if v1, v2, v3 are linearly independent, the vectors w1 = v1 + v2, w2 = v1 + v3, w3 = v2 + v3 are also linearly independent? (Hint: Assume some combination c1w1 + c2w2 + c3w3 = 0, and find which ci are possible.)
Exercise 1.15 Find a counterexample to the following statement: If v1, v2, v3, v4 is a basis for the vector space R⁴, and if W is a subspace, then some subset of the v's is a basis for W.
Exercise 1.16 Suppose V is known to have dimension k. Prove that
a) any k independent vectors in V form a basis;
b) any k vectors that span V form a basis.

In other words, if the number of vectors is known to be right, either of the two properties of a basis implies the other. Exercise 1.17 Prove that if V and W are three-dimensional subspaces of R5 , then V and W must have a nonzero vector in common. Hint: Start with bases of the two subspaces, making six vectors in all.

The Four Fundamental Subspaces


Exercise 1.18 Find the dimension and construct a basis for the four subspaces associated with each of the matrices
A = [0 1 4 0; 0 2 8 0]  and  U = [0 1 4 0; 0 0 0 0]    (1.13)

Exercise 1.19 If the product of two matrices is the zero matrix, AB = 0, show that the column space of B is contained in the nullspace of A. (Also the row space of A is in the left nullspace of B, since each row of A multiplies B to give a zero row.)
Exercise 1.20 Explain why Ax = b is solvable if and only if rank A = rank A′, where A′ is formed from A by adding b as an extra column. Hint: The rank is the dimension of the column space; when does adding an extra column leave the dimension unchanged?
Exercise 1.21 Suppose A is an m by n matrix of rank r. Under what conditions on those numbers does
a) A have a two-sided inverse: AA⁻¹ = A⁻¹A = I?
b) Ax = b have infinitely many solutions for every b?
Exercise 1.22 If Ax = 0 has a nonzero solution, show that Aᵀy = f fails to be solvable for some right sides f. Construct an example of A and f.

Orthogonality
Exercise 1.23 In R³ find all vectors that are orthogonal to (1, 1, 1) and (1, -1, 0). Produce from these vectors a mutually orthogonal system of unit vectors (an orthogonal system) in R³.
Exercise 1.24 Show that x − y is orthogonal to x + y if and only if ‖x‖ = ‖y‖.
Exercise 1.25 Let P be the plane (not a subspace) in 3-space with equation x + 2y − z = 6. Find the equation of a plane P′ parallel to P but going through the origin. Find also a vector perpendicular to those planes. What matrix has the plane P′ as its nullspace, and what matrix has P′ as its row space?

Projections
Exercise 1.26 Suppose A is the 4 × 4 identity matrix with its last column removed, so A is 4 × 3. Project b = (1, 2, 3, 4) onto the column space of A. What shape is the projection matrix P and what is P?

Determinants
Exercise 1.27 How are det(2A), det(−A), and det(A²) related to det A, when A is n by n?

Exercise 1.28 Find the determinants of:
a) a rank one matrix
A = [1; 4; 2] · [2 −1 2]    (1.14)
b) the upper triangular matrix
U = [4 4 8 8; 0 1 2 2; 0 0 2 6; 0 0 0 2]    (1.15)
c) the lower triangular matrix Uᵀ;
d) the inverse matrix U⁻¹;
e) the reverse-triangular matrix that results from row exchanges,
M = [0 0 0 2; 0 0 2 6; 0 1 2 2; 4 4 8 8]    (1.16)

Exercise 1.29 If every row of A adds to zero, prove that det A = 0. If every row adds to 1, prove that det(A − I) = 0. Show by example that this does not imply det A = 1.


Properties of Determinants
Exercise 1.30 Suppose A_n is the n by n tridiagonal matrix with 1's everywhere on the three diagonals:
A_1 = [1],  A_2 = [1 1; 1 1],  A_3 = [1 1 0; 1 1 1; 0 1 1],  ...    (1.17)
Let D_n be the determinant of A_n; we want to find it.
a) Expand in cofactors along the first row of A_n to show that D_n = D_{n−1} − D_{n−2}.
b) Starting from D_1 = 1 and D_2 = 0, find D_3, D_4, ..., D_8. By noticing how these numbers cycle around (with what period?), find D_1000.
Exercise 1.31 Explain why a 5 by 5 matrix with a 3 by 3 zero submatrix is sure to be singular (regardless of the 16 nonzeros marked by x's): the determinant of
A = [x x x x x; x x x x x; 0 0 0 x x; 0 0 0 x x; 0 0 0 x x]    (1.18)
is zero.
Exercise 1.32 If A is m by n and B is n by m, show that
det [0 A; −B I] = det AB.   Hint: Postmultiply by [I 0; B I].    (1.19)

Do an example with m < n and an example with m > n. Why does the second example have det AB = 0?

Cramers rule
Exercise 1.33 The determinant is a linear function of column 1. It is zero if two columns are equal. When b = Ax = x1·a1 + x2·a2 + x3·a3 goes into the first column of A, the determinant of this matrix B1 is
|b a2 a3| = |x1 a1 + x2 a2 + x3 a3   a2   a3| = x1 |a1 a2 a3| = x1 det A

a) What formula for x1 comes from left side = right side? b) What steps lead to the middle equation?

Eigenvalues and Eigenvectors


Exercise 1.34 Suppose that λ is an eigenvalue of A, and x is its eigenvector: Ax = λx.
a) Show that this same x is an eigenvector of B = A − 7I, and find the eigenvalue.
b) Assuming λ ≠ 0, show that x is also an eigenvector of A⁻¹ and find the eigenvalue.

1.2 Exercises

Exercise 1.35 Show that the determinant equals the product of the eigenvalues by imagining that the characteristic polynomial is factored into
det(A − λI) = (λ1 − λ)(λ2 − λ)···(λn − λ)    (1.20)
and making a clever choice of λ.
Exercise 1.36 Show that the trace equals the sum of the eigenvalues, in two steps. First, find the coefficient of (−λ)^{n−1} on the right side of (1.20). Next, look for all the terms in
det(A − λI) = det [a11−λ  a12  ...  a1n;  a21  a22−λ  ...  a2n;  ...;  an1  an2  ...  ann−λ]    (1.21)
which involve (−λ)^{n−1}. Explain why they all come from the product down the main diagonal, and find the coefficient of (−λ)^{n−1} on the left side of (1.20). Compare.

Diagonalization of Matrices
Exercise 1.37 Factor the following matrices into SΛS⁻¹:
A = [1 1; 1 1]  and  A = [2 1; 0 0].    (1.22)

Exercise 1.38 Suppose A = uvᵀ is a column times a row (a rank-one matrix).
a) By multiplying A times u, show that u is an eigenvector. What is λ?
b) What are the other eigenvalues (and why)?
c) Compute trace(A) = vᵀu in two ways, from the sum on the diagonal and the sum of the λ's.
Exercise 1.39 If A is diagonalizable, show that the determinant of A = SΛS⁻¹ is the product of the eigenvalues.

Symmetric and Positive Semi-Definite Matrices


Exercise 1.40 If A = QΛQᵀ is symmetric positive definite, then R = Q√Λ Qᵀ is its symmetric positive definite square root. Why does R have real eigenvalues? Compute R and verify R² = A for
A = [2 1; 1 2]  and  A = [10 6; 6 10].    (1.23)

Exercise 1.41 If A is symmetric positive definite and C is nonsingular, prove that B = CᵀAC is also symmetric positive definite.
Exercise 1.42 If A is positive definite and a11 is increased, prove from cofactors that the determinant is increased. Show by example that this can fail if A is indefinite.


Linear Transformation
Exercise 1.43 Suppose a linear mapping T transforms (1, 1) to (2, 2) and (2, 0) to (0, 0). Find T(v) for:
(a) v = (2, 2)   (b) v = (3, 1)   (c) v = (−1, 1)   (d) v = (a, b)
Exercise 1.44 Suppose T is reflection across the 45° line, and S is reflection across the y axis. If v = (2, 1) then T(v) = (1, 2). Find S(T(v)) and T(S(v)). This shows that in general ST ≠ TS.
Exercise 1.45 Suppose we have two bases v1, ..., vn and w1, ..., wn for Rⁿ. If a vector has coefficients bi in one basis and ci in the other basis, what is the change of basis matrix in b = Mc? Start from
b1 v1 + ... + bn vn = V b = c1 w1 + ... + cn wn = W c.    (1.24)

Your answer represents T(v) = v with input basis of v's and output basis of w's. Because of the different bases, the matrix is not I.

Chapter 2 Computer Algebra


Definition 2.1 Computer Algebra = Symbol Processing + Numerics + Graphics

Definition 2.2 Symbol Processing is calculating with symbols (variables, constants, function symbols), as in mathematics lectures.

Advantages of Symbol Processing:
- often considerably less computational effort compared to numerics
- symbolic results (for further calculations); proofs in the strict manner are possible

Disadvantages of Symbol Processing:
- often there is no symbolic (closed form) solution; then numerics has to be applied, e.g. for the calculation of antiderivatives or for solving nonlinear equations like e^x = sin x

Example 2.1
1. symbolic:
lim_{x→∞} ( ln x / (x+1) )′ = ?   (asymptotic behavior)

( ln x / (x+1) )′ = ( (1/x)(x+1) − ln x ) / (x+1)² = 1/((x+1)x) − ln x/(x+1)²

For x → ∞:
1/((x+1)x) − ln x/(x+1)²  ≈  1/x² − ln x/x²  →  0

2. numeric:
lim_{x→∞} ( ln x / (x+1) )′ = ?
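For the numeric part, the limit can also be checked directly in GNU Octave (the open-source tool favoured in this course). The following is only a minimal sketch: the finite-difference step h and the sample points are arbitrary illustrative choices, not part of the lecture.

% Numerical check of the asymptotic behaviour of (ln x / (x+1))'
% using a central difference quotient; h and the sample points
% are free choices for illustration.
f = @(x) log(x) ./ (x + 1);
h = 1e-6;
for x = [10 100 1000 1e6]
  dfdx = (f(x + h) - f(x - h)) / (2*h);   % central difference
  printf("x = %10g   f'(x) approx %12.4e\n", x, dfdx);
end

The printed values decrease towards zero, in agreement with the symbolic result.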

Example 2.2 Numerical solution of x² = 5:
x² = 5  ⇒  x = 5/x  ⇒  x + x = x + 5/x  ⇒  2x = x + 5/x  ⇒  x = (1/2)(x + 5/x)
Iteration: x_{n+1} = (1/2)(x_n + 5/x_n)

n    x_n
0    2            (start value)
1    2.25
2    2.236111
3    2.23606798
4    2.23606798   (approximate solution)

√5 = 2.23606798 ± 10⁻⁸
2.1 Symbol Processing on the Computer

Example 2.3 Symbolic computing with natural numbers: calculation rules, i.e. axioms, are necessary, e.g. the Peano axioms:
∀ x, y, z:  x + y = y + x          (2.1)
            x + 0 = x              (2.2)
            (x + y) + z = x + (y + z)    (2.3)
Out of these rules, e.g. 0 + x = x can be deduced:
0 + x = x + 0 = x    (by (2.1) and (2.2))

Symbol processing is implemented on the computer by term rewriting.

Example 2.4 (Real Numbers) Chain rule for differentiation:
[f(g(x))]′ = f′(g(x)) g′(x)
sin(ln x + 2)′ = cos(ln x + 2) · (ln x + 2)′
Computer (pattern matching):
sin(Plus(ln x, 2))′ = cos(Plus(ln x, 2)) · Plus′(ln x, 2)
sin(Plus(ln x, 2))′ = cos(Plus(ln x, 2)) · Plus(ln x′, 2′)
sin(Plus(ln x, 2))′ = cos(Plus(ln x, 2)) · Plus(1/x, 0)
sin(Plus(ln x, 2))′ = cos(Plus(ln x, 2)) · 1/x
sin(Plus(ln x, 2))′ = cos(ln x + 2)/x

Effective systems:
- Mathematica (S. Wolfram & Co.)
- Maple (ETH Zurich + Univ. Waterloo, Canada)

2.2 Short Introduction to Mathematica


Resources:
- Library: Mathematica Handbook (Wolfram)
- Mathematica Documentation Online: http://reference.wolfram.com
- http://www.hs-weingarten.de/~ertel/vorlesungen/mae/links.html

2.2.0.1 Some examples as jump start

In[1]:= 3 + 2^3 Out[1]= 11 In[2]:= Sqrt[10] Out[2]= Sqrt[10] In[3]:= N[Sqrt[10]] Out[3]= 3.16228 In[4]:= N[Sqrt[10],60] Out[4]= 3.1622776601683793319988935444327185337195551393252168268575 In[5]:= Integrate[x^2 Sin[x]^2, x] 3 2 4 x - 6 x Cos[2 x] + 3 Sin[2 x] - 6 x Sin[2 x] Out[5]= -----------------------------------------------24 In[7]:= D[%, x] 2 2 12 x - 12 x Cos[2 x] Out[7]= ---------------------24

14
In[8]:= Simplify[%] 2 2 Out[8]= x Sin[x] In[9]:= Series[Exp[x], {x,0,6}] 2 3 4 5 6 x x x x x 7 Out[9]= 1 + x + -- + -- + -- + --- + --- + O[x] 2 6 24 120 720 In[10]:= Expand[(x + 2)^3 + ((x - 5)^2 (x + y)^2)^3]


2 3 6 7 8 9 Out[10]= 8 + 12 x + 6 x + x + 15625 x - 18750 x + 9375 x - 2500 x + 10 11 12 5 6 7 375 x - 30 x + x + 93750 x y - 112500 x y + 56250 x y 8 9 10 11 4 2 15000 x y + 2250 x y - 180 x y + 6 x y + 234375 x y 5 2 6 2 7 2 8 2 9 2 281250 x y + 140625 x y - 37500 x y + 5625 x y - 450 x y + 10 2 3 3 4 3 5 3 6 3 15 x y + 312500 x y - 375000 x y + 187500 x y - 50000 x y + 7 3 8 3 9 3 2 4 3 4 7500 x y - 600 x y + 20 x y + 234375 x y - 281250 x y + 4 4 5 4 6 4 7 4 8 4 140625 x y - 37500 x y + 5625 x y - 450 x y + 15 x y + 5 2 5 3 5 4 5 5 5 93750 x y - 112500 x y + 56250 x y - 15000 x y + 2250 x y 6 5 7 5 6 6 2 6 3 6 180 x y + 6 x y + 15625 y - 18750 x y + 9375 x y - 2500 x y + 4 6 5 6 6 6 375 x y - 30 x y + x y

>

>

>

>

>

>

>

>

>

In[11]:= Factor[%] 2 3 4 2 3 2 Out[11]= (2 + x + 25 x - 10 x + x + 50 x y - 20 x y + 2 x y + 25 y 2 2 2 2 3 4 5 6 10 x y + x y ) (4 + 4 x - 49 x - 5 x + 633 x - 501 x + 150 x 7 8 2 3 4 5 20 x + x - 100 x y - 10 x y + 2516 x y - 2002 x y + 600 x y 6 7 2 2 2 2 3 2 80 x y + 4 x y - 50 y - 5 x y + 3758 x y - 3001 x y + 4 2 5 2 6 2 3 2 3 3 3

>

>

>



> 900 x y - 120 x y + 6 x y + 2500 x y - 2000 x y + 600 x y -

15

>

4 3 5 3 4 4 2 4 3 4 4 4 80 x y + 4 x y + 625 y - 500 x y + 150 x y - 20 x y + x y )

In[12]:= InputForm[%7] Out[12]//InputForm= (12*x^2 - 12*x^2*Cos[2*x])/24 In[20]:= Plot[Sin[1/x], {x,0.01,Pi}] Out[20]= -GraphicsIn[42]:= Plot3D[x^2 + y^2, {x,-1,1}, {y,0,1}] Out[42]= -SurfaceGraphicsIn[43]:= f[x_,y_] := Sin[(x^2 + y^3)] / (x^2 + y^2) In[44]:= f[2,3] Sin[31] Out[44]= ------13 In[45]:= ContourPlot[x^2 + y^2, {x,-1,1}, {y,-1,1}] Out[45]= -SurfaceGraphicsIn[46]:= Plot3D[f[x,y], {x,-Pi,Pi}, {y,-Pi,Pi}, PlotPoints -> 30, PlotLabel -> "Sin[(x^2 + y^3)] / (x^2 + y^2)", PlotRange -> {-1,1}] Out[46]= -SurfaceGraphicsSin[(x^2 + y^3)] / (x^2 + y^2)

In[47]:= ContourPlot[f[x,y], {x,-2,2}, {y,-2,2}, PlotPoints -> 30, ContourSmoothing -> True, ContourShading -> False, PlotLabel -> "Sin[(x^2 + y^3)] / (x^2 + y^2)"] Out[47]= -ContourGraphics-

16
In[52]:= Table[x^2, {x, 1, 10}] Out[52]= {1, 4, 9, 16, 25, 36, 49, 64, 81, 100} In[53]:= Table[{n, n^2}, {n, 2, 20}]


Out[53]= {{2, 4}, {3, 9}, {4, 16}, {5, 25}, {6, 36}, {7, 49}, {8, 64}, > {9, 81}, {10, 100}, {11, 121}, {12, 144}, {13, 169}, {14, 196}, > {15, 225}, {16, 256}, {17, 289}, {18, 324}, {19, 361}, {20, 400}} In[54]:= Transpose[%] Out[54]= {{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, > 20}, {4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, > 289, 324, 361, 400}} In[60]:= ListPlot[Table[Random[]+Sin[x/10], {x,0,100}]] Out[60]= -Graphics-


In[61]:= x = Table[i, {i,1,6}] Out[61]= {1, 2, 3, 4, 5, 6} In[62]:= A = Table[i*j, {i,1,5}, {j,1,6}] Out[62]= {{1, 2, 3, 4, 5, 6}, {2, 4, 6, 8, 10, 12}, {3, 6, 9, 12, 15, 18}, > {4, 8, 12, 16, 20, 24}, {5, 10, 15, 20, 25, 30}} In[63]:= A.x Out[63]= {91, 182, 273, 364, 455} In[64]:= x.x Out[64]= 91 In[71]:= B = A.Transpose[A] Out[71]= {{91, 182, 273, 364, 455}, {182, 364, 546, 728, 910}, > {273, 546, 819, 1092, 1365}, {364, 728, 1092, 1456, 1820}, > {455, 910, 1365, 1820, 2275}} In[72]:= B - IdentityMatrix[5] Out[72]= {{90, 182, 273, 364, 455}, {182, 363, 546, 728, 910},



> > {273, 546, 818, 1092, 1365}, {364, 728, 1092, 1455, 1820}, {455, 910, 1365, 1820, 2274}}

17

% last command %n nth last command ?f help for function f ??f more help for f f[x_,y_] := x^2 * Cos[y] dene function f (x, y ) a = 5 assign a constant to variable a f = x^2 * Cos[y] assign an expression to variable f (f is only a placeholder for the expression, not a function!) D[f[x,y],x] derivative of f with respect to x Integrate[f[x,y],y] antiderivative of f with respect to x Simplify[expr] simplies an expression Expand[expr] expand an expression Solve[f[x]==g[x]] solves an equation ^C cancel InputForm[Expr] converts into mathematica input form A TeXForm[Expr] converts into the L TEXform FortranForm[Expr] converts into the Fortran form CForm[Expr] converts into the C form ReadList["daten.dat", {Number, Number}] reads 2-column table from le Table[f[n], {n, n_min, n_max}] generates a list f (nmin ), . . . , f (nmax ) Plot[f[x],{x,x_min,x_max}] generates a plot of f ListPlot[Liste] plots a list Plot3D[f[x,y],{x,x_min,x_max},{y,y_min,y_max}] generates a three-dim. plot of f ContourPlot[f[x,y],{x,x_min,x_max},{y,y_min,y_max}] generates a contour plot of f Display["Dateiname",%,"EPS"] write to the le in PostScript format

Table 2.2: Some important Mathematica commands

Example 2.5 (Calculation of Square Roots)


(*********** square root iterative **************)
sqrt[a_, accuracy_] :=
  Module[{x, xn, delta, n},
    For[{delta = 9999999; n = 1; x = a},
        delta > 10^(-accuracy),
        n++,
        xn = x;
        x = 1/2 (x + a/x);
        delta = Abs[x - xn];
        Print["n = ", n, "  x = ", N[x, 2*accuracy], "  delta = ", N[delta]];
    ];
    N[x, accuracy]
  ]

sqrt::usage = "sqrt[a,n] computes the square root of a to n digits."

Table[sqrt[i, 10], {i, 1, 20}]

18
(*********** square root recursive **************) x[n_,a_] := 1/2 (x[n-1,a] + a/x[n-1,a]) x[1,a_] := a


2.3 Gnuplot, a professional Plotting Software

Gnuplot is a powerful plotting program with a command line interface and a batch interface. Online documentation can be found on www.gnuplot.info.
On the command line we can input

plot [0:10] sin(x)

to obtain a graph of sin(x) on [0, 10].

Almost arbitrary customization of plots is possible via the batch interface. A simple batch file may contain the lines

set terminal postscript eps color enhanced 26
set label "{/Symbol a}=0.01, {/Symbol g}=5" at 0.5,2.2
set output "bucket3.eps"
plot [b=0.01:1] a=0.01, c= 5, (a-b-c)/(log(a) - log(b)) \
     title "({/Symbol a}-{/Symbol b}-{/Symbol g})/(ln{/Symbol a} - ln{/Symbol b})"
producing an EPS file with the graph.
[Figure: t_tot = (α−β−γ)/(ln α − ln β) plotted over β ∈ [0.01, 1] with α = 0.01, γ = 5.]

3-dimensional plotting is also possible, e.g. with the commands


set isosamples 50 splot [-pi:pi][-pi:pi] sin((x**2 + y**3) / (x**2 + y**2))
which produces the graph.
[Figure: 3D surface plot of sin((x² + y³)/(x² + y²)) over [−π, π] × [−π, π].]


2.4 Short Introduction to MATLAB

Effective systems:
- MATLAB & SIMULINK (MathWorks)

2.4.0.2 Some examples as jump start

Out(1)=3+2^3 ans = 11

Out(2)=sqrt(10) ans = 3.1623

Out(3)=vpa(sqrt(10),60) = 3.16227766016837933199889354443271853371955513932521682685750

syms x syms y y=x^2sin(x)^2 2 2 x sin(x) z=int(y,x) 2 2 3 x (- 1/2 cos(x) sin(x) + 1/2 x) - 1/2 x cos(x) + 1/4 cos(x) sin(x) + 1/4 x - 1/3 x Der=diff(z,x) 2 2 2 2 x (- 1/2 cos(x) sin(x) + 1/2 x) + x (1/2 sin(x) - 1/2 cos(x) + 1/2) 2 2 2 - 1/4 cos(x) + x cos(x) sin(x) - 1/4 sin(x) + 1/4 - x Simple=simplify(Der) 2 2 x sin(x) Series=Taylor(exp(x),6,x,0) 2 3 4 5 1 + x + 1/2 x + 1/6 x + 1/24 x + 1/120 x (x+2)^2+((x+5)^2(x+y)^2)^3 2 (x + 2) Exp_Pol=expand(Pol) 2 6 5 4 2 3 3 4 + 4 x + x + 15625 x + 93750 x y + 234375 x y + 312500 x y 2 4 5 11 10 2 9 3 + 234375 x y + 93750 x y + 6 x y + 15 x y + 20 x y 8 4 7 5 6 6 10 9 2 8 3 + 15 x y + 6 x y + x y - 180 x y - 450 x y - 600 x y 7 4 6 5 6 12 11 10 9 - 450 x y - 180 x y + 15625 y + x - 30 x + 375 x - 2500 x 6 + (x - 5) 6 (x + y)

>

>

>

20


>

8 7 5 6 9 8 2 7 3 + 9375 x - 18750 x - 30 x y+ 2250 x y + 5625 x y + 7500 x y 6 4 5 5 4 6 8 7 2 + 5625 x y + 2250 x y + 375 x y - 15000 x y - 37500 x y 6 3 5 4 4 5 3 6 7 - 50000 x y - 37500 x y - 15000 x y - 2500 x y + 56250 x y 6 2 5 3 4 4 3 5 + 140625 x y + 187500 x y + 140625 x y + 56250 x y 2 6 6 5 2 4 3 + 9375 x y - 112500 x y - 281250 x y - 375000 x y 3 - 281250 x 4 y 2 - 112500 x 5 y 6 - 18750 x y

>

>

>

>

>

t=0:0.01:pi plot(sin(1./t)) --Plot Mode--[X,Y]=meshgrid(-1:0.01:1,-1:0.01:1) Z=sin(X.^2+Y.^3)/(X.^2+Y.^2) surf(X,Y,Z)

x=1:1:10 y(1:10)=x.^2 y = [ 1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

A_1=[1 2 4; 5 6 100; -10.1 23 56] A_1 = 1.0000 5.0000 -10.1000

2.0000 6.0000 23.0000

4.0000 100.0000 56.0000

A_2=rand(3,4)


A_2 = 0.2859 0.5437 0.9848 A_2= 0.3077 0.3625 0.6685 0.5598 A_1.*A_2= 3.1780 43.5900 26.3095

0.7157 0.8390 0.4333

0.4706 0.5607 0.2691

0.7490 0.5039 0.6468

0.1387 0.7881 0.1335 0.3008

0.4756 0.7803 0.0216 0.9394

5.9925 94.5714 57.1630

5.0491 92.6770 58.7436

3.0975 29.3559 17.5258

[U L]=lu(A_1) U = -0.0990 -0.4950 1.0000 L = -10.1000 0 0

0.2460 1.0000 0

1.0000 0 0

23.0000 17.3861 0

56.0000 127.7228 -21.8770

[Q R]=qr(A_1) Q = -0.0884 -0.4419 0.8927 R = -11.3142 0 0 b=[1;2;3] x=A_1\b b = 1 2 3 x = 0.3842 0.3481 -0.0201 A_3=[1 2 3; -1 0 5; 8 9 23] A_3 = 1 -1

-0.2230 -0.8647 -0.4501

0.9708 -0.2388 -0.0221

17.7035 5.4445 -15.9871 -112.5668 0 -21.2384

2 0

3 5

22
8 9 23


Inverse=inv(A_3) Inverse = -0.8333 1.1667 -0.1667

-0.3519 -0.0185 0.1296

0.1852 -0.1481 0.0370

Example 2.6 (Calculation of Square Roots)


% *********** square root, iterative **************
function [b] = calculate_Sqrt(a, accuracy)
  clc;
  x = a; delta = inf; n = 0;
  while delta >= 10^(-accuracy)
    n = n + 1;
    Res(n) = x;           % store the iteration history
    xn = x;
    x = 0.5*(x + a/x);    % Heron iteration step
    delta = abs(x - xn);
  end
  b = Res;
end

2.5 Short Introduction to GNU Octave

From the Octave homepage: GNU Octave is a high-level interpreted language, primarily intended for numerical computations. It provides capabilities for the numerical solution of linear and nonlinear problems, and for performing other numerical experiments. It also provides extensive graphics capabilities for data visualization and manipulation. Octave is normally used through its interactive command line interface, but it can also be used to write non-interactive programs. The Octave language is quite similar to Matlab so that most programs are easily portable.

Downloads, Docs, FAQ, etc.:


http://www.gnu.org/software/octave/

Nice Introduction/Overview:
http://math.jacobs-university.de/oliver/teaching/iub/resources/octave/octave-intro/octaveintro.html

Plotting in Octave:
http://www.gnu.org/software/octave/doc/interpreter/Plotting.html



// -> comments

23

BASICS ====== octave:47> 1 + 1 ans = 2 octave:48> x = 2 * 3 x = 6 // suppress output octave:49> x = 2 * 3; octave:50> // help octave:53> help sin sin is a built-in function -- Mapping Function: sin (X) Compute the sine for each element of X in radians. ...

VECTORS AND MATRICES ==================== // define 2x2 matrix octave:1> A = [1 2; 3 4] A = 1 2 3 4 // define 3x3 matrix octave:3> A = [1 2 3; 4 5 6; 7 8 9] A = 1 2 3 4 5 6 7 8 9 // access single elements octave:4> x = A(2,1) x = 4 octave:17> A(3,3) = 17 A = 1 2 3 4 5 6 7 8 17 // extract submatrices octave:8> A A = 1 2 3

24
4 7 5 8 6 17


octave:9> B = A(1:2,2:3) B = 2 3 5 6 octave:36> b=A(1:3,2) b = 2 5 8 // transpose octave:25> A ans = 1 4 7 2 5 8 3 6 17 // determinant octave:26> det(A) ans = -24.000 // solve Ax = b // inverse octave:22> inv(A) ans = -1.54167 0.41667 1.08333 0.16667 0.12500 -0.25000 // define vector b octave:27> b = [3 7 12] b = 3 7 12 // solution x octave:29> x = inv(A) * b x = -0.20833 1.41667 0.12500 octave:30> A * x ans = 3.0000 7.0000 12.0000

0.12500 -0.25000 0.12500


// try A\b
// illegal operation
octave:31> x * b
error: operator *: nonconformant arguments (op1 is 3x1, op2 is 3x1)
// therefore allowed
octave:31> x' * b
ans = 10.792
octave:32> x * b'
ans =
  -0.62500  -1.45833  -2.50000
   4.25000   9.91667  17.00000
   0.37500   0.87500   1.50000

// elementwise operations octave:11> a = [1 2 3] a = 1 2 3 octave:10> b = [4 5 6] b = 4 5 6 octave:12> a*b error: operator *: nonconformant arguments (op1 is 1x3, op2 is 1x3) octave:12> a.*b ans = 4 10 18 octave:23> A = [1 2;3 4] A = 1 2 3 4 octave:24> A^2 ans = 7 10 15 22 octave:25> A.^2 ans = 1 4 9 16 // create special vectors/matrices octave:52> x = [0:1:5] x = 0 1 2 3 4 5 octave:53> A = zeros(2) A = 0 0

26
0 0


octave:54> A = zeros(2,3) A = 0 0 0 0 0 0 octave:55> A = ones(2,3) A = 1 1 1 1 1 1 octave:56> A = eye(4) A = Diagonal Matrix 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 octave:57> B = A * 5 B = Diagonal Matrix 5 0 0 0 0 5 0 0 0 0 5 0 0 0 0 5 // vector/matrix size octave:43> size(A) ans = 3 3 octave:44> size(b) ans = 3 1 octave:45> size(b)(1) ans = 3

PLOTTING (2D) ============ octave:35> octave:36> octave:37> octave:38> octave:39> x = [-2*pi:0.1:2*pi]; y = sin(x); plot(x,y) z = cos(x); plot(x,z)

// two curves in one plot octave:40> plot(x,y) octave:41> hold on octave:42> plot(x,z) // reset plots



octave:50> close all // plot different styles octave:76> plot(x,z,r) octave:77> plot(x,z,rx) octave:78> plot(x,z,go) octave:89> close all // manipulate plot octave:90> hold on octave:91> x = [-pi:0.01:pi]; // another linewidth octave:92> plot(x,sin(x),linewidth,2) octave:93> plot(x,cos(x),r,linewidth,2) // define axes range and aspect ratio octave:94> axis([-pi,pi,-1,1], equal) -> try square or normal instead of equal (help axis) // legend octave:95> legend(sin,cos) // set parameters (gca = get current axis) octave:99> set(gca,keypos, 2) // legend position (1-4) octave:103> set(gca,xgrid,on) // show grid in x octave:104> set(gca,ygrid,on) // show grid in y // title/labels octave:102> title(OCTAVE DEMO PLOT) octave:100> xlabel(unit circle) octave:101> ylabel(trigon. functions) // store as png octave:105> print -dpng demo_plot.png

27

DEFINE FUNCTIONS

28
================ sigmoid.m: --function S = sigmoid(X) mn = size(X); S = zeros(mn); for i = 1:mn(1) for j = 1:mn(2) S(i,j) = 1 / (1 + e ^ -X(i,j)); end end end --easier: --function S = sigmoid(X) S = 1 ./ (1 .+ e .^ (-X)); end ---


octave:1> sig + [TAB] sigmoid sigmoid.m octave:1> sigmoid(10) ans = 0.99995 octave:2> sigmoid([1 10]) error: for x^A, A must be square // (if not yet implemented elementwise) error: called from: error: /home/richard/faculty/adv_math/octave/sigmoid.m at line 3, column 4 ... octave:2> sigmoid([1 10]) ans = 0.73106 0.99995 octave:3> x = [-10:0.01:10]; octave:5> plot(x,sigmoid(x),linewidth,3);

PLOTTING (3D)
=============
// meshgrid
octave:54> [X,Y] = meshgrid([1:3],[1:3])
X =
   1   2   3
   1   2   3
   1   2   3
Y =
   1   1   1
   2   2   2
   3   3   3

// meshgrid with higher resolution (suppress output) octave:15> [X,Y] = meshgrid([-4:0.2:4],[-4:0.2:4]);



// function over x and y, remember that cos and sin // operate on each element, result is matrix again octave:20> Z = cos(X) + sin(1.5*Y); // plot octave:21> mesh(X,Y,Z) octave:22> surf(X,Y,Z)

29

octave:44> contour(X,Y,Z) octave:45> colorbar octave:46> pcolor(X,Y,Z)

RANDOM NUMBERS / HISTOGRAMS =========================== // equally distributed random numbers octave:4> x=rand(1,5) x = 0.71696 0.95553

0.17808

0.82110

0.25843

octave:5> x=rand(1,1000); octave:6> hist(x); // normally distributed random numbers octave:5> x=randn(1,1000); octave:6> hist(x);

30
// try octave:5> x=randn(1,10000); octave:6> hist(x, 25);


2.6 Exercises

Mathematica
Exercise 2.1 Program the factorial function with Mathematica.
a) Write an iterative program that calculates the formula n! = n·(n−1)·...·1.
b) Write a recursive program that calculates the formula
n! = n·(n−1)!  if n > 1,   and   n! = 1  if n = 1,

analogously to the root example in the script.
Exercise 2.2
a) Write a Mathematica program that multiplies two arbitrary matrices. Don't forget to check the dimensions of the two matrices before multiplying. The formula is
C_ij = Σ_{k=1}^{n} A_ik B_kj.
Try to use the functions Table, Sum and Length only.
b) Write a Mathematica program that computes the transpose of a matrix using the Table function.
c) Write a Mathematica program that computes the inverse of a matrix using the function LinearSolve.

MATLAB
Exercise 2.3
a) For a finite geometric series we have the formula Σ_{i=0}^{n} qⁱ = (1 − q^{n+1})/(1 − q). Write a MATLAB function that takes q and n as inputs and returns the sum.
b) For an infinite geometric series we have the formula Σ_{i=0}^{∞} qⁱ = 1/(1 − q) if the series converges. Write a MATLAB function that takes q as input and returns the sum. Your function should produce an error if the series diverges.


Exercise 2.4
a) Create a 5 × 10 random matrix A.
b) Compute the mean of each column and assign the results to elements of a vector called avg.
c) Compute the standard deviation of each column and assign the results to the elements of a vector called s.
Exercise 2.5 Given the row vectors x = [4, 1, 6, 10, 4, 12, 0.1] and y = [1, 4, 3, 10, 9, 15, 2.1], compute the following arrays:
a) a_ij = x_i y_j
b) b_ij = x_i / y_j
c) c_i = x_i y_i, then add up the elements of c using two different programming approaches.
d) d_ij = x_i / (2 + x_i + y_j)
e) Arrange the elements of x and y in ascending order and calculate e_ij being the reciprocal of the lesser of x_i and y_j.
f) Reverse the order of elements in x and y in one command.
Exercise 2.6 Write a MATLAB function that calculates recursively the square root of a number.

Analysis Repetition
Exercise 2.7 In a bucket with capacity v there is a poisonous liquid with volume αv. The bucket has to be cleaned by repeatedly diluting the liquid with a fixed amount (β − α)v (0 < α < β ≤ 1) of water and then emptying the bucket. After emptying, the bucket always keeps the volume αv of its liquid. Cleaning stops when the concentration c_n of the poison after n iterations is reduced from 1 to c_n < ε, with ε > 0.
a) Assume α = 0.01, β = 1 and ε = 10⁻⁹. Compute the number of cleaning iterations.
b) Compute the total volume of water required for cleaning.
c) Can the total volume be reduced by reducing β? If so, determine the optimal β.
d) Give a formula for the time required for cleaning the bucket.
e) How can the time for cleaning the bucket be minimized?

Chapter 3 Calculus Selected Topics

3.1 Sequences and Convergence

Definition 3.1 A function N → R, n ↦ a_n is called a sequence. Notation: (a_n)_{n∈N} or (a_1, a_2, a_3, ...)
Example 3.1
(1, 2, 3, 4, ...) = (n)_{n∈N}
(1, 1/2, 1/3, 1/4, ...) = (1/n)_{n∈N}
(1, 2, 4, 8, 16, ...) = (2^{n−1})_{n∈N}
Consider the following sequences:
1. 1, 2, 3, 5, 7, 11, 13, 17, 19, 23, ...
2. 1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, ...
3. 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
4. 8, 9, 1, -8, -10, -3, 6, 9, 4, -6, -10
5. 1, 2, 3, 4, 6, 7, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, ...
6. 1, 3, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 31, 33, 35, 37, 38, 39, 41, 43, ...
Find the next 5 elements of each sequence. If you do not get ahead or want to solve other riddles additionally, have a look at http://www.oeis.org.


Definition 3.2 (a_n)_{n∈N} is called bounded, if there are A, B ∈ R with A ≤ a_n ≤ B for all n.
(a_n)_{n∈N} is called monotonically increasing (decreasing), iff a_{n+1} ≥ a_n (a_{n+1} ≤ a_n) for all n.

Definition 3.3 A sequence of real numbers (a_n)_{n∈N} converges to a ∈ R, iff:
∀ε > 0 ∃N(ε) ∈ N, so that |a_n − a| < ε  for all n ≥ N(ε).
Notation: lim_{n→∞} a_n = a
[Figure: from the index N(ε) on, all elements a_n lie within the ε-band around a.]

Definition 3.4 A sequence is called divergent if it is not convergent.

Example 3.2
1.) (1, 1/2, 1/3, ...) converges to 0 (zero sequence)
2.) (1, 1, 1, ...) converges to 1
3.) (1, −1, 1, −1, ...) is divergent
4.) (1, 2, 3, ...) is divergent

Theorem 3.1 Every convergent sequence is bounded.
Proof: for ε = 1 there is N(1); the first N(1) terms are bounded, and the rest are bounded through a ± 1.
Note: Not every bounded sequence converges! (see exercise 3), but:
Theorem 3.2 Every bounded monotonic sequence is convergent.

3.1.1 Sequences and Limits


Let (a_n), (b_n) be two convergent sequences with lim_{n→∞} a_n = a and lim_{n→∞} b_n = b. Then it holds:
lim_{n→∞} (a_n ± b_n) = lim_{n→∞} a_n ± lim_{n→∞} b_n = a ± b
lim_{n→∞} (c·a_n) = c·lim_{n→∞} a_n = c·a
lim_{n→∞} (a_n·b_n) = a·b
lim_{n→∞} (a_n/b_n) = a/b,  if b ≠ 0 and b_n ≠ 0 for all n ∈ N

Example 3.3 Show that the sequence a_n = (1 + 1/n)^n, n ∈ N, converges:

n     1     2      3      4      10     100     1000    10000
a_n   2     2.25   2.37   2.44   2.59   2.705   2.717   2.7181

The numbers (only) suggest that the sequence converges.
1. Boundedness: for all n, a_n > 0 and
a_n = (1 + 1/n)^n
    = 1 + n·(1/n) + [n(n−1)/2]·(1/n²) + [n(n−1)(n−2)/(2·3)]·(1/n³) + ... + 1/nⁿ
    = 1 + 1 + (1/2)(1 − 1/n) + (1/(2·3))(1 − 1/n)(1 − 2/n) + ... + (1/n!)(1 − 1/n)(1 − 2/n)···(1 − (n−1)/n)
    < 1 + 1 + 1/2 + 1/(2·3) + ... + 1/n!
    ≤ 1 + 1 + 1/2 + 1/4 + 1/8 + ... + 1/2^{n−1}
    < 1 + 1 + 1/2 + 1/4 + 1/8 + ...
    = 1 + 1/(1 − 1/2) = 3

2. Monotony: Replacing n by n + 1 in (1.) gives a_n < a_{n+1}, since in the second line of the expansion most summands of a_{n+1} are bigger!

The limit of this sequence is the Euler number:
e := lim_{n→∞} (1 + 1/n)^n = 2.718281828...
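The slow convergence can be reproduced numerically. The following is a minimal GNU Octave sketch (Octave is the tool favoured in this course); the list of n values is an arbitrary illustrative choice.

% The sequence a_n = (1 + 1/n)^n approaches Euler's number e
for n = [1 2 3 4 10 100 1000 10000]
  printf("n = %6d   a_n = %.8f\n", n, (1 + 1/n)^n);
end
printf("e          = %.8f\n", e);   % Octave's built-in constant e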

3.2 Series

Definition 3.5 Let (a_n)_{n∈N} be a sequence of real numbers. The sequence
s_n := Σ_{k=0}^{n} a_k,   n ∈ N,
of the partial sums is called an (infinite) series and is denoted by Σ_{k=0}^{∞} a_k.
If (s_n)_{n∈N} converges, we define
Σ_{k=0}^{∞} a_k := lim_{n→∞} Σ_{k=0}^{n} a_k.

Example 3.4

n                0    1    2    3    4     5     6     7     8     9     10    ...
Sequence a_n     0    1    2    3    4     5     6     7     8     9     10    ...
Series s_n       0    1    3    6    10    15    21    28    36    45    55    ...

n                0    1     2     3      4      5      6       7        8        9         10
Sequence a_n     1    1/2   1/4   1/8    1/16   1/32   1/64    1/128    1/256    1/512     1/1024
Series s_n       1    3/2   7/4   15/8   31/16  63/32  127/64  255/128  511/256  1023/512  2047/1024
s_n (decimal)    1    1.5   1.75  1.875  1.938  1.969  1.984   1.992    1.996    1.998     1.999

3.2.1 Convergence criteria for series

Theorem 3.3 (Cauchy) The series Σ_{n=0}^{∞} a_n converges iff
∀ε > 0 ∃N ∈ N:  |Σ_{k=m}^{n} a_k| < ε  for all n ≥ m ≥ N.
Proof: Let s_p := Σ_{k=0}^{p} a_k. Then s_n − s_{m−1} = Σ_{k=m}^{n} a_k. Therefore (s_n)_{n∈N} is a Cauchy sequence iff (s_n) is convergent.

Theorem 3.4 A series Σ a_k with a_k > 0 for k ≥ 1 converges iff the sequence of partial sums is bounded.
Proof: as exercise.

Theorem 3.5 (Comparison test) Let Σ_{n=0}^{∞} c_n be a convergent series with c_n ≥ 0 for all n, and let (a_n)_{n∈N} be a sequence with |a_n| ≤ c_n for all n ∈ N. Then Σ_{n=0}^{∞} a_n converges.

Theorem 3.6 (Ratio test) Let Σ_{n=0}^{∞} a_n be a series with a_n ≠ 0 for all n ≥ n_0. If a real number q with 0 < q < 1 exists such that
|a_{n+1}/a_n| ≤ q  for all n ≥ n_0,
then the series Σ_{n=0}^{∞} a_n converges. If, from an index n_0 on, |a_{n+1}/a_n| ≥ 1, then the series is divergent.

Proof idea (for the first part): Show that Σ_{n=0}^{∞} |a_0| qⁿ is a majorant.

Example 3.5 Σ_{n=0}^{∞} n²/2ⁿ converges.
Proof:
|a_{n+1}/a_n| = ((n+1)²·2ⁿ)/(2^{n+1}·n²) = (1/2)(1 + 1/n)² ≤ (1/2)(1 + 1/3)² = 8/9 < 1   for n ≥ 3.
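The convergence guaranteed by the ratio test can also be observed numerically: the partial sums stabilize after a few dozen terms. A minimal GNU Octave sketch; the cut-off N = 60 is an arbitrary choice.

% Partial sums of sum_{n=0}^{N} n^2 / 2^n
n = 0:60;
partial_sum = sum(n.^2 ./ 2.^n);
printf("partial sum up to N = 60: %.12f\n", partial_sum);
% the infinite series actually sums to 6, a fact not needed for the convergence proof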

3.2.2 Power series

Theorem 3.7 and definition For each x ∈ R the power series
exp(x) := Σ_{n=0}^{∞} xⁿ/n!
is convergent.
Proof: The ratio test gives
|a_{n+1}/a_n| = |x^{n+1} n! / ((n+1)! xⁿ)| = |x|/(n+1) ≤ 1/2   for n ≥ 2|x| − 1.

Definition 3.6 Euler's number e := exp(1) = Σ_{n=0}^{∞} 1/n!.
The function exp : R → R₊, x ↦ exp(x) is called the exponential function.


Theorem 3.8 (Remainder)
exp(x) = Σ_{n=0}^{N} xⁿ/n! + R_N(x)    (N-th approximation)
with
|R_N(x)| ≤ 2·|x|^{N+1}/(N+1)!   for |x| ≤ 1 + N/2, i.e. N ≥ 2(|x| − 1).

3.2.2.1 Practical computation of exp(x):

Σ_{n=0}^{N} xⁿ/n! = 1 + x + x²/2 + ... + x^{N−1}/(N−1)! + x^N/N!
                  = 1 + x(1 + x/2 (1 + ... + x/(N−1) (1 + x/N)) ...)

e ≈ 1 + 1(1 + 1/2 (... (1 + 1/(N−1) (1 + 1/N)) ...)) + R_N   with R_N ≤ 2/(N+1)!

For N = 15:  |R_15| ≤ 2/16! < 10⁻¹³,  e = 2.718281828459 ± 2·10⁻¹²  (rounding error 5 times 10⁻¹³!)
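The nested (Horner-like) evaluation above translates directly into a few lines of GNU Octave. This is only a sketch of the scheme described in the text; the function name exp_taylor and the test values of x are arbitrary choices for illustration.

1;  % script file marker so that Octave accepts the function definition below

% Evaluate the N-th Taylor approximation of exp(x) by the nested scheme
% 1 + x(1 + x/2 (1 + x/3 (... (1 + x/N) ...)))
function y = exp_taylor(x, N)
  y = 1;
  for k = N:-1:1
    y = 1 + x/k * y;     % work from the innermost factor outwards
  end
end

% quick check against the built-in exponential function
for x = [1 -1 3]
  printf("x = %4.1f   nested Taylor: %.12f   exp: %.12f\n", x, exp_taylor(x, 15), exp(x));
end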

Theorem 3.9 (Functional equation of the exponential function) For all x, y ∈ R it holds that exp(x + y) = exp(x)·exp(y).
Proof: The proof of this theorem is via the series representation (Definition 3.6). It is not easy, because it requires another theorem about the product of series (not covered here).
Conclusions:
a) ∀x ∈ R: exp(−x) = (exp(x))⁻¹ = 1/exp(x)
b) ∀x ∈ R: exp(x) > 0
c) ∀n ∈ Z: exp(n) = eⁿ
Notation: Also for real numbers x ∈ R: e^x := exp(x).
Proof:
a) exp(x)·exp(−x) = exp(x − x) = exp(0) = 1 ⇒ exp(−x) = 1/exp(x), and exp(x) ≠ 0.
b) Case 1, x ≥ 0: exp(x) = 1 + x + x²/2 + ... ≥ 1 > 0. Case 2, x < 0: −x > 0 ⇒ exp(−x) > 0 ⇒ exp(x) = 1/exp(−x) > 0.
c) Induction: exp(1) = e; exp(n) = exp(n − 1 + 1) = exp(n − 1)·e = e^{n−1}·e = eⁿ.
Note: for large x := n + h with n ∈ N, exp(x) = exp(n + h) = eⁿ·exp(h) (for large x this is faster than the series expansion).

3.3 Continuity

Functions are characterized among others in terms of smoothness. The weakest form of smoothness is the continuity.

Definition 3.7 Let D ⊂ R, f : D → R a function and a ∈ R. We write
lim_{x→a} f(x) = C,
if for each sequence (x_n)_{n∈N}, (x_n) ⊂ D, with lim_{n→∞} x_n = a it holds that
lim_{n→∞} f(x_n) = C.
[Figure: graph of f; the function values f(x_1), f(x_2), ... approach C as the arguments x_1, x_2, ... approach a.]

Denition 3.8 For x R the expression x denotes the unique integer number n with n x < n + 1. Example 3.6 1. lim exp(x) = 1
x0

2. lim x does not exist! x1 left-side limit = right-side limit


4 3 2 1

11 00 00 11 00 11 00 11

11 00 00 11 00 11 . 00 11
1 2 3 4

3. Let f : R R polynomial of the form f (x) = xk + a1 xk1 + . . . + ak1 x + ak , Then it holds: lim f (x) =
x

k 1.

and Proof: for x = 0

lim f (x) =

, if , if

k k

even odd

f (x) = xk (1 +

a1 a2 ak + 2 + ... + k) x x x
=:g (x)

3.3 Continuity since lim g (x) = 0, it follows lim f (x) = lim xk = .


x x x

39

Application: The asymptotic behavior for x of polynomials is always determinated by the highest power in x. Denition 3.9 (Continuity) Let f : D R a function and a D. The function f is called continuous at point a, if
xa

lim f (x) = f (a).

f is called continuous in D, if f is continuous at every point of D.


f(x) f(a )

. . .

f(x 2 ) f(x 1 )

For the depicted function it holds lim f (x) = a. f is discontinuous at the x point a.

x1

x2

......

Example 3.7 1.) f : x c (constant function) is continuous on whole R. 2.) The exponential function is continuous on whole R. 3.) The identity function f : x x is continuous on whole R. Theorem 3.10 Let f, g : D R functions, that are at a D continuous and let r R. f Then the functions f + g, rf, f g at point a are continuous, too. If g (a) = 0, then is g continuous at a. Proof: Let (xn ) a sequence with (xn ) D and lim xn = a. n to show : lim (f + g )(xn ) = (f + g )(a) n lim (rf )(xn ) = (rf )(a) n lim (f g )(xn ) = (f g )(a) holds because of rules f or sequences. n f f = ( g )(a) lim ( )(xn ) n g Denition 3.10 Let A, B, C subsets of R with the functions f : A B and g : B C . Then g f : A C, x g (f (x)) is called the composition of f and g . 1.) f g ( x) = sin(x) = Example 3.8 2.) 3.) sin ( x) = f (g (x)) sin(x) sin( x)

40


Theorem 3.11 Let f : A B continuous at a A and g : A C continuous at y = f (a). Then the composition g f is continuous in a, too. Proof: to show: lim xn = a
n

continuity of f

lim f (xn ) = f (a)

continuity of g

lim g (f (xn )) = g (f (a)).

x Example 3.9 2 is continuous on whole R, because f (x) = x2 , g (x) = f (x) + a x +a x h(x) = are continuous. g ( x) Theorem 3.12 ( Denition of Continuity) A function f : D R is continuous at x0 D i: > 0 > 0 x D (|x x0 | < |f (x) f (x0 )| < )

and

f(x) f(x0)

f(x) 2 1 = 2 1

.
x0 x

x
0

Theorem 3.13 Let f : [a, b] R continuous and strictly increasing (or decreasing) and A := f (a), B := f (b). Then the inverse function f 1 : [A, B ] R (bzw. [B, A] R) is continuous and strictly increasing (or decreasing), too.

Example 3.10 (Roots) + Let k N, k 2. The function f : R+ R , x xk is continuous and strictly increasing. The inverse function f 1 : R+ R+ , x k x is continuous and strictly increasing.


41

Theorem 3.14 (Intermediate Value) Let f : [a, b] → R be continuous with f(a) < 0 and f(b) > 0. Then there exists a p ∈ [a, b] with f(p) = 0.
[Figure: a continuous function crossing zero, and a discontinuous function with no zero.]
Note: if f(a) > 0 and f(b) < 0, take −f instead of f and apply the intermediate value theorem.
Example 3.11 f(x) = x² − 2: f(1) = −1, f(2) = 2, so there is a p ∈ [1, 2] with f(p) = 0. For D = Q, however, there is no p ∈ D with f(p) = 0, since √2 is irrational.
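The intermediate value theorem is also the basis of the bisection method for locating such a zero numerically. The following is a minimal GNU Octave sketch for f(x) = x² − 2 on [1, 2]; the tolerance 1e-10 is an arbitrary choice.

% Bisection: f continuous with f(a) < 0 < f(b)  =>  a zero p lies in [a, b]
f = @(x) x.^2 - 2;
a = 1; b = 2;
while b - a > 1e-10
  m = (a + b) / 2;
  if f(m) < 0
    a = m;     % the zero lies in [m, b]
  else
    b = m;     % the zero lies in [a, m]
  end
end
printf("p = %.10f   (sqrt(2) = %.10f)\n", (a + b)/2, sqrt(2));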

Corollary 3.3.1 If f : [a, b] → R is continuous and y is any number between f(a) and f(b), then there is at least one x ∈ [a, b] with f(x) = y.
f(b) y f(a)

Note: Now it is clear that every continuous function on [a, b] assumes every value in the interval [f (a), f (b)].

3.3.1

Discontinuity

Denition 3.11 We write lim f (x) = c (lim f (x) = c), if for every sequence (xn ) with
x a x a

xn > a (xn < a) and lim xn = a holds: lim f (xn ) = c.


x n x

lim f (x) (lim f (x)) is called right-side (left-side) limit of f at x = a.


a x a

Theorem 3.15 A function is continuous at point a, if the right-side and left-side limit are equal. Lemma 3.1 A function is discontinuous at the point a, if limit lim f (x) does not exist.
xa

42


Conclusion: A function is discontinuous at the point a, if there are two sequences (xn ), (zn ) with lim xn = lim zn = a and lim f (xn ) = lim f (zn ). Example 3.12 1. Step: lim f (x) = c1 = c2 = lim f (x)
x a x a

f (x) = x n f or

1 2

x<n+
f(x) 1 1

1 2

nZ

.
xx0 xx0

-2

-1

1 2

2. Pole: lim f (x) = or lim f (x) = Example: f (x) = 1 x2 x = 0 is discontinuous at x = 0


sin(1/x)

3. Oscillation: 1 The function f (x) = sin , x


1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 -1 -0.5

0.5

Proof: sin

let

xn =

1 1 = 1 lim xn = 0, lim sin =1 n n xn xn 1 but: let zn = , nN n 1 lim zn = 0, lim sin =0 n n zn 1 is discontinuous. Limit is not unique, therefore sin x Note: Is a function f continuous x [a, b], then it holds for any convergent sequence (xn ) : lim f (xn ) = f ( lim xn ).
n n

1 + n 2

nN

Proof: as exercise Conclusion: Continuity of f at x0 = lim xn means that f and lim can be exchanged.
n n

3.4 Taylor Series

The Taylor series is a representation of a function as an innite sum of powers of x. Goals:

3.4 TaylorSeries 1. Simple representation of functions as polynomials, i.e.: f (x) a0 + a1 x + a2 x2 + a3 x3 + + an xn 2. Approximation of functions in the neighborhood of a point x0 . Ansatz: P (x) = a0 + a1 (x x0 ) + a2 (x x0 )2 + a3 (x x0 )3 + + an (x x0 )n coecients a0 , , an are sought such that f (x) = P (x) + Rn (x) with a remainder term Rn (x) and limn Rn (x) = 0, ideally for all x. We require for some point x0 that f (x0 ) = P (x0 ), f (x0 ) = P (x0 ), , f (n) (x0 ) = P (n) (x0 ) Computation of Coecients: P (x0 ) = a0 , P (x0 ) = a1 , P (x0 ) = 2a2 , , f
(k)

43

P (k) (x0 ) = k !ak , (x0 ) k!

f (k) (x0 ) = k !ak ak = Result:

f (x) = f (x0 ) +

f ( x0 ) f (x0 ) f (n) (x0 ) (x x0 ) + (x x0 )2 + + (x x0 )n +Rn (x) 1! 2! n!


P (x)

Example 3.13 Expansion of f (x) = ex in the point x0 = 0: f (x0 ) = f (0) = 1, f (0) = 1, f (0) = 1, , f (n) = 1

ex = 1 + x +

x2 x3 xn + + + + Rn (x) 2! 3! n! 1+x+ 1+x+


x2 2 x2 2

6 5 4 3 2 1

x3 6

1+x

1
1 2

ex
2 1

44


Theorem 3.16 (Taylor Formula) Let I R be an interval and f : I R a (n +1)times continuously dierentiable function. Then for x I, x0 I we have f (x) = f (x0 ) + with Rn (x) = f (x0 ) f ( x0 ) f (n) (x0 ) (x x0 ) + (x x0 )2 + + (x x0 )n + Rn (x) 1! 2! n! 1 n!
x

(x t)n f (n+1) (t)dt


x0

Theorem 3.17 (Lagrangian form of the remainder term) Let f : I R (n + 1) times continuously dierentiable and x0 , x I . Then there is a z between x0 and x such that f (n+1) (z ) (x x0 )n+1 . Rn (x) = (n + 1)! Example 3.14 f (x) = ex e =
k=0 x

Theorems 3.16 and 3.17 yield


n

xk ez + xn+1 k ! (n + 1)!
=Rn (x)

f or

|z | < |x|

Convergence: e|x| |x|n+1 =: bn |Rn (x)| (n + 1)! bn+1 |x| = 0 f or n bn n+2

the ratio test implies convergence of


n=0

bn . lim Rn (x) = 0 for all x R


n

lim bn = 0
n x

Thus the Taylor series for e^x converges to f(x) for all x ∈ R!

Example 3.15 Evaluation of the integral ∫₀¹ √(1 + x³) dx.
As the function f(x) = √(1 + x³) has no simple antiderivative (primitive function), it cannot be integrated symbolically. We compute an approximation for the integral by integrating the third order Taylor polynomial
√(1 + x³) = (1 + x³)^{1/2} ≈ 1 + x³/2
and substituting this into the integral:
∫₀¹ √(1 + x³) dx ≈ ∫₀¹ (1 + x³/2) dx = [x + x⁴/8]₀¹ = 9/8 = 1.125

The exact value of the integral is about 1.11145, i.e. our approximation error is about 1%.
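This comparison can be reproduced numerically in GNU Octave; quad is Octave's numerical quadrature routine, and the Taylor value 9/8 is the one derived above. The snippet is only an illustrative check.

% Compare the Taylor-polynomial approximation with numerical quadrature
f = @(x) sqrt(1 + x.^3);
taylor_value  = 1 + 1/8;            % integral of 1 + x^3/2 over [0,1]
numeric_value = quad(f, 0, 1);      % numerical integration of the exact integrand
printf("Taylor approximation: %.6f\n", taylor_value);
printf("Numerical value:      %.6f\n", numeric_value);
printf("Relative error:       %.2f %%\n", 100*abs(taylor_value - numeric_value)/numeric_value);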


45

Denition 3.12 The series Tf (x) =

f (k) (x0 ) (x x0 )k is called Taylor series of f with k ! k=0 expansion point (point of approximation) x0 .

Note: 1. For x = x0 every Taylor series converges. 2. But for x = x0 not all Taylor series converge! 3. A Taylor series converges for exactly these x I to f (x) for which the remainder term from theorem 3.16 (3.17) converges to zero. 4. Even if the Taylor series of f converges, it does not necessarily converge to f . ( example in the exercises.) Example 3.16 (Logarithm series) For 0 < x 2: ln(x) = (x 1) Proof: ln (x) = 1 , x ln (x) = 1 , x2 ln (x) = 2 , x3 ln(4) (x) = 6 , x4 ln(n) (x) = (1)n1 n! xn+1 (n 1)! xn (x 1)2 (x 1)3 (x 1)4 + 2 3 4

Induction: ln(n+1) (x) = ln(x)(n) Expansion at x0 = 1

(1)(n1)

(n 1)! xn

= (1)n

Tln,1 (x) =
k=0

ln(k) (1) (x 1)2 (x 1)3 (x 1)4 (x 1)k = (x 1) + k! 2 3 4

This series converges only for 0 < x 2 (without proof).

Denition 3.13 If a Taylor series converges for all x in an interval I , we call I the convergence area. Is I = [x0 r, x0 + r] or I = (x0 r, x0 + r), r is the convergence radius of the Taylor series. Example 3.17 Relativistic mass increase: Einstein: total energy: E = mc2 kinetic energy: Ekin = (m m0 )c2 m(v ) = m0 1
v 2 c

46 to be shown: for v m v2 c we have Ekin 1 2 0 Ekin = (m m0 )c2 = 1 1


v 2 c

3 Calculus Selected Topics

1 m0 c2

3 1 2 1 1 1 2 x2 + = (1 x) 2 = 1 + x + 2 2! 1x 1 3 = 1 + x + x2 + 2 8

for x

1: Ekin 1+

1 1 1+ x 2 1x

1 v2 3 v4 1 2 2 m v + m0 + . . . 1 m c = 0 0 2 c2 2 8 c2

3.5

Dierential Calculus in many Variables


f : Rn R (x1 , x2 , , xn ) y = f (x1 , x2 , , xn )

or x y = f (x )

3.5.1

The Vector Space Rn

In order to compare vectors, we use a norm: Denition 3.14 Any mapping 1. x = 0 i x = 0 2. x = || x R, x Rn x, y Rn triangle inequation : Rn R, x x is called Norm if and only if

3. x + y x + y

the particular norm we will use here is the Denition 3.15 (Euklidian Norm) The function | | : Rn R+ {0}, x vector x . Lemma: Die Euklidian norm is a norm. Theorem 3.18 For x Rn we have x 2 = x x = |x |2 Proof as exercise. Note: The scalar product in Rn induces the Euklidian norm.

2 x2 1 + + xn is the Euklidian Norm of the

3.5 Dierential Calculus in many Variables

47

3.5.2

Sequences and Series in Rn

analogous to Sequences and Series in R! Denition 3.16 A mapping N Rn , n a n is called sequence. Notation: (a n )nN Example 3.18 3 4 5 1 2 1 , 1 , 1 , 1 , 1 , = 2 3 4 5 1 1 1 1 1 2 4 8 16

n
1 n 1 2n1


nN

Denition 3.17 A sequence (a n )nN of vectors a n Rn converges to a Rn , if > 0 N () N |a n a | < n N () Notation: lim a n = a
n

Theorem 3.19 A (vector) sequence (a n )nN converges to a if and only if all its coordinate sequences converge to the respective coordinates of a . (Proof as exercise.) Notation: ak 1 . (a k )kN a k Rn ak = . . ak n Note: Theorem 3.19 enables us to lift most properties of sequences of real numbers to sequences of vectors.

3.5.3

Functions from Rn to Rm
f : D B , x f (x ) x1 . . . f (x1 , , xn ) xn

m = 1 : Functions f from D Rn to B R have the form

Example 3.19 f (x1 , x2 ) = sin(x1 + ln x2 ) m = 1 : Functions f from D Rn to B Rm have the form f :DB , x f (x )

48 x1 f1 (x1 , , xn ) . . . . . . xn fm (x1 , , xn ) Example 3.20 1. f : R3 R2 x1 x1 x2 x3 x2 cos x1 + sin x2 x3

3 Calculus Selected Topics

2. Weather parameters: temperature, air pressure and humidity at any point on the earth f : [0 , 360 ] [90 , 90 ] [270, ] [0, ] [0, 100%] temperature(, ) airpressure(, ) humidity (, ) Note: The components f1 (x ), , fm (x ) can be viewed (analysed) independently. Thus, in the following we can restrict ourselves to f : Rn R. 3.5.3.1 Contour Plots

Denition 3.18 Let D R2 , B R, c B, f : D B . The set {(x1 , x2 )|f (x1 , x2 ) = c} is called contour of f to the niveau c. Example 3.21 f (x1 , x2 ) = x1 x2 x1 x2 = c for x1 = 0 : x2 = (hyperbolas) c=0
3

c x1

x1 = 0 x2 = 0

5 0
-1

-5
-2

0 -2 0
-3 -2 -1 0 1 2 3

-3

-2 2

ContourPlot[x y, {x,-3,3}, {y,-3,3}, Contours -> {0,1,2,3,4,5,6,7,8,9,-1, -2,-3,-4,-5,-6,-7,-8,-9}, PlotPoints -> 60]

Plot3D[x y, {x,-3,3}, {y,-3,3}, PlotPoints -> 30]

3.5 Dierential Calculus in many Variables

49

3.5.4

Continuity in Rn

analogous to continuity of functions in one variable: Denition 3.19 Let f : D Rm a function and a Rn . If there is a sequence (a n ) (maybe more than one sequence) with lim a n = a , we write
n

lim f (x ) = c , x a if for any sequence (x n ), x n D with lim x n = a :


n n

lim f (x n ) = c

Denition 3.20 (Continuity) Let f : D Rm a function and a D. The function f is continuous in a , if lim f (x ) = x a f (a ). f is continuous in D, if f is continuous in all points in D. Note: These denitions are analogous to the one-dimensional case. Theorem 3.20 If f : D Rm , g : D Rm , h : D R are continuous in x 0 D, then f f + g , f g , f g and h (if h(x 0 )) = 0 ) are continuous in x 0 .

3.5.5
3.5.5.1

Dierentiation of Functions in Rn
Partial Derivatives f : R2 R
3 f (x1 , x2 ) = 2x2 1 x2

Example 3.22

keep x2 = const., and compute the 1dim. derivative of f w.r.t. x1 : f (x1 , x2 ) = fx1 (x1 , x2 ) = 4x1 x3 2 x1 analogous with x1 = const. f 2 = 6x2 1 x2 x2 = 12x1 x2 2 = 12x1 x2 2

second derivatives:
x2 x1 (x1 , x2 ) x1 (x1 , x2 ) x2

f f = x1 x2 x2 x1

50 Example 3.23 (u, v, w) = uv + cos w u (u, v, w) = v v (u, v, w) = u w (u, v, w) = sin w

3 Calculus Selected Topics

f1 (x1 , , xn ) . . Denition 3.21 If f (x ) = is partially dierentiable in x = x 0 , i.e. . fm (x1 , , xn ) fi all partial Derivatives xk (x 0 )(i = 1, , m, k = 1, , n) exist, then the matrix f (x 0 ) = is called Jacobian matrix. Example 3.24 Linearisation of a function: f : R2 R3 in x 0 2x2 f (x ) = sin(x1 + x2 ) ln(x1 ) + x2 0 2
f1 (x 0 ) x1 f2 (x 0 ) x1 f1 (x 0 ) x2 f2 (x 0 ) x2

. . . fm (x 0 ) x1

. . . fm (x 0 ) x2

. . .

. . . fm (x 0 ) xn

f1 (x 0 ) xn f2 (x 0 ) xn

f (x ) = cos(x1 + x2 ) cos(x1 + x2 ) 1 1 x1
f(x o)+f(x o)(xx o) f(x)

1dimensional f (x0 ) = lim


xx0

f (x) f (x0 ) x x0
xo x

Linearisation g of f in x 0 = 0 g (x1 , x2 ) = f (, 0) + f (, 0) 0 0 2 g (x1 , x2 ) = 0 + 1 1 1 ln 1 x1 x2 0 2x2 = x1 x2 + x1 + x2 + ln 1

x1 x2

3.5 Dierential Calculus in many Variables

51

Note: For x x 0 i.e. close to x 0 the linearisation g is a good approximation to f (under which condition?). Example 3.25 We examine the function f : R2 R with f (x, y ) = xy 0
x2 + y 2

if (x, y ) = (0, 0) if (x, y ) = (0, 0)

Dierentiability: f is dierentiable on R2 \{(0, 0)} since it is built up of dierentiable functions by sum, product and division. f (x, y ) = x y x2 + y 2 x2 y (x2 + y 2 ) 2
3

f y (0, y ) = =1 x y f (x, 0) = 0 x f f lim (0, y ) = lim (x, 0) y 0 x x0 x the partial derivative . Symmetries: 1. f is symmetric wrt. exchange of x and y , i.e. w.r.t. the plane y = x. 2. f is symmetric wrt. exchange of x and y , i.e. w.r.t. the plane y = x. 3. f (x, y ) = f (x, y ), d.h. f is symmetric w.r.t. the y -axis. 4. f (x, y ) = f (x, y ), d.h. f is symmetric w.r.t. the x-axis. Contours: xy x2 + y 2 Contours: y=
cx x2 c2 xcx 2 c2 f x

is not continuous in (0, 0). f is in (0, 0) not dierentiable

=c

xy = c

x2 + y 2

x 2 y 2 = c2 ( x 2 + y 2 ) cx c2

y 2 (x2 c2 ) = c2 x2

y =

x2

if c > 0, x > 0 (1. Quadr.) and c < 0, x < 0 (2. Quadr.) if c > 0, x < 0 (3. Quadr.) and c < 0, x > 0 (4. Quadr.)

Signs in the quadrants: - + + c c

f(x,y)=c

52

3 Calculus Selected Topics

2
0

0 -2 2 -4 -2 0 0 2 -4 4
-3 -2 -1 0 1 2 3

-1

-2

-2

-3

Continuity: f is continuous on R2 \{(0, 0)}, since it is built up of continuous functions by sum, product and division. Continuity in (0, 0): Let > 0 such that |x | = , i.e. = f (x, y ) = from |x| we get |f (x, y )| |x| = and Thus f is continuous in (0, 0). lim f (x, y ) = 0 x 0 x x2 + y 2 y = 2 x2 1 x2 /2 = x 1 x2 /2

x 2 x2 =

3.5.5.2

The Gradient

Denition 3.22 f : D R(D Rn ) f

( x ) x1 . . The Vector gradf (x ) := f (x )T = is called gradient of f . . f (x ) xn The gradient of f points in the direction of the steepest ascent of f .

3.5 Dierential Calculus in many Variables Example 3.26


y 3

53

f (x, y ) = x2 + y 2
1

f (x, y ) = 2x x gradf (x, y ) =

f (x, y ) = 2y y 2x 2y =2 x y

x y 3 1 1

x 1 3 x

z=3 z=1

3.5.5.3

Higher Partial Derivatives


f (x) xi

Let f : D Rm (D Rn ). Thus xk is well dened.

is again a function mapping from D to Rm and 2f (x ) = f xi ,xk (x ) xk xi

f xi

(x ) =:

Theorem 3.21 Let D Rn open and f : D Rm two times partially dierentiable. Then we have for all x 0 D and all i, j = 1, , n 2f 2f (x 0 ) = (x 0 ) xi xj xj xi Consequence: If f : D Rn (D Rn open) is ktimes continuously partially dierentiable, then kf kf = xik xik1 xi1 xi(k) . . . xi(1) for any Permutation of the numbers 1, . . . , k . 3.5.5.4 The Total Dierential

If f : Rn R is dierentiable, then the tangential mapping ft (x ) = f (x 0 ) + f (x 0 )(x x 0 ) represents a good approximation to the function f in the neighborhood of x 0 which can be seen in ft (x ) f (x 0 ) = f (x 0 )(x x 0 ). With df (x ) := ft (x ) f (x 0 ) f (x ) f (x 0 ) and dx1 . dx = . . := x x 0 dxn

54 we get: df (x ) = f (x 0 )dx or df (x ) =
k=1 n

3 Calculus Selected Topics

f f f (x 0 )dxk = (x 0 )dx1 + + dxn xk x1 xn


n

Denition 3.23 The linear mapping df =


k=1

f (x 0 )dxk is called total dierential xk

of the function f in x 0 . Note: Since in a neighborhood of x 0 , ft is a good approximation of the function f , we have for all x close to x 0 : df (x ) f (x ) f (x 0 ). Thus df (x ) gives the approximate deviation of the function value f (x ) from f (x 0 ), when x deviates from x 0 a little bit. 3.5.5.5 Application: The Law of Error Propagation

Example 3.27 For a distance of s = 10 km a runner needs the time of t = 30 min yielding = 20 km . Let the measurement error for the distance s be an average speed of v = s t h s = 1 m and for the time we have t = 1 sec. Give an upper bound on the propagated error v for the average speed! This can be solved as follows. To the given measurements x1 , , xn , a function f : Rn R has to be applied. The measurement error for x1 , , xn is given as x1 , , xn (xi > 0 i = 1, , n). The law of error propagation gives as a rough upper bound for the error f (x ) of f (x1 , , xn ) the assessment f (x1 , , xn ) < f f (x ) x1 + . . . + (x ) xn x1 xn

Denition 3.24 We call fmax (x1 , , xn ) := the maximum error of f . The ratio f f (x ) x1 + . . . + (x ) xn x1 xm
fmax (x ) f (x )

is the relative maximum error.

Note: fmax typically gives a too high estimate for the error of f , because this value only occurs if all measurement errors dx1 , , dxn add up with the same sign. This formula should be applied for about n 5.

3.5 Dierential Calculus in many Variables

55

Denition 3.25 When the number of measurements n becomes large, a better estimate for the error f is given by the formula fmean (x1 , , xn ) := for the mean error of f . f (x ) x1
2

x1 + . . . +

f (x ) xm

xn

Example 3.28 Solution fo example 3.27. Application of the maximum error formula leads to v (s, t) = v 1 v s s s (s, t) s + (s, t) t = s + 2 t = + 2 t s t t t t t 10 km 1 40 km km 0.001 km + h = 0.002 + = 0.013 = 2 0.5 h 0.25 h 3600 3600 h h

. This can be compactly written as the result v = (20 0.013) km h Denition 3.26 Let f : D R two times continuously dierentiable. The n nMatrix 2f 2f ( x ) . . . ( x ) 2 x1 xn x1 . . . . (Hessf )(x ) := . . 2f 2f (x ) . . . (x ) xn x1 x2
n

is the HesseMatrix of f in x . Note: Hessf is symmetric, since 2f 2f = xi xj xj xi

3.5.6

Extrema without Constraints

Again we appeal to your memories of onedimensional analysis: How do you determine extrema of a function f : R R? This is just a special case of what we do now.

56

3 Calculus Selected Topics

Denition 3.27 Let D Rn and f : D R a function. A point x D is a local maximum (minimum) of f , if there is a neighborhood U D of x such that f (x ) f (y ) (f (x ) f (y )) y U.

Analogously, we have an isolated local Maximum (Minimum) in x , if there is a neighborhood U D of x such that f (x ) > f (y ) (bzw. f (x ) < f (y )) y U, y =x

All these points are called extrema. If the mentioned neighborhood U of an extremum is the whole domain, i.e. U = D, then the extremum is global. Give all local, global, isolated and non-isolated maxima and minima of the function shown in the following graphs:
4

10 7.5 5 2.5 0 -4 -2 0 2 4 -4 -2 0 4

-2

2
-4

-4

-2

Plot3D[f[x,y], {x,-5,5},{y,-5,5}, PlotPoints -> 30]

ContourPlot[f[x,y], {x,-5,5},{y,-5,5}, PlotPoints -> 60, ContourSmoothing -> True,ContourShading-> False]

Theorem 3.22 Let D Rn be open and f : D R partially dierentiable . If f has a local extremum in x D, then gradf (x ) = 0. Proof: Reduction on 1dim. case: For i = 1, , n dene gi (h) := f (x1 , , xi + h, , xn ). If f has a local extremum in x , (x ) then all gi have a local extremum in 0. Thus we have for all i: gi (0) = 0. Since gi (0) = f xi we get f ( x ) x1 . . gradf (x ) = =0 . f (x ) xn Note: Theorem 3.22 represents a necessary condition for local extrema.

3.5 Dierential Calculus in many Variables Why is the proposition of Theorem 3.22 false if D Rn is no open set?

57

Linear algebra reminder:

Denition 3.28 Let A a symmetric n nMatrix of real numbers. A is positive (negative) denite, if all eigenvalues of A are positive (negative). A is positive (negative) semidenite, if all eigenvalues are 0 ( 0). A is indenite, if all eigenvalues are = 0 and there exist positive as well as negative eigenvalues.

Theorem 3.23 Criterium of Hurwitz Let A real valued symmetric matrix. A ist positive denite, if and only if for k = 1, , n a11 . . . ak 1 a1 k . . . akk

>0

A is negative denite if and only if -A is positive denite.

Theorem 3.24 For D Rn open and two times continuously dierentiable f : D R with gradf (x ) = 0 for x D the following holds: a) b) c) (Hessf )(x ) positive denite f has in x an isolated minimum (Hessf )(x ) negative denite f has in x an isolated maximum (Hessf )(x ) indenite f has in x no local extremum.

Note: Theorem 3.24 is void if (Hessf )(x ) is positive oder negative semidenite. Procedure for the application of theorems 3.22 and 3.23 to search local extrema of a function f : (D Rn ) R: 1. Computation of gradf 2. Computation of the zeros gradf 3. Computation of the Hessian matrix Hessf 4. Evaluation of Hessf (x ) for all zeros x of gradf . Example 3.29 Some simple functions f : R2 R: 1. f (x, y ) = x2 + y 2 + c gradf (x, y ) = 2x 2y gradf (0, 0) = 0 0 =0

58 Hessf = is positive denite on all R2 . 2. f (x, y ) = x2 y 2 + c gradf (0, 0) = 0 Hessf = isolated local maximum in 0 (paraboloid). 3. f (x, y ) = ax + by + c a, b = 0 gradf = no local extremum. 4. f (x, y ) = x2 y 2 + c gradf (x, y ) = 2x 2y a b 2 0 0 2

3 Calculus Selected Topics

f has an isolated local minimum in 0 (paraboloid).

2 0 0 2

= 0 x R2

gradf (0, 0) = 0 2 0 0 2

Hessf =

Hessf indenite f has no local extremum. 5. f (x, y ) = x2 + y 4 gradf = 2x 4y 3 gradf (0, 0) = 0 2 0 0 0

Hessf (0, 0) =

Hessf positive smidenite, but f has in 0 an isolated minimum. 6. f (x, y ) = x2 gradf = 2x 0 gradf (0, y ) = 0 2 0 0 0

Hessf (0, 0) =

Hessf positive semidenite, but f has a (non isolated) local minimum. All points on the yaxis (x = 0) are local minima. 7. f (x, y ) = x2 + y 3 gradf (x, y ) = 2x 3y 2 gradf (0, 0) = 0 2 0 0 0

Hessf (0, 0) =

Hessf positive semidenite, but f has no local extremum.

3.5 Dierential Calculus in many Variables

59

3.5.7

Extrema with Constraints

Example 3.30 Which rectangle (length x, width y ) has maximal area, given the perimeter U. Area f (x, y ) = xy . The function f (x, y ) has no local maximum on R2 !
U 2

Constraint: U = 2(x + y ) or x + y = g (x) := f (x,

substituted in f (x, y ) = xy

U U U x) = x ( x) = x x 2 2 2 2 U 2x = 0 2 x=y

g ( x) = x= y=
U 4 U 4

U g ( ) = 2 4 x = y = U/4 ist (the unique) maximum of the area for constant perimeter U ! In many cases substitution of constraints is not feasible! Wanted: Extremum of a function f (x1 , , xn ) under the p constraints h1 (x1 , , xn ) = 0 . . . hp (x1 , , xn ) = 0 Theorem 3.25 Let f : D R and h : D Rp be continuously dierentiable functions on an open set D Rn , n > p and the matrix h (x ) has rank p for all x D. If x0 D is an extremum of f under the constraint(s) h (x 0 ) = 0, there exist real numbers 1 , , p with f hk (x 0 ) + k (x 0 ) = 0 i = 1, , n xi x i k=1 and hk (x 0 ) = 0 k = 1, , p Illustration: For p = 1, i.e. only one given constraint, the theorem implies that for an extremum x0 of f under the constraint h(x 0 ) = 0 we have gradf (x 0 ) + gradh(x 0 ) = 0 gradf and gradh are parallel in the extremum x0 ! Contours of f and h for h(x ) = 0 are parallel in x0 .
p

60

3 Calculus Selected Topics The numbers 1 , , p are the Lagrange multipliers.

Note: We have to solve n + p equations with n + p unknowns. Among the solutions of this (possibly nonlinear) system the extrema have to be determined. Not all solutions need to be extrema of f under the constraint(s) h(x 0 ) = 0 (necessary but not sucient condition for extrema.) Denition 3.29 Let f, h be given as in theorem 3.25. The function L : D R
p

L(x1 , , xn ) = f (x1 , , xn ) +
k=1

k hk (x1 , , xn )

is called Lagrange function. Conclusion: The equations to be solved in theorem 3.25 can be represented as: L (x ) = 0 (i = 1, , n) xi hk (x ) = 0 (k = 1, , p) Example 3.31 Extrema of f (x, y ) = x2 + y 2 +3 under the constraint h(x, y ) = x2 + y 2 = 0
Contours of x2+y2+3 and constraint x2+y-2=0 constraint 2

1.5

y 1 0.5 0 -1.5 -1 -0.5 0 x 0.5 1 1.5

L(x, y ) = x2 + y 2 + 3 + (x2 + y 2) L (x, y ) = 2x + 2x x L (x, y ) = 2y + y gradL(x, y ) = 0 , h(x, y ) = 0

3.5 Dierential Calculus in many Variables 2x + 2x 2y + 2 x +y2 (2) in (1): 2x 4xy y 2 (3a) in (4): 2x 4x(2 x ) rst solution: x 1 = 0 2 is a maximum. 2 8 + 4 x2 = 0 4x2 = 6 x 2 ,3 = x2 =
3 2 1 2

61 = = = = = = 0 (1) 0 (2) 0 (3) 0 (4) 2 2x (3a) 0

3 2

y2,3 =

1 2

and x 3 =

1 2

3 2

are minima.

62

3 Calculus Selected Topics

Example 3.32 Extrema of the function f (x, y ) = 4x2 3xy on the disc K 0 ,1 = {(x, y )|x2 + y 2 1}.
1

0.5

-0.5

-1 -1 -0.5 0 0.5 1

Show[ContourPlot[4*x^2 - 3 *x*y, {x,-1,1}, {y,-1,1}, PlotPoints -> 60, Contours -> 20, ContourSmoothing -> True, ContourShading -> False, PlotLabel -> " "], Plot[{Sqrt[1-x^2],-Sqrt[1-x^2]}, {x,-1,1}], AspectRatio -> 1 ]

1. local extrema inside the disc D0 ,1 : gradf (x, y ) = x = 0 0 8x 3y 3x =0

is the unique zero of the gradient. Hessf = 8 3 3 0

|8| = 8 8 3 = 0 9 = 9 3 0 Hessf is neither positive nor negative denite. Eigenvalues of Hessf =: A Ax = x 8 3 3 1,2 (A )x = 0 8 3 3 =0

x = 0 det

(8 )() 9 = 0

2 8 9 = 0 = 4 16 + 9 1 = 9 2 = 1

Hessf is indenite f has no local extremum on any open set D. in particular f has on D0 ,1 no extremum!

3.5 Dierential Calculus in many Variables

63

2. Local extrema on the margin, i.e. on D0,1 : local extrema von f (x, y ) = 4x2 3xy under the constraint x2 + y 2 1 = 0: Lagrangefunction L = 4x2 3xy + (x2 + y 2 1) L = 8x 3y + 2x = (2 + 8)x 3y x L = 3x + 2y y Equations for x, y, : (1) 8x 3y + 2x (2) 3x + 2y (3) x2 + y 2 1 (1)y (2)x = (4) 8xy 3y 2 + 3x2 rst solution: (3) (3a) : =0 =0 =0 =0

y 2 = 1 x2 (3a)in(4) : 8x 1 x2 3(1 x2 ) + 3x2 = 0 Subst.: x2 = u : 8 u 1 u = 3(1 u) 3u = 3 6u 64u(1 u) 64u + 64u 36u + 36u 9 100u2 + 100u 9 9 u2 u + 100
2 2

squaring: = = = = = = = = = 9 36u + 36u2 0 0 0


9 25 =1 100 2 0 .1 0 .9 1 0.3162 10 3 10 0.9487 1 2 4 10

u1,2 =

1 2

1 4

9 100

u1 u2 x1,2 x3,4 Contours:

f (x, y ) = 4x2 3xy = c y= 4 c c + 4x2 = x 3x 3 3x 1 1 x2 3 = 10 9 3 27 3 = 10 10 10 9 3 45 +3 = 10 10 10


3 10 1 10 1 10 3 10

3 x3 = y3 = 10 f f 3 1 , 10 10 3 1 , 10 10
3 10 1 10 1 10 3 10

=4

=4

0,1 in x 1 = f (x, y ) has on K 0,1 in x 3 = f (x, y ) has on K

and in x 2 = and in x 4 =

isolated local maxima isolated local minima.

64 3.5.7.1 The Bordered Hessian

3 Calculus Selected Topics

In order to check whether a candidate point for a constrained extremum is a maximum or minimum, we need a sucient condition, similarly to the deniteness of the Hessian in the unconstrained case. Here we need the Bordered Hessian h1 h1 . . . 0 ... 0 x1 xn . . . . . . . . . . . . h h p p 0 ... 0 . . . x1 xn 2 2 Hess := h1 hp L L . . . x1 . . . x1 x1 xn x2 1 . . . . . . . . . . . . hp h1 2L 2L . . . xn xn x1 . . . xn x2
n

This matrix can be used to check on local minima and maxima by computing certain subdeterminants. Here we show this only for the two dimensional case with one constraint where the bordered Hessian has the form h h 0 x1 x2 2L 2L h Hess := x1 x x2 1 x2 1
h x2 2L x2 x1 2L x2 2

and the sucient criterion for local extrema is (in contrast to the unconstrained case!) the following simple determinant condition: Under the constraint h(x, y ) = 0 the function f has in (x, y ) a local maximum, if |Hess(x, y )| > 0 local minimum, if |Hess(x, y )| < 0.

If |Hess(x, y )| = 0, we can not decide on the properties of the stationary point (x, y ). Application to example 3.31 yields gradL(x, y ) = 2x(1 + ) 2y +

0 2x 1 Hess(x, y ) = 2x 2(1 + ) 0 . 1 0 2 Substitution of the rst solution of gradL = 0 which is x = 0, y = 2, = 4 into this matrix gives 0 0 1 |Hess(0, 2)| = 0 6 0 = 6 1 0 2 which proves that we indeed have a maximum in (0, 2).

3.6 Exercises

65

3.6

Exercises

Sequences, Series, Continuity


Exercise 3.1 Prove (e.g. with complete induction) that for p R it holds:
n

(p + k ) =
k=0

(n + 1)(2p + n) 2

Exercise 3.2 a) Calculate 1+ 1+ 1+ 1 + . . .,

i.e. the limit of the sequence (an )nN with a0 = 1 and an+1 = 1 + an . Give an exact solution as well as an approximation with a precision of 10 decimal places. b) Prove that the sequence (an )nN converges. Exercise 3.3 Calculate 1+ 1+ 1
1 1+
1 1 1+ 1+ ...

i.e. the limit of the sequence (an )nN with a0 = 1 and an+1 = 1+1/an . Give an exact solution as well as an approximation with a precision of 10 decimal places. Exercise 3.4 Calculate the number of possible draws in the German lottery, which result in having three correct numbers. In German lottery, 6 balls are drawn out of 49. The 49 balls are numbered from 1-49. A drawn ball is not put back into the pot. In each lottery ticket eld, the player chooses 6 numbers out of 49. Then, what is the probability to have three correct numbers? 1 1 1 1 1 Exercise 3.5 Investigate the sequence (an )nN with an := 1 + + + + + . . . + 2 3 4 5 n regarding convergence.

Exercise 3.6 Calculate the innite sum


n=0

1 . 2n

Exercise 3.7 Prove: A series of the partial sums is limited.

k=0

ak with k : ak > 0 converges if and only if the sequence

Exercise 3.8 Calculate an approximation (if possible) for the following series and investigate their convergence.

a)
n=0

(n + 1)2n

b)
n=0

4n (n + 1)! nn

c)
n=0

3n[4 + (1/n)]n

Exercise 3.9 Investigate the following functions f : R R regarding continuity (give an outline for each graph):
a) f (x) = 1 1 + ex b) f (x) = 0 if x = 1 else c) f (x) = x+4 if x > 0 (x + 4)2 else

1 x1

66
d) f (x) = (x 2)2 if x > 0 (x + 2)2 else e) f (x) = |x|

3 Calculus Selected Topics


f) f (x) = x x g) f (x) = x+ 1 x 2

Exercise 3.10 Show that f : R R with f ( x) = is not continuous in any point. 0 falls x rational 1 falls x irrational

TaylorSeries
Exercise 3.11 Calculate the Taylor series of sine and cosine with x0 = 0. Prove that the Taylor series of sine converges towards the sine function. Exercise 3.12 Try to expand the function f (x) = x at x0 = 0 and x0 = 1 into a Taylor series. Report about possible problems. Exercise 3.13 Let f be expandable into a Taylor series on the interval (r, r) around 0 ((r > 0). Prove: a) If f is an even function (f (x) = f (x)) for all x (r, r), then only even exponents

appear in the Taylor series of f , it has the form


k=0

a2k x2k .

b) If f is an odd function (f (x) = f (x)) for all x (r, r), then only odd exponents appear in the Taylor series of f , it has the form
k=0

a2k+1 x2k+1 .

Exercise 3.14 Calculate the Taylor series of the function f (x) = e x2 0


1

if x = 0 if x = 0

at x0 = 0 and analyse the series for convergence. Justify the result! Exercise 3.15 Calculate the Taylor series of the function arctan in x0 = 0. Use the result for the approximate calculation of . (Use for this for example tan(/4) = 1.)

Functions from Rn to Rm
Exercise 3.16 Prove that the scalar product of a vector x with itself is equal to the square of its length (norm). Exercise 3.17 a) Give a formal denition of the function f : R R+ {0} with f (x) = |x|. b) Prove that for all real numbers x, y |x + y | |x| + |y |. Exercise 3.18 a) In industrial production in the quality control, components are measured and the values x1 , . . . xn determinated. The vector d = x s indicates the deviation of the measurements to the nominal values s1 , . . . , sn . Now dene a norm on Rn such that ||d || < holds, i all deviations from the nominal value are less than a given tolerance .

3.6 Exercises b) Prove that the in a) dened norm satises all axioms of a norm.

67

Exercise 3.19 Draw the graph of the following functions f : R2 R (rst manually and then by the computer!): f1 (x, y ) = x2 + y 3 , Exercise 3.20 f : R3 R f2 (x, y ) = x2 + e(10x)
2

f3 (x, y ) = x2 + e(5(x+y)) + e(5(xy))


f , f , f x1 x2 x3

Calculate the partial derivatives

of the following functions


(x2 +x3 )

x3 2 c) f (x ) = x1 a) f (x ) = |x | b) f (x ) = xx 1 + x1 d) f (x ) = sin(x1 + x2 ) e) f (x ) = sin(x1 + a x2 )

Exercise 3.21 Build a function f : R2 R, which generates roughly the following graph:

10 7.5 5 2.5 0 -4 -2 0 2 4 -4 -2 0
-4 0

4 2
-2

-4

-2

Plot3D[f[x,y], {x,-5,5},{y,-5,5}, PlotPoints -> 30]

ContourPlot[f[x,y], {x,-5,5},{y,-5,5}, PlotPoints -> 60, ContourSmoothing -> True,ContourShading-> False]

Exercise 3.22 Calculate the derivative matrix of the function f (x1 , x2 , x3 ) = Exercise 3.23 For f (x, y ) = xy sin(ex + ey )

x1 x2 x3 sin(x1 x2 x3 ) 1 2 .

, nd the tangent plane at x0 =

Exercise 3.24 Draw the graph of the function f (x, y ) = ) for|y | > |x| y (1 + cos x y . 0 else

Show that f is continuous and partially dierentiable in R2 , but not in 0 . x2 + y 2 Exercise 3.25 Calculate the gradient of the function f (x, y ) = and draw it as 1 + x4 + y 4 an arrow at dierent places in a contour lines image of f . Exercise 3.26 The viscosity of a liquid is to be determinated with the formula K = 6vr. Measured: r = 3cm, v = 5cm/sec, K = 1000dyn. Measurement error: |r| 0.1cm, |v | 0.003cm/sec, |K | 0.1dyn. Determine the viscosity and its error .

68

3 Calculus Selected Topics

Extrema
Exercise 3.27 Examine the following function for extrema and specify whether it is a local, global, or an isolated extremum: a) f (x, y ) = x3 y 2 (1 x y ) b) g (x, y ) = xk + (x + y )2 (k = 0, 3, 4) Exercise 3.28 Given the function f : R2 R, f (x, y ) = (y x2 )(y 3x2 ). a) Calculate gradf and show: gradf (x, y ) = 0 x = y = 0. b) Show that (Hessf )(0) is semi-denite and that f has a isolated minimum on each line through 0. c) Nevertheless, f has not an local extremum at 0 (to be shown!). Exercise 3.29 Given the functions (x, y ) = y 2 x x3 , f (x, y ) = x2 + y 2 1. a) Examine for extrema. b) Sketch all contour lines h = 0 of . c) Examine for local extrema under the constraint f (x, y ) = 0. Exercise 3.30 The function f (x, y ) = sin(2x2 + 3y 2 ) x2 + y 2

has at (0,0) a discontinuity. This can be remedied easily by dening e.g. f (0, 0) := 3. a) Show that f is continuous on all R2 except at (0,0). Is it possible to dene the function at the origin so that it is continuous? b) Calculate all local extrema of the function f and draw (sketch) a contour line image (not easy). c) Determine the local extrema under the constraint (not easy): i) x = 0.1 ii) y = 0.1 iii) x2 + y 2 = 4 Exercise 3.31 Show that grad(f g ) = g gradf + f gradg .

Chapter 4 Statistics and Probability Basics


Based on samples, statistics deals with the derivation of general statements on certain features.
1

4.1

Recording Measurements in Samples

Discrete feature: nite amount of values. Continuous feature: values in an interval of real numbers. Denition 4.1 Let X be a feature (or random variable). A series of measurements x1 , . . . , xn for X is called a sample of the length n. Example 4.1 For the feature X (grades of the exam Mathematics I in WS 97/98) following sample has been recorded: 1.0 1.3 2.2 2.2 2.2 2.5 2.9 2.9 2.9 2.9 2.9 2.9 2.9 3.0 3.0 3.0 3.3 3.3 3.4 3.7 3.9 3.9 4.1 4.7 Let g (x) be the absolute frequency of the value x. Then 1 g ( x) n is called relative frequency or empirical density of X . h(x) = Grade X 1.0 1.3 2.2 2.5 2.9 3.0 3.3 3.4 3.7 3.9 4.1 4.7
1

Absolute frequency g (x) Relative frequency h(x) 1 0.042 1 0.042 3 0.13 1 0.042 7 0.29 3 0.13 2 0.083 1 0.042 1 0.042 2 0.083 1 0.042 1 0.042

The content of this chapter is strongly leaned on [?]. Therefore, [?] is the ideal book to read.

70 If x1 < x2 < . . . xn , then H ( x) =


tx

4 Statistics and Probability Basics

h(t)

is the empirical distribution function. It is apparent from the data that 8.3 % of the participating students in the exam Mathematics 1 in WS 97/98 had a grade better than 2.0. On the contrary, the following statement is an assumption: In the exam Mathematics 1, 8.3 % of the students of the HS RV-Wgt achieve a grade better than 2.0. This statemtent is a hypothesis and not provable. However, under certain conditions one can determine the probability that this statement is true. Such computations are called statistical induction.
Empirische Dichte 1 0.25 0.2 0.6 0.15 h(x) 0.1 0.05 0 1 1.5 2 2.5 3 Note X 3.5 4 4.5 H(x) 0.4 0.2 0 1 1.5 2 2.5 3 Note X 3.5 4 4.5 0.8 Empirische Verteilungsfunktion

When calculating or plotting empirical density functions, it is often advantageous to group measured values to classes. Example 4.2 Following frequency function has been determined from runtime measurements of a randomized program (automated theorem prover with randomized depth-rst search and backtracking):
4 3.5 3 2.5 Hugkeiten

ni der sequentiellen Laufzeiten ti

ni

2 1.5 1 0.5 0 0 10000 20000 30000 Laufzeit ti 40000 50000 60000

In this graphic, at any value ti {1, . . . , 60000} a frequency in the form of a histogram is shown. One can clearly see the scattering eects due to low frequencies per time value ti . In the next image, 70 values each have been summarized to a class, which results in 600 classes overall.

4.2 Statistical Parameters


16 14 12 10 Hugkeiten

71 ni der sequentiellen Laufzeiten ti

ni

8 6 4 2 0 0 10000 20000 30000 Laufzeit ti 40000 50000 60000

Summarizing 700 values each to a class one obtains 86 classes as shown in the third image. Here, the structure of the frequency distribution is not recognizable anymore. Hugkeiten ni der sequentiellen Laufzeiten ti
70 60 50 40 30 20 10 0 0 10000 20000 30000 Laufzeit ti 40000 50000 60000

ni

The amount of the classes should neither be choosen too high nor too low. In [?] a rule of thumb n is given.

4.2

Statistical Parameters

The eort to describe a sample by a single number is fulllled by following denition: Denition 4.2 For a sample x1 , x2 , . . . xn the term x = 1 n
n

xi
i=1

is called arithmetic mean and if x1 < x2 < . . . xn , then the sample median is dened as x n+1 if n odd 2 x = 1 xn + xn if n even +1 2 2 2 In the example 4.2, the arithmetic mean is marked with the symbol . It is interesting that

72

4 Statistics and Probability Basics

the arithmetic mean minimizes the sum of squares of the distances


n

(xi x)2
i=1

whereas the median minimizes the sum of the absolut values of the distances
n

|xi x|
i=1

(proof as exercise). Often, one does not only want to determine a mean value, but also a measure for the mean deviation of the arithmetic mean. Denition 4.3 The number s2 x := is called sample variance and 1 n1
n

1 n1

(xi x )2
i=1

sx := is called standard deviation

( xi x )2
i=1

4.3

Multidimensional Samples

If not only grades from Mathematics 1, but for any student also the grades of Mathematics 2 and further courses are considered, one can ask if there is a statistical relationship between the grades of dierent courses. Therefore, a simple tool, the covariance matrix is introduced. For a multidimensional variable (X1 , X2 , . . . , Xk ), a k -dimensional sample of the length n consists of a list of vectors (x11 , x21 , . . . , xk1 ), (x12 , x22 , . . . , xk2 ), . . . , (x1n , x2n , . . . , xkn ) By extension of example 4.1, we obtain an example for 2 dimensions. Example 4.3

4.3 Multidimensional Samples


Grade X 1.0 1.3 2.2 2.2 2.2 2.5 2.9 2.9 2.9 2.9 2.9 2.9 2.9 3.0 3.0 3.0 3.3 3.3 3.4 3.7 3.9 3.9 4.1 4.7

73
Grade Y 1.8 1.0 1.9 2.8 2.5 2.9 3.8 4.3 2.3 3.4 2.0 1.8 2.1 3.4 2.5 3.2 3.0 3.9 4.0 2.8 3.5 4.2 3.8 3.3

If beside the grades of Mathematics 1 (X ) the grades (Y ) of Mathematics for computer science are considered, one could determine the 2dimensional variable (X, Y ) as per margin. The question, if the variables X and Y are correlated can be answered by the covariance: n 1 (xi x )(yi y ) xy = n 1 i=1 For the grades above we determine xy = 0.47. That means that between these 2 variables a positive correlation exists, thus on average, a student being good in Mathematics 1 is also good in Mathematics for computer science. This is also visible on the left of the following two scatter plots.

Streudiagramm 1 4 0.8 3.5 3 Y 2.5 2 0.2 1.5 1 1 1.5 2 2.5 X 3 3.5 4 4.5 0 0 0.2 0.6 Y 0.4

Zufallszahlen

0.4 X

0.6

0.8

For the equally distributed random numbers in the right plot xy = 0.0025 is determined. Thus, the two variables have a very low correlation. If there are k > 2 variables, the data cannot easily be plotted graphically. But one can determine the covariances between two variables each in order to represent them in a covariance matrix : n 1 (xi x i )(xj x j ) ij = n 1 =1 If dependencies among dierent variables are to be compared, a correlation matrix can be determined: ij Kij = , si sj Here, all diagonal elements have the value 1. Example 4.4 In a medical database of 473 patients2 with a surgical removal of their appendix, 15 dierent symptoms as well as the diagnosis (appendicitis negative/positive) have been recorded.
The data was obtained from the hospital 14 Nothelfer in Weingarten with the friendly assistance of Dr. Rampf. Mr. Kuchelmeister used the data for the development of an expert system in his diploma thesis.
2

74
Alter: gender_(1=m___2=w): pain_quadrant1_(0=nein__1=ja): pain_quadrant2_(0=nein__1=ja): pain_quadrant3_(0=nein__1=ja): pain_quadrant4_(0=nein__1=ja): guarding_(0=nein__1=ja): rebound_tenderness_(0=nein__1=ja): pain_on_tapping_(0=nein__1=ja): vibration_(0=nein__1=ja): rectal_pain_(0=nein__1=ja): temp_ax: temp_re: leukocytes: diabetes_mellitus_(0=nein__1=ja): appendicitis_(0=nein__1=ja): The rst 3 data sets are as follows: 26 17 28 1 2 1 0 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 1 1 0 1 1 0 0 0 0 37.9 36.9 36.7

4 Statistics and Probability Basics


continuous. 1,2. 0,1. 0,1. 0,1. 0,1. 0,1. 0,1. 0,1. 0,1. 0,1. continuous. continuous. continuous. 0,1 0,1

38.8 37.4 36.9

23100 0 1 8100 0 0 9600 0 1

The correlation matrix for the data of all 473 patients is:
1. -0.009 0.14 0.037 -0.096 0.12 0.018 0.051 -0.034 -0.041 0.034 0.037 0.05 -0.037 0.37 0.012 -0.009 1. -0.0074 -0.019 -0.06 0.063 -0.17 0.0084 -0.17 -0.14 -0.13 -0.017 -0.034 -0.14 0.045 -0.2 0.14 -0.0074 1. 0.55 -0.091 0.24 0.13 0.24 0.045 0.18 0.028 0.02 0.045 0.03 0.11 0.045 0.037 -0.019 0.55 1. -0.24 0.33 0.051 0.25 0.074 0.19 0.087 0.11 0.12 0.11 0.14 -0.0091 -0.096 -0.06 -0.091 -0.24 1. 0.059 0.14 0.034 0.14 0.049 0.057 0.064 0.058 0.11 0.017 0.14 0.12 0.063 0.24 0.33 0.059 1. 0.071 0.19 0.086 0.15 0.048 0.11 0.12 0.063 0.21 0.053 0.018 -0.17 0.13 0.051 0.14 0.071 1. 0.16 0.4 0.28 0.2 0.24 0.36 0.29 -0.00013 0.33 0.051 0.0084 0.24 0.25 0.034 0.19 0.16 1. 0.17 0.23 0.24 0.19 0.24 0.27 0.083 0.084 -0.034 -0.17 0.045 0.074 0.14 0.086 0.4 0.17 1. 0.53 0.25 0.19 0.27 0.27 0.026 0.38 -0.041 -0.14 0.18 0.19 0.049 0.15 0.28 0.23 0.53 1. 0.24 0.15 0.19 0.23 0.02 0.32 0.034 -0.13 0.028 0.087 0.057 0.048 0.2 0.24 0.25 0.24 1. 0.17 0.17 0.22 0.098 0.17 0.037 -0.017 0.02 0.11 0.064 0.11 0.24 0.19 0.19 0.15 0.17 1. 0.72 0.26 0.035 0.15 0.05 -0.034 0.045 0.12 0.058 0.12 0.36 0.24 0.27 0.19 0.17 0.72 1. 0.38 0.044 0.21 -0.037 -0.14 0.03 0.11 0.11 0.063 0.29 0.27 0.27 0.23 0.22 0.26 0.38 1. 0.051 0.44 0.37 0.045 0.11 0.14 0.017 0.21 -0.00013 0.083 0.026 0.02 0.098 0.035 0.044 0.051 1. -0.0055 0.012 -0.2 0.045 -0.0091 0.14 0.053 0.33 0.084 0.38 0.32 0.17 0.15 0.21 0.44 -0.0055 1.

The matrix structure is more apparent if the numbers are illustrated as density plot3 In the left diagram, bright stands for positive and dark for negative. The right plot shows the absolute values. Here, white stands for a strong correlation between two variables and black for no correlation.
The rst to images have been rotated by 90o . Therefore, the elds in the density plot correspond to the matrix elements.
3

4.4 Probability Theory


j j 12.5 12.5 2.5 7.5 2.5 7.5

75

10

15

10

15

It is clearly apparent that most of the variable pairs have no or only a very low correlation, whereas the two temperature variables are highly correlated.

4.4

The purpose of probability theory is to determine the probability of certain possible events within an experiment. Example 4.5 When throwing a die once, the probability for the event throwing a six is 1/6, whereas the probability for the event throwing an odd number is 1/2. Denition 4.4 Let be the set of possible outcomes of an experiment. Each stands for a possible outcome of the experiment. If the wi exclude each other, but cover all possible outcomes, they are called elementary events. Example 4.6 When throwing a die once, = {1, 2, 3, 4, 5, 6}, because no two of these events can occur at the same time. Throwing an even number {2, 4, 6} is not an elementary event, as well as throwing a number lower than 5 {1, 2, 3, 4}, because {2, 4, 6}{1, 2, 3, 4} = {2, 4} = . = A = { | Denition 4.5 Let be a set of elementary events. A / A} is called the complementary event to A. A subset A of 2 is called event algebra over , if: 1. A. is also in A. 2. With A, A 3. If (An )nN is a sequence A , then n=1 An is also in A. Every event algebra contains the sure event as well as the impossible event .

0 Dichteplot d. Korrelationsmatrix 2.5 5 7.5 i 10 12.5 15

0 2.5 5 7.5 i 10 12.5 15 Betraege d. Korrelationsmatrix

Probability Theory

76

4 Statistics and Probability Basics

At coin toss, one could choose A = 2 and = {1, 2, 3, 4, 5, 6}. Thus A contains any possible event by a toss. = {1, 2, 3, 4, 5} If one is only interested in throwing a six, one would consider A = {6} and A }. only, where the algebra results in A = {, A, A, The term of the probability should give us an as far as possible objective description of our believe or conviction about the outcome of an experiment. As numeric values, all real numbers in the interval [0, 1] shall be possible, whereby 0 is the probability for the impossible event and 1 the probability for the sure event.

4.4.1

The Classical Probability Denition

Let = {1 , 2 , . . . , n } be nite. No elementary event is preferred, that means we assume a symmetry regarding the frequency of occurence of all elementary events. The probability P (A) of the event A is dened by P (A) = Amount of outcomes favourable to A |A| = || Amount of possible outcomes

It is obvious that any elementary event has the probability 1/n. The assumption of the same probability for all elementary events is called the Laplace assumption. Example 4.7 Throwing a die, the probability for an even number is P ({2, 4, 6}) = 3 1 |{2, 4, 6}| = = . |{1, 2, 3, 4, 5, 6}| 6 2

4.4.2

The Axiomatic Probability Denition

The classical denition is suitable for a nite set of elementary events only. For endless sets a more general denition is required. Denition 4.6 Let be a set and A an event algebra on . A mapping P : A [0,1] is called probability measure if: 1. P () = 1. 2. If the events An of the sequence (An )nN are pairwise inconsistent, i.e. for i, j N it holds Ai Aj = , then

P
i=1

Ai

=
i=1

P (Ai ).

For A A, P (A) is called probability of the event A. From this denition, some rules follow directly:

4.4 Probability Theory

77

Theorem 4.1 1. P () = 0, i.e. the impossible event has the probability 0. 2. For pairwise inconsistent events A and B it holds P (A B ) = P (A) + P (B ). 3. For a nite amount of pairwise inconsistent events A1 , A2 , . . . Ak it holds
k k

P
i=1

An

=
i=1

P (An ).

it holds P (A) + P (A ) = 1. 4. For two each other complentary events A and A 5. For any event A and B it holds P (A B ) = P (A) + P (B ) P (A B ). 6. For A B it holds P (A) P (B ). Proof: as exercise.

4.4.3

Conditional Probabilities

Example 4.8 In the Doggenriedstrae in Weingarten the speed of 100 vehicles is measured. At each measurement it is recorded if the driver was a student or not. The results are as follows: Event Frequency Relative frequency Vehicle observed 100 1 Driver is a student (S ) 30 0.3 Speed too high (G) 10 0.1 Driver is a student and speeding (S G) 5 0.05 We now ask the following question: Do students speed more frequently than the average person, or than non-students?4 The answer is given by the probability P (G|S ) for speeding under the condition that the driver is a student. P (G|S ) = 5 1 |Driver is a student and speeding| = = |Driver is a student| 30 6

Denition 4.7 For two events A and B , the probability for A under the condition B (conditional probability) is dened by P (A|B ) =
4

P (A B ) P (B )

The determined probabilities can only be used for further statements if the sample (100 vehicles) is representative. Otherwise, one can only make a statament about the observed 100 vehicles.

78

4 Statistics and Probability Basics

At example 4.8 one can recognize that in the case of a nite event set the conditional probability P (A|B ) can be treated as the probability of A, when regarding only the event B , i.e. as |A B | P (A|B ) = |B | Denition 4.8 If two events A and B behave as P (A|B ) = P (A), then these events are called independent. A and B are independent, if the probability of the event A is not inuenced by the event B . Theorem 4.2 From this denition, for the independent events A and B follows P (A B ) = P (A) P (B ) Beweis: Proof: P (A|B ) = P (A B ) = P (A) P (B ) P (A B ) = P (A) P (B )

Example 4.9 The probability for throwing two sixes with two dice is 1/36 if the dice are independent, because 1 1 1 = P (die 1 six) P (die 2 six) = 6 6 36 = P (die 1 six die 2 six), whereby the last equation applies only if the two dice are independent. If for example by magic power die 2 always falls like die 1, it holds 1 P (die 1 six die 2 six) = . 6

4.4.4

The Bayes Formula


P (A|B ) = P (A B ) P (B ) as well as P (B |A) = P (A B ) . P (A)

Since equation (4.7) is symmetric in A and B , one can also write

Rearranging by P (A B ) and equating results in the Bayes formula P (A|B ) = P (B |A) P (A) . P (B )

A very reliable alarm system warns at burglary with a certainty of 99%. So, can we infer from an alarm to burglary with high certainty? No, because if for example P (A|B ) = 0.99, P (A) = 0.1, P (B ) = 0.001 holds, then the Bayes formula returns: P (B |A) = P (A|B )P (B ) 0.99 0.001 = = 0.01. P (A) 0.1

4.5 Discrete Distributions

79

4.5

Discrete Distributions

Denition 4.9 A random variable, which range of values is nite or countably innite is called discrete random variable. Example 4.10 Throwing a die, the number X is a discrete random variable with the values {1, 2, 3, 4, 5, 6}, this means in the example it holds x1 = 1, . . . , x6 = 6. If the die does not prefer any number, then pi = P (X = xi ) = 1/6, this means the numbers are uniformly distributed. The probability to throw a number 5 is P (X 5) = pi = 5/6.
i:xi 5

In general, one denes Denition 4.10 The function, which assigns a probability pi to each xi of the random variable X is called the discrete density function of X .

Denition 4.11 For any real number x, a dened function x P (X x) =


i :x i x

pi

is called distribution function of X . Such as the empirical distribution function, P (X x) is a monotonically increasing step function. Analogous to the mean value and variance of samples are the following denitions. Denition 4.12 The number E (X ) =
i

xi p i

is called expected value. The variance is given by V ar(X ) := E ((X E (X ))2 ) =


i

(xi E (X ))2 pi

whereby

V ar(x) is called standard deviation.

It is easy to see that V ar(X ) := E (X 2 ) E (X )2 (exercise).

80

4 Statistics and Probability Basics

4.5.1

Binomial Distribution

Let a players scoring probability at penalty kicking be = 0.9. The probability always to score at 10 independent kicks is B10,0.9 (10) = 0.910 0.35. It is very unlikely that the player scores only once, the probability is B10,0.9 (1) = 10 0.19 0.9 = 0.000000009 We might ask the question, which amount of scores is the most frequent at 10 kicks. Denition 4.13 The distribution with the density function n Bn,p (x) = x px (1 p)nx is called binomial distribution. Thus, the binomial distribution indicates the probability that with n independent tries of a binary event of the probability p the result will be x times positive. Therefore, we obtain n B10,0.9 (k ) = k 0.1k 0.9nk The following histograms show the densities for our example for p = 0.9 as well as for p = 0.5.
0.4 0.35 0.2 0.3 0.25 0.2 0.15 0.1 0.05 0.05 0 1 2 3 4 5 6 7 8 9 10 x 0 1 2 3 4 5 6 7 8 9 10 x 0.1 0.15 B(x,10,0.9) 0.25 B(x,10,0.5)

For the binomial distribution it holds


n

E (X ) =
x=0

n x x px (1 p)nx = np

and V ar(X ) = np(1 p).

4.6 Continuous Distributions

81

4.5.2

Hypergeometric Distribution

Let N small balls be placed in a box. K of them are black and N K white. When drawing n balls, the probability to draw x black is K x N K nx . N n

HN,K,n (x) =

The left of the following graphs shows H100,30,10 (x), the right one HN,0.3N,10 (x). This corresponds to N balls in the box and 30% black balls. It is apparent, that for N = 10 the density has a sharp maximum, which becomes atter with N > 10.
H(x,N,0.3N,10) H(x,100,30,10) 0.25 0.2 0.6 0.15 0.1 0.05 1 2 3 4 5 6 7 8 9 10 11 x 0 0 2 4 x 6 8 10 10 15 20 0.4 0.2 25 N 30

As expected, the expected value of the hypergeometric distribution is E (X ) = n K . N

4.6

Continuous Distributions

Denition 4.14 A random variable X is called continuous, if its value range is a subset of the real numbers and if for the density function f and the distribution function F it holds x F (x) = P (X x) = f (t)dt.

With the requirements P () = 1 and P () = 0 (see def. 4.6) we obtain


x

lim F (x) = 0 sowie lim F (x) = 1.


x

4.6.1

Normal Distribution

The most important continuous distribution for real applications is the normal distribution with the density (x )2 1 exp . , (x) = 2 2 2

82

4 Statistics and Probability Basics

Theorem 4.3 For a normally distributed variable X with the density , it holds E (X ) = and V ar(X ) = 2 .
0.4

For = 0 and = 1 one obtains the standard normal distribution 0,1 . With = 2 one obtains the atter and broader density 0,2 .
-4 -2

0.3

0,1 (x)

0.2

0.1

0,2 (x)
2 4

Example 4.11 Let the waiting times at a trac light on a country road at lower trac be uniformly distributed. We now want to estimate the mean waiting time by measuring the waiting time T 200 times.
Haeufigkeiten der Wartezeiten (40 Klassen) 8

The empirical frequency of the waiting times is shown opposite in the image. The mean value () lies at 60.165 seconds. The frequencies and the mean value indicate a uniform distribution of times between 0 und 120 sec.

6 4 2 Wartezeit 120 t [ sec]

20

40

60

80

100

Due to the niteness of the sample, the mean value does not lie exactly at the expected value of 60 seconds. We now might ask the question, if the mean value is reliable, more precise with what probability such a measured mean diers from the expected value by a certain deviation. This will be investigated regarding the mean value from 200 times as random variable while recording a sample for the mean value. For example, we let 200 people independently measure the mean value from 200 records of the waiting time at a trac light. We obtain the following result:
0.15

The empirical density function of the distribution of the mean shows a clear maximum value t at t = 60 seconds while steeply sloping at the borders at 0 and 120 seconds. It looks like a normal distribution.

0.125 0.1 0.075 0.05 0.025 52.5 55 57.5 60 62.5 65 67.5

The kind of relation between the distribution of the mean value and the normal distribution is shown by the following theorem:

4.6 Continuous Distributions

83

Theorem 4.4 (Central Limit Theorem) If X1 , X2 , . . . , Xn are independent identically distributed random variables with (Xi ) < and Sn = X1 + . . . + Xn , then Sn tends (for n ) to a normal distribution with the expected value nE (X1 ) and the standard deviation of n . It holds
n

lim sup{|Sn (x) nE (X1 ),n(X1 ) (x)| : x R} = 0.

This theorem has some important conclusions: The sum of independent identically distributed random variables asymptotically tends to a normal distribution. The mean of the n independent measurements of a random variable is approximately normally distributed. The approximation holds better, the more measurements are made. The standard deviation of a sum X1 + . . . + Xn of identically distributed random variables is equal to n (X1 ).

Example 4.12 The following diagram shows the (exact) distribution of the mean calculated from n i.i.d. (independent identically distributed) discrete variables, each uniformly distributed: p(0) = p(1) = p(2) = p(3) = p(4) = 1/5.
Distribution of mean of uniform i.i.d. var. 0.2 0.15 p(x) 0.1 0.05 0 0 0.5 1 1.5 2 x 2.5 3 3.5 4 n=1 n=2 n=3 n=4

With the help of the central limit theorem we now want to determine the normal distribution of the mean value from example 4.11 in order to compare it with the empirical density of n after n time measurements is the mean value. The mean value t n = 1 t n Following theorem 4.4, the sum
n i=1 ti n

ti .
i=1

is normally distributed and has the density

1 (x nE (T ))2 nE (X1 ),n (x) = exp 2n 2 2 n n has the density E (T ), . The mean value t n
5

The variance 2 of the uniform distribution

This is given by the following, easy to proof property of the variance: V ar(X/n) = 1/n2 V ar(X ).

84 is still missing.

4 Statistics and Probability Basics

Denition 4.15 The density of the uniform distribution over the interval (a, b) (also called rectangular distribution) is f (x) =
1 ba

if a x b if sonst

One calculates a+b 2 a 1 V ar(X ) = E (X 2 ) E (X )2 = ba 1 E (X ) = ba x dx = Therefore, for the example one calculates (b a) 120 = = = 6 n 12n 12 200 Thus, the density of the mean value of the trac light waiting times should be approximated well by 60,6 as it can be seen in the following image.
b

(4.1)
b

x2 dx
a

a+b 2

(b a)2 12

(4.2)

0.15 0.125

Density function of the distribution of the mean value with the density of the normal distribution 60,6 .

0.1 0.075 0.05 0.025 55 60 65 70

Since we now know the density of the mean value, it is easy to specify a symmetric interval in which the mean value (after our 200 measurements) lies with a probability of 0.95. In the image above (60,6 ) we have to determine the two points u1 and u2 , which behave u2 ) = P (u1 t Because of
u2

60,6 (t) dt = 0.95


u1

60,6 (t) dt = 1

it must behave
u1 u2

60,6 (t) dt = 0.025 und


60,6 (t) dt = 0.975.

4.7 Exercises

85

Graphically, we can nd the two points u1 , u2 , searching for the x values to the level 0.025 and 0.975 in the graph of the distribution function of the normal distribution
x

60,6 (x) = P (X x) =

60,6 (t) dt

1 0.8

0.975

From the image on the opposite we read out u1 55.2, u2 64.8.

0.6 0.4 0.2 0.025 60

u1

56

58

62

64 u2

66

We now know the following: After our sample of 200 time measurements the expexted value of our waiting time t lies in the interval [55.2, 64.8] with a probability of 0.95.6 This interval is called the condence interval to the level 0.95. In general, the condence interval [u1 , u2 ] to the level 1 has the following meaning. Instead of estimating a paramater from sample measurements, we can try to determine an interval, that contains the value of with high probability. For a given number (in the example above, was 0.05) two numbers u1 and u2 are sought which behave P (u1 u2 ) = 1 . Not to be confused with the condence interval are the quantiles of a distribution. Denition 4.16 Let X be a continuous random variable and (0, 1). A value x is called -quantile, if it holds P ( X x ) = The 0.5 quantile is called median.
x

f (t) dt = .

4.7
6

Exercises

Exercise 4.1
This result is only exact under the condition that the standard deviation of the distribution of t is known. If is unknown too, the calculation is more complex.

86 1 a) Show that the arithmetic mean x = n


n n

4 Statistics and Probability Basics xi minimizes the sum of the squared distances
i=1

(xi x)2 .
i=1

b) Show that the median x n+1 x =


2

if n odd
2

1 2

xn + x n+1 2

if n even
n

minimizes the sum of the absolute values of the distances


n i=1

|xi x|. (Hint: consider by

an example how
i=1

|xi x| is going to change if x deviates from the median.)

Exercise 4.2 As thrifty, hard-working Swabians we want to try to calculate whether the German lottery is worth playing. In German lottery, 6 balls are drawn out of 49. The 49 balls are numbered from 1-49. A drawn ball is not put back into the pot. In each lottery ticket eld, the player chooses 6 numbers out of 49. a) Calculate the number of possible draws in the lottery (6 of 49 / saturday night lottery), which result in having (exactly) three correct numbers. Then, what is the probability to have three correct numbers? b) Give a formula for the probability of achieving n numbers in the lottery. c) Give a formula for the probability of achieving n numbers in the lottery with the bonus number (the bonus number is determined by an additionally drawn 7th ball). d) What is the probability that the (randomly) drawn super number (a number out of {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}) equals the last place of the serial number of the lottery ticket? e) Calculate the average lottery prize if the following sums are payed out (s.n.: super number, b.n.: bonus number):
Winning class Correct numbers Prize (6.12.1997) Prize (29.11.1997) Prize (22.11.1997) Prize (15.11.1997) Prize (8.11.1997) I 6 with s.n. 4.334.833,80 12.085.335,80 7.938.655,30 3.988.534,00 16.141.472,80 II 6 without s.n. 1.444.944,60 1.382.226,80 3.291.767,70 2.215.852,20 7.288.193,60 III 5 with b.n. 135.463,50 172.778,30 141.075,70 117.309,80 242.939,70 IV 5 10.478,20 12.905,90 11.018,40 9.537,30 14.798,30 V 4 178,20 192,30 157,50 130,70 190,10 VI 3 with b.n. 108,70 82,30 79,20 60,80 87,70 VII 3 11,00 12.10 10,10 8,70 10,90

Exercise 4.3 Show that for the variance the following rule holds V ar(X ) = E (X 2 ) E (X )2 .

Exercise 4.4 a) For pairwise inconsistent events A and B it holds P (A B ) = P (A) + P (B ). (Hint: consider, how the second part of denition 10.6 could be applied on (only) 2 events.) b) P () = 0, i.e. the impossible event has the probability 0. it holds P (A) + P (A ) = 1. c) For two complementary events A and A

4.7 Exercises d) For arbitrary events A and B it holds P (A B ) = P (A) + P (B ) P (A B ). e) For A B it holds P (A) P (B ). Exercise 4.5 Give an example for an estimator with 0 variance. Exercise 4.6 Show that for the sample variance it holds: s2 = 1 n1
n

87

(xj )2
j =1

n ( x )2 . n1

Chapter 5 Numerical Mathematics Fundamentals


5.1
5.1.1

Arithmetics on the Computer


Floating Point Numbers

The set of oating point numbers to base , with t fractional digits and exponents between m and M , can be formally dened by F (, t, m, M ) = {d : d = .d1 d2 . . . dt e } {0} Q with N 0 di 1 di : digits, d1 = 0 d1 , d2 , . . . , dt : mantissa t : mantissa length e : exponent with m e M m, M Z The oating point number .d1 d2 . . . dt e has the value d = d1 e1 + d2 e2 + + dt et Example 5.1 Let = 2 and t = 3 given, that means we consider three-digit numbers in the binary system. The number 0.101 221 has the value 0.101 221 = 1 220 + 0 219 + 1 218 = 220 + 218 . In the decimal system with = 10 we need a six-digit mantissa (t = 6), to represent this number: 220 + 218 = 1310720 = 0.131072 107 . 5.1.1.1 Distribution of F (, t, m, M ) |F (, t, m, M )| = 2 (M m + 1) ( t (t1) ) + 1 0 exponents mantissas

5.1 Arithmetics on the Computer Example 5.2 F (2, 3, 1, 2) with the upper formula we get: |F (2, 3, 1, 2)| = 2(4)(23 22 ) + 1 = 33 there are only the 0 and 32 dierent numbers between 0.100 21 , the number with smallest absolute value 0.111 22 , the number with largest absolute value The elements 0 of F(2,3,-1,2) are 5 3 7 5 7 1 5 3 7 1 5 3 7 0; , , , ; , , , ; 1, , , ; 2, , 3, 4 16 8 16 2 8 4 8 4 2 4 2 2 Distribution on the number line:

89

gap at zero 0 1/4 1/2


problems:
Exponent overow Exponent underow Round-o error

5.1.2
5.1.2.1

Round-o Errors
Round-o and Truncation Errors (absolute) F (, t, m, M )

Denition 5.1 c , r : 0. . . . M , 0. . . . M 1

with =

Round-o: x r (x) = nearest neighbor of x in F (, t, m, M ) Truncate: x c (x) = max {y F (, t, m, M )|y x} It holds: 1 absolute value Round-o Errors = |r (x) x| et 2 absolute value Truncation Error = |c (x) x| < et
2stellige Mantisse

Example 5.3 = 10 ,
10er System

t=2

, e=3
Exponent

x = 475 r (x) = 0.48 103

round-o

90 c (x) = 0.47 103 truncate

5 Numerical Mathematics Fundamentals

|r (x) x| = |480 475| = 5

1 1032 = 5 2 |c (x) x| = |470 475| = 5 < 1032 = 10

5.1.2.2

Round-o and Truncation Errors (relative) |r (x) x| 1 1t |x| 2 |c (x) x| < 1t |x|

Example 5.4 relative round-o error |480 475| 1 1 1 = 101 = |475| 95 2 20 1 1 |110 105| = < |105| 21 20 Example 5.5 t=3, = 10 110 105 = 11550 = 11600 = r (11550) Achtung: Field axioms violated! F (, t, m, M ) is not closed w.r.t. multiplication. Let {+, , , div} x, y F (, t, m, M ) : r (x y ) = x y upper bound for the smallest number! For xed number of digits, the relative error gets bigger for smaller numbers!

5.1.3

Cancellation

Example 5.6 Let = 10 and t = 8 0.1 109 0.1 101 0.1 109 0.1 101 = 1 0.1 109 0.1 109 = 0 0.1 109 0.1 109 = 0 0 + 0.1 101 = 1

a b c a+b+c r (r (a + b) + c) r (a + r (b + c)) r (r (a + c) + b)

= = = = = = =

Associative law is not valid in F (, t, m, M )

5.1 Arithmetics on the Computer

91

5.1.4

Condition Analysis

Example 5.7 Solve the linear system

x + ay = 1 ax + y = 0

x a2 x = 1 x = 1 1 a2 f ur a = 1

a = 1.002 = exact value a = 1.001 = measurement or rounding-o error relative error: solution: a a 1 = a 1002

1 249.75 0.004 1 x 499.75 0.002 x x 250 relative error = 1.001 x 249.75 x

(100% error)

See Figure 5.1.


x

a -1 1

Figure 5.1: Gain of the input error under ill-condition. Matrix A = 1 a a 1 is singular for a = 1, i.e. 1 1 1 1 =0

92

5 Numerical Mathematics Fundamentals

Denition 5.2 Let P be the problem to calculate the function f (x) with given input x. x The condition number Cp is the factor by which a relative error in the input f will be x increased, i.e. x f ( x + x) f ( x) = Cp f ( x) x (f (x + x) f (x))/f (x) f ( x) x x/x f (x) Example 5.8 Calculation of Cp Cp = x = f ( a) = Cp 1 1 a2 f (a) = 2a (1 a2 )2 It holds:

2a2 2a 2 (1 a ) a = = 501.5 (1 a2 )2 1 a2

direct calculation (see above): Cp 1002 Factor 2 due to linearization of f in a!

Denition 5.3 A problem is ill-conditioned (well-conditioned) if Cp (Cp < 1 oder Cp 1) Note: Cp depends on the input data!

5.2
see [1]

Numerics of Linear Systems of Equations

5.2.1

Solving linear equations (Gau method)

Linear System Ax = b: a11 x1 + a12 x2 + + a1n xn = b1 a21 x1 + a22 x2 + + a2n xn = b2 ................................ an1 x1 + an2 x2 + + ann xn = bn aij R n1 Questions:
Is L solvable? Is there a unique solution? How to calculate the solutions? Is there an ecient algorithm?

5.2 Numerics of Linear Systems of Equations 5.2.1.1 Gauian Elimination Method a11 x1 + a12 x2 + a22 x2 + akk xk . . . + + a1n xn = b1 + a2n xn = b2 + akn xn = bk . . . + ain xn = bi . . . + ann xn = bn

93

+ akj xj + . . . + aij xj . . . +

aik xk + . . . ank xk + The algorithm:

+ anj xj +

for k=1,...,n-1 search a_mk with |a_mk|=max{ |a_lk| : l >= k } if a_mk=0 print "singulaer"; stop swap lines m and k for i=k+1,...,n q_ik:=a_ik/a_kk for j=k+1,...,n a_ij:=a_ij - q_ik*a_kj end b_i:=b_i - q_ik*b_k end end

Theorem 5.1 Complexity: The number of operations of the Gauian elimination for large n is approximately equal to 1 n3 . 3 Proof: 1. step:

lines

columns operations operations

(n 1) (n 1 + 2)

k-ter step: (n k )(n k + 2) total:


n1 l:=nk

n1

T (n) =
k=1 n1

(n k )(n k + 2) =
l=1

(l(l + 2))

=
l=1

(l2 + 2l) = =

n3 n2 n + + n(n 1) 3 2 6

n3 n2 5 + n 3 2 6 n3 for large n: 3

94 Example 5.9 Computer with 1 GFLOPS n 10 1/3 103 109 100 1/3 1003 109 1000 1/3 10003 109 10000 1/3 100003 109 Problems/Improvements: 1. long computing times for large n
better algorithms

5 Numerical Mathematics Fundamentals

T (n) sec sec sec sec

0.3 0.3 0.3 300

sec msec sec sec = 5 min

T (n) = C n2.38
Iterative method (Gau-Seidel)

instead of

1 3 n 3

2. Round-o error
complete pivoting Gau-Seidel

Applications:
Construction of curves through given points Estimation of parameters (least squares) Linear Programming Computer graphics, image processing (e.g. computer tomography) Numerical solving of dierential equations

5.2.1.2

Backward Substitution

After n-1 elimination steps: a11 a12 a1n 0 a a2n 22 A = . . .. . 0 0 . 0 0 0 ann

Ax=b

with

Calculation of x1 , . . . , xn : xn = xn1 bn ann bn1 an1,n xn = an1,n1

General:

5.2 Numerics of Linear Systems of Equations bi


n k=i+1

95

xi =

aik xk

aii

i = n, n 1, . . . , 1 Runtime:
Divisions: n Number of additions and multiplications:
n n1

(i 1) =
i=1 i=1

1 1 i = n(n 1) n2 2 2

Substitution is much faster than elimination! 5.2.1.3 Backward Elimination

A slight variant of the backward substitution is the backward elimination, where the upper right triangle of the matrix is being substituted similarly to the Gau elimination. This variant is called Gau-Jordan method. One application of this method is the computation of inverse matrices. Theorem 5.2 Correctness: The Gauian Method results in a unique solution (x1 , . . . , xn ) if and only if the linerar system L has a unique solution (x1 , . . . , xn ). Proof: as exercise

5.2.2

Iterative improvement of the solution

Let x the calculated solution of Ax = b with the Gau method. In general Ax = b r with r = 0 (r: residual vector) because of x = x + x. Ax = A(x x) = b r A x = r With this equation the correction x can be calculated. better approximation for x: x(2) = x + x Iterative Method: x(1) := x for n = 1, 2, 3, . . .: r(n) = b Ax(n) calculate x(n) nach Ax(n) = r(n) x(n+1) = x(n) + x(n)

96 Note:

5 Numerical Mathematics Fundamentals

1. usually (A not very ill-conditionated) very few iterations ( 3) necessary. n3 ). With LU decomposition (see 5.2.3) 2. Solving Ax(n) = r(n) is time-consuming: O( 1 3 (n) (n) 2 of A, Ax = r can be solved in O(n ) steps. 3. Must a system of equations be solved for more than one right hand side, all solutions will be calculated simultaneously (elimination necessary only once!)

5.2.3

LU-Decomposition

The Gauian elimination (see algorithm) multiplies row i with the factor qik := aik /akk for the elimination of each element aik in the k -th column below the diagonal. If we write all calculated qik in a lower triangular matrix, in which we add ones in the diagonal, we get 1 0 ... ... 0 . . q21 1 0 . . ... . L := . q31 q32 1 . . . ... ... . . . . 0 qn1 qn2 . . . qnn1 1 Furthermore, let U := A = a11 a12 a1n 0 a22 a2n . . .. .. . . . . . . 0 . . . 0 ann

the upper triangular matrix after the elimination. Theorem 5.3 Then L U = A holds and the solution x of the system Ax = b for any right hand side b can be calculated by solving the equation L c = b for c and solving U x = c for x. The system L c = b is solved by forward substitution and U x = c by backward substitution. Proof: We will show that L U = A. Then obviously it holds A x = L U x = b. Now we write L U = A in 1 q21 LU = q31 . . . qn1 detail: 0 1 ... 0 .. ... 0 1 0 . . . . . . a11 a12 a1n 0 a22 a2n . . .. .. . . . . . . 0 . . . 0 ann =A

. q32 1 . . . . .. .. . qn2 . . . qnn1

5.2 Numerics of Linear Systems of Equations We now apply the Gauian elimination on both sides and get 1 0 ... U =U 0 1

97

Thus LU = A. Because of the associativity of matrix multiplication only L has to be eliminated on the left side. Exercise 5.1 How could you factor A into a product U L, upper triangular times lower triangular? Would they be the same factors as in A = LU ?

5.2.4

Condition Analysis for Matrices


Ax = b with A : Matrix (n n) and x, b Rn

What is the Norm of a matrix? Vector Norm: Denition 5.4 (p-Norm) x Rn : x


p

= (|x1 |p + |x2 |p + + |xn |p ) p 1p<

Theorem 5.4 x
x = 0 : x
p

is a norm, i.e. it has the properties: x


p p p

>0 ;
p

=0x=0

R : x

= || x
p

x, y Rn : x + y

+ y

Lemma 5.1 (H older inequality) For real numbers p, q > 1 with n x , y R we have x y 1 x p y q.
n n

1 p

1 q

= 1 and vectors

Proof: Since x y

=
i=1 n

xi y i
i=1

|xi yi | it remains to prove


n
1 p

1 q

|xi yi |
i=1 i=1

|xi |p
i=1

|yi |q

For real numbers a, b > 0 we have (proof as exercise) ab ap b q + , p q

98 which we apply now to get


n

5 Numerical Mathematics Fundamentals

i=1

|xi yi | x p y
n

=
q p i=1 n

|xi ||yi | x p y
q

q i=1

1 |xi |p 1 |yi |q + p x p q y q p q
n p p

=
i=1

1 |xi | + p x p p

i=1

1 1 |yi | q = q y q p x

i=1

1 |xi | + q y
p

n q q

|yi |q =
i=1

1 1 + =1 p q

Proof of proposition 3 in Theorem 5.4: For the cases p = 1 and p = see exercises. For 1 < p < : |xi + yi |p = |xi + yi ||xi + yi |p1 (|xi | + |yi |)|xi + yi |p1 = |xi ||xi + yi |p1 + |yi ||xi + yi |p1 Summation yields
n n n

|xi + yi |
p i=1 i=1

|xi ||xi + yi |

p1

+
i=1

|yi ||xi + yi |p1 .

(5.1)

Application of the H older inequality to both terms on the right hand sides gives
n n
1 p

1 q

|xi ||xi + yi |p1


i=1 i=1

|xi |p
i=1
1 p

(|xi + yi |p1 )q
1 q

and
n n

|yi ||xi + yi |p1


i=1 i=1

|yi |p
i=1

(|xi + yi |p1 )q

what we substitute in Equation 5.1 to obtain 1


n n
p

1 p

1 q

|xi + yi |p
i=1 i=1

|xi |p

+
i=1

|yi |p

|xi + yi |p
i=1

In the rightmost factor we used (p 1)q = p. Now we divide by the rightmost factor, using 1 =1 1 and get the assertion p q
n
1 p

1 p

1 p

|xi + yi |p
i=1

i=1

|xi |p

+
i=1

|yi |p

Lemma 5.2 x

:= max |xi | = lim x


1in p

is

called maximum norm

In the following let x = x maximum norm:

5.2 Numerics of Linear Systems of Equations

99

Denition 5.5 For any vector norm

the canonical matrix norm is dened as follows: Ax x

A = max
x=0

Lemma 5.3 The matrix norm is a norm and for a n m matrix A it holds
m

= max

1in

|aij |
j =1

Ax A x AB A B Condition of a matrix: Consequence of errors in the matrix elements of A or the right hand side b on errors in the solution x. 1. Error in b: b x A(x + x) x x = = = = = b + b x + x b + b A1 b A1 b A1 b b b A 1 x b

b = Ax

b A x x A A1 x
x x b b

CA

with CA = A A1 CA : condition number of A 2. Error in A: (A + A)(x + x) = b x + x = (A + A)1 b = (A + A)1 Ax x = (A + A)1 A I x = (A + A)1 (A (A + A)) x = (A + A)1 Ax

100

5 Numerical Mathematics Fundamentals x (A + A)1 A x

A A x (A + A)1 A CA A1 A x A A CA analogous to Cp : Cp = f (x) x f ( x)

Example 5.10 1 a a 1 A 1 = 1 a2 x= 1 0 b 1 a a 1 for a > 0

A1 A = 1 + a,

A1 =

1+a 1 = 1 a2 1a

CA = A A1 = a=1.002: A= 1 1.002 1.002 1 A = 0.001,

(1 + a)2 1+a = 1 a2 1a 0 0.001 0.001 0

A =

CA = 1001 A = 2.002

0.001 x 1001 = 0.5 x 2.002

5.3

Roots of Nonlinear Equations

given: nonlinear equation f (x) = 0 sought: solution(s) (root(s))

5.3.1

Approximate Values, Starting Methods

Draw the graph of f (x), value table Example 5.11 f (x) = Table: x 2
2

sin x

5.3 Roots of Nonlinear Equations


1,9

101
x

1 sin(x) 2 (x / 2) x 1 Pi/2 2 Pi -0.6

f(x)

Figure 5.2: Graph to nd the start value. x (x/2)2 1,6 0,64 1.8 0.81 2.0 1.00 sin x f (x) 0.9996 < 0 0.974 < 0 0.909 > 0

Root in [1.8; 2.0] in general: if f continuous and f (a) f (b) < 0 f has a root in [a, b]. Interval bisection method Requirements : f : [a, b] R continuous and f (a) f (b) < 0. Without loss of generality f (a) < 0, f (b) > 0 (otherwise take f (x))
y

a=a0

m(k+2) mk m(k+1) b=b0 x

Figure 5.3: Root in the interval [a,b] can be determinated quickly by using the interval bisection
method..

Algorithm: mk = 1 (a + bk 1 ) 2 k1 (mk , bk1 ) if f (mk ) < 0 (ak , bk ) = (ak1 , mk ) if f (mk ) > 0 (Root found exactly if f (mk ) = 0)! Theorem 5.5 Let f : [a, b] R continuous with f (a) f (b) < 0. Then the interval bisection a method converges to a root x of f . After n steps x is determinated with a precision of b2 n .

102

5 Numerical Mathematics Fundamentals

For the proof of theorem 5.5 the following denition and theorem are required: Denition 5.6 A sequence (an ) is a Cauchy sequence, if: > 0 : N N : n, m N : |am an | <

Theorem 5.6 In R every Cauchy sequence converges. Proof of theorem 5.5: 1. Speed of Convergence: n-th step: 1 1 1 (bn an ) = (bn1 an1 ) = . . . = n (b0 a0 ) = n (b a). 2 2 2 2. Convergence: 1 1 x = mn+1 (bn an ) = mn+1 n+1 (b a) 2 2 For m n + 1 it holds |am an | bn an = 1 (b a) < for large enough n. 2n

(an ), (bn ) are Cauchy sequences (an ), (bn ) converges with


n

lim an = lim bn = x
n

because of f (an ) < 0 < f (bn ) and continuity of f . Note: 1. for each step, the precision is doubled, respectively the distance to the solution halved. Thus for each step, the precision is improved by a binary digit. because of 101 23.3 about 3.3 steps are necessary to improve the precision by a decimal digit. slow convergence! (Example: for 12-digits precision, about 40 steps required) 2. slow convergence, because only the sign of f is used, f (an ), f (bn ) is never used! better methods use f (x), f (x), f (x), . . . 3. interval bisection methods also applicable on discontinuous functions Exercise 4. discrete variants of interval bisection: Bisection Search (=ecient search method in ordered les) T (n) log2 (n) instead of T (n) n with n=number of entries in the le. limn f (an ) = f ( x) 0 limn f (bn ) = f ( x) 0 f ( x) = 0

5.3 Roots of Nonlinear Equations 5. Why log2 (n) steps? Let n = b a the number of entries in the le. bk ak Number of steps to bk ak 1 n 1 2k n k log2 n 2k 1 n ( b a) = k k 2 2

103

6. interval bisection methods globally convergent!

5.3.2

Fixed Point Iteration

Goal: Solution of equations of the form x = f ( x) Iterative Solution: x0 = a xn+1 = f (xn ) (Fixed Point Equation)

(n = 0, 1, 2, . . .)

Example 5.12 In Figure 5.4 the solution of the xed point equation x = f (x) for various functions f is shown graphically. Denition 5.7 A function f : [a, b] [a, b] R is called a contraction on [a,b], if a (Lipschitz) constant L with 0 < L < 1 exists with |f (x) f (y )| L|x y | x, y [a, b] Lemma 5.4 If f : [a, b] [a, b] is dierentiable, then f is a contraction on [a, b] with Lipschitz constant L if and only if holds: x [a, b] : |f (x)| L < 1

Proof: : let |f (x) f (y )| L|x y | x, y [a, b] x, y : lim : (more dicult omitted) |f (x) f (y )| L |x y |

xy

|f (x) f (y )| = |f (y )| L |x y |

104
y y=f(x) y=x

5 Numerical Mathematics Fundamentals


y y=f(x) y=x

f(x0)

f(x0)

x0

x0

y y=x

y y=x

y=f(x) f(x3) f(x2) f(x1) f(x0) y=f(x) f(x0)

x x0 x1 x2 x3 x0

Figure 5.4: two examples of divergent and convergent iterations.


f(x) y=f(x) y=x

sqrt(2)

Figure 5.5: . Example 5.13 f ( x) = 1 a x+ 2 x

f ( x) =

1 a 2 2 2x f (x) > 1

5.3 Roots of Nonlinear Equations a 1 2 > 1 2 2x 3 a > 2 2 2x a x> 3 a=2: x> f is a contraction on [
a 3

105

2 0.817 3

+ , ] for > 0.

Theorem 5.7 Banach Fixed Point Theorem: Let f : [a, b] [a, b] R be a contraction. Then the following holds 1. f has exactly one xed point s [a, b]. 2. For any initial value x0 [a, b] xed point iteration converges to s. 3. The cuto error can be estimated by: |s xk | For l = 0 we get |s xk | and for l = k 1 : |s xk | L |xk xk1 | 1L (a posteriori estimation). Lk |x1 x0 | (a priori estimation) 1L Lkl |xl+1 xl | for 0 l < k 1L

Proof: |xk+1 xk | = = = = for l = 0: |xk+1 xk | Lk |x1 x0 | |f (xk ) f (xk1 )| L|xk xk1 | L|f (xk1 ) f (xk2 )| L2 |xk1 xk2 | ... Lkl |xl+1 xl | for 0lk

106

5 Numerical Mathematics Fundamentals


k+m1

|xk+m xk | = |xk+m xk+m1 + xk+m1 . . . + . . . xk | =


=0 k+m1 =0 i= k

xi+1 xi

i= k

|xi+1 xi | Lk (Lm1 + Lm2 + . . . + L + 1)|x1 x0 | 1 Lm |x1 x0 | 0 1L f ur k

= Lk

(xk ) Cauchy Sequence (xk ) converges for s = limn xn we have f (s) = f (limn xn ) = limn f (xn ) = limn xn+1 = s. Thus s is xed point of f and s is unique, since for s1 , s2 with s1 = f (s1 ), s2 = f (s2 ) it holds: |s1 s2 | = |f (s1 ) f (s2 )| L|s1 s2 | Error estimation see [12] p. 188 Example 5.14 f (x) = f contract on [2, ] with L = 0.5. Theorem 5.7 (3) with l = k 1 : |s xk | L |xk xk1 | (a posteriori estimation) 1L 0.5 | 5 xk | |xk xk1 | = |xk xk1 | 1 0.5 (xn xn1 ) ( 5 xn ) 0.25 0.0139 0.000043 0.00000000042 1 a x+ 2 x a = 5, x0 = 2 because of L < 1 s1 = s2

n 0 1 2 3 4

xn 2 2.25 2.2361111 2.2360679779 2.2360679775

0.00000000042 (a posteriori) 0.031 (a priori) Note: Theorem 5.7 (3) gives estimation of the error without knowing the limit! Example 5.15 f (x) = exp (x) = x f : A A, A = [0.5, 0.69]

L = max |f (x)| = max | ex | = e


xA 0.5 xA

0.606531 < 1

5.3 Roots of Nonlinear Equations


f(x) y=x

107

1 y=exp(-x) x

Figure 5.6: . k 0 1 2 3 4 . . . 12 . . . xk 0.55 0.577 0.562 0.570 0.565 . . . 0.56712420 . . . Theorem 5.7 (3) with l = 0: |sxk | Lk |x1 x0 | (a priori estimation) 1L

Calculation of k , if |s xk | = 106 k log


(1L) |x 1 x 0 |

log L

22.3

20 0.56714309 21 0.56714340 22 0.56714323 Result:

Error after 12 steps: a priori: |s x12 | 1.70 104 a posteriori: |s x12 | 8.13 105 (better!)

The iteration in the rst example converges much faster than in the second example.

5.3.3

Convergence Speed and Convergence Rate

Denition 5.8 k := xk s is called cuto error Fixed Point Theorem (f contract): |k+1 | = |xk+1 s| = |f (xk ) f (s)| L|xk s| = L|k | Error decreases in each step by factor L! Theorem 5.8 If f : [a, b] [a, b] satises the conditions of Theorem 5.7 and is continuously dierentiable with f (x) = 0 x [a, b], then it holds:
k

lim

k+1 = f (s) k

108

5 Numerical Mathematics Fundamentals

Proof: as exercise Conclusions: k+1 qk with q := f (s) (convergence rate) (xk ) is called linear convergent with convergence rate |q |. 1 k after m steps error k+m 10 m =? k+m q m k = 101 k m log10 |q | 1 m

1 log10 |q |

|q | = |f (s)| 0.316 0.562 0.75 0.891 0.944 0.972 m 2 4 8 20 40 80 Theorem 5.9 Let f be contracting with f (s) = 0, x [a, b]f (x) = 0 and f continuous on [a, b]. Then it holds: k+1 1 lim 2 = f (s) k 2 k Conclusion: 1 with p := f (s) 2 quadratic convergence (convergence with order=2) Correct number of digits is doubled in each step (if p 1), because for k : k+1 p2 k k+1 = p2 k log k+1 = log p + 2 log k log p log k+1 = +2 log k log k
0

Example 5.16 k+1 = 108 , k = 104 Proof of Theorem 5.9: k+1 = xk+1 s = f (xk ) f (s) = f (s + k ) f (s) 1 f (s + k k ) f (s) = f (s) + k f (s) + 2 2 k
=0

1 2 = f (s + k k ) with 0 < k < 1 2 k because of f (x) = 0 x [a, b] and x0 = s it holds: k > 0 : xk s = k = 0 k+1 1 2 = f (s + k k ) k = 0, 1, 2, . . . k 2 1 1 1 = lim f (s + k k ) = f (s + lim (k k )) = f (s) k 2 k 2 2
=0

k+1 k 2 k lim

5.3 Roots of Nonlinear Equations

109

5.3.4

Newtons method

sought: Solutions of f (x) = 0


f(x)

x(k+1) x(k)

Figure 5.7: . The Tangent: T (x) = f (xk ) + (x xk )f (xk ) T (xk+1 ) = 0 f (xk ) + (xk+1 xk )f (xk ) = 0 (xk+1 xk )f (xk ) = f (xk )

xk+1 = xk

f (xk ) f ( xk )

(5.2)

k = 0, 1, 2, . . . with F (x) := x f (x) f ( x) is (5.2) for the xed point iteration

xk+1 = F (xk ) with F (s) = s (xed point) Theorem 5.10 Let f : [a, b] R three times continuously dierentiable and s [a, b] : f (s) = 0, as well x [a, b] : f (x) = 0 and f (s) = 0. Then there exists an interval I = [s , s + ] with > 0 on which F : I I is a contraction. For each x0 ,(xk ) is (according to 5.2) quadratically convergent. Proof: 1. F is a contraction in the area of s, i.e. |F (x)| < 1 fors x s + f (x)2 f (x)f (x) f (x)f (x) = f (x)2 f (x)2 0f (s) F (s) = = 0. f (s)2 F (x) = 1 (5.3)

110

5 Numerical Mathematics Fundamentals Because of the continuity of F , > 0 exists with F (x) L < 1 F is a contraction in I x [s , s + ] =: I

lim xk = s
k

2. Order of Convergence: Application of Theorem 5.9 on F : F (s) = 0 from (5.3) we get: F (x) = f (x)2 f (x) + f (x)f (x)f (x) 2f (x)f (x)2 f (x)3 F (s) = f (s)2 f (s) f ( s) = f (s)3 f (s)

According to Theorem 5.9 ,(xk ) is quadratically convergent on I if and only if f (s) = 0. (otherwise even higher order of convergence)

5.4

Exercises

Exercise 5.2 Prove the triangular inequality for real numbers, i.e. that for any two real numbers x and y we have |x + y | |x| + |y |. Exercise 5.3 a) Calculate the p-norm x
p

of the vector x = (1, 2, 3, 4, 5) for the values of p = 1, 2, . . . , 50.

b) Draw the unit circles of various p-norms in R2 and compare them. c) Prove that the p-norm is a norm for p = 1, . d) Show that for x 0 and 0 < p < 1 the inequality xp px 1 p holds (hint: curve sketching of xp px). e) Show by setting x = a/b and q = 1 p in the above inequality, that for a, b > 0 the inequality ap bq pa + qb holds. f ) Show using the above result that for a, b > 0, p, q > 1 and q p ab ap + bq holds. Exercise 5.4 Prove Lemma 5.2, i.e. that x
1 p

1 q

= 1 the inequality

= limp x

Exercise 5.5 a) Write a Mathematica program using LinearSolve, which solves a linear system symbolically and apply it to a linear system with up to seven equations. b) Show empirically that the length of the solution formula grows approximately exponentially with the number of equations. Exercise 5.6 Show that the addition of the k -fold of row i of a square matrix A to another row j can be expressed as the product G A with a square matrix G. Determine the matrix

5.4 Exercises G.

111

Exercise 5.7 Prove theorem 5.2, i.e. that the Gaussian method for solving linear systems is correct. Exercise 5.8 Apply elimination to produce 3 2 1 A= , A= 1 8 7 1 Exercise 5.9 Calculate for the matrix 1 2 3 A= 1 0 1 2 1 1 the matrices L and U of the LU decomposition. Then determine the solutions of Ax = b for the right sides (1, 1, 1)T and (3, 1, 0)T . Exercise 5.10 If A = L1 D1 U1 and A = L2 D2 U2 , prove that L1 = L2 , D1 = D2 and U1 = U2 . If A is invertible, the factorization is unique. 1 1 a) Derive the equation L and explain why one side is lower triangular 1 L2 D2 = D1 U1 U2 and the other side is upper triangular. b) Compare the main diagonals in that equation, and then compare the o-diagonals. Exercise 5.11 For the calculation of a, the iteration of xn+1 = a/xn with a > 0, x0 > 0 can be tried. a) Visualize the iteration sequence. b) Explain on the basis of drawing why the sequence does not converge. c) Prove that this sequence does not converge. d) How to change the iteration formula xn+1 = a/xn , so that the sequence converges? Exercise 5.12 a) What means convergence of a sequence (xn )nN ? (Denition!) b) Give a convergent, divergent, alternating convergent and alternating divergent sequence. c) Give at least one simple convergence criterion for sequences. Exercise 5.13 Apply the interval bisection method to the function f (x) = x(1 x) 1 x2 the factors L and 1 1 1 3 1 , A= 1 1 3 1 U for 1 1 4 4 4 8

with the initial interval [4, 1/2]. Calculate the limit of the sequence with at least 4 digits. Give reasons for the surprising result. Exercise 5.14 Sought are the solutions of the equation tan x = cos x in the interval [0, /2]. a) Show that the equation (5.4) in [0, /2] has exactly one solution. (5.4)

112

5 Numerical Mathematics Fundamentals

b) In the following, the equation (5.4) is to be solved by xed pointiteration. Therefore use the form: x = f (x) := arctan(cos x) (5.5) Give the smallest possible Lipschitz bound for f and a corresponding sub-interval of [0, /2]. c) Determine an a priori estimation for the number of iterations for a precision of at least 103 . d) Calculate the iteration sequence (xn ) of the xed-point iteration with the initial v alue x0 = /4 to n = 10. e) Determine an interval in which the root is for sure using the a posteriori estimation after 8 steps. f ) Why is the transformation of the equation (5.4) to x = arccos(tan x) less favorable than those used above? g) Write a simple as possible Mathematica program (3-4 commands!), which calculates the iteration sequence and stores it in a table. Exercise 5.15 Prove theorem 5.8, i.e. if f : [a, b] [a, b] is a contraction and is continuously dierentiable with f (x) = 0 x [a, b], then it holds:
k

lim

k+1 = f (s) k

Exercise 5.16 a) Prove that any contracting function f : [a, b] [a, b] R is continuous. b) Prove that not all contracting functions f : [a, b] [a, b] R are dierentiable. c) Prove that any dierentiable function f : D R, (D R open) is continuous.

Chapter 6 Function Approximation


6.1 Polynomial Interpolation

Example 6.1 Linear interpolation (see gure Figure 6.1) When there were no calculators, using logarithms for practical purposes was done with tables of logarithms. Only integers were mapped, intermediate values were determined by linear interpolation.
y 3.0903

3.0899

1230

1230.3 1231

Figure 6.1: Determination of lg (1230.3) using linear interpolation.

lg(1230) lg(1231) lg(1230.3) lg(1230.3)

= = =

3.0899 3.0903 ? 3.0899 + 4 0.0001 0.3 = 3.09002

6.1.1

Motivation

Higher order interpolation (quadratic,...) Tools for numerical methods (functional approximation, numerical dierentiation, integration ,...)

114

6 Function Approximation

6.1.2

The Power Series Approach

Given: Table (xk , yk ) for (k = 1, . . . , n) Sought: Polynomial p with p(xi ) = yi for (i = 1, . . . , n) Ansatz: p(x) = a1 + a2 x + + an xn1
n1 = yi a1 + a2 xi + a3 x2 i + + an x i 1 a1 y1 1 . . A . . . = . . with A = . . an yn 1

for (i = 1, . . . , n) x1 x2 1 x2 x2 2 xn x2 n
n1 x1 n1 x2 . . . n1 xn

Vandermonde matrix Theorem 6.1 If x1 , . . . , xn are distinct, then for any y1 , . . . , yn there is a unique polynomial p of degree n 1 with p(xi ) = yi for (i = 1, . . . , n). Proof: To show that equation Aa = y is uniquely solvable, we show that the nullspace of A is 0 , i.e. Aa = 0 a = 0 : Aa = 0 i = 1, . . . , n : p(xi ) = 0 p(x ) 0 (zero polynomial) a =0

Example 6.2 Interpolation of sin(x) Table of values in {m, m + 1, . . . , 0, 1, 2, . . . , m} sin(0.5) = 0.479426 p(0.5) = 0.479422 (m=3, i.e. n=7 points) p(0.5) = 0.469088 (m=2, i.e. n=5 points) sin(x) is well approximated by the interpolating polynomial, even at relatively small number of given points (n=5,7), as can be seen in Figure 6.2, Figure 6.3 and Figure 6.4. Example 6.3 Interpolation of f (x) in the interval [-1,1]: f (x) = 1 1 + 25x2

Figure 6.5 clearly shows the poor approximation particulary in the margin areas. Idea: more given points in the margin areas Improvement: Chebyshev interpolation

6.1 Polynomial Interpolation


2

115

-3

-2

-1

-1

-2

Figure 6.2: Interpolation of sin(x) with n = 5 given points.


1

0.5

-4

-2

-0.5

-1

Figure 6.3: Interpolation of sin(x) with n = 7 given points.

Denition 6.1 For any f : [a, b] R we dene f

:= maxx[a,b] |f (x)|

Theorem 6.2 Let f : [a, b] R be n-times continuously dierentiable. Let a = x1 < x2 < . . . < xn1 < xn+1 = b and p the interpolating polynomial of degree n with p(xi ) = f (xi ) for (i = 1, . . . , n). Then f (x) p(x) = for a point z [a, b]. Note:
remainder term is the same as in Taylors theorem for x1 = x2 = = xn+1

f (n+1) (z ) (x x1 )(x x2 ) (x xn+1 ) (n + 1)!

116

6 Function Approximation

2 1

-7.5

-5

-2.5 -1 -2

2.5

7.5

Figure 6.4: Interpolation of sin(x) with n = 15 given points.


2

1.5

0.5

-1

-0.5

0.5

Figure 6.5: Interpolation with 11 given points.


right hand side equals zero for x = xi (i.e. in all given points)

Question: How should the given points x1 , . . . , xn+1 be distributed, to minimize (for constant n) the maximum error? Answer: Chebyshev interpolation Theorem 6.3 Let f : [1, 1] R and p the interpolating polynomial at the given points 1 x1 < < xn 1. The approximation error f p = maxx[1,1] |f (x) p(x)| is minimal for xk = cos 2k 1 n 2 (k = 1, . . . , n)

The values xk are called Chebyshev abscissas.

6.1 Polynomial Interpolation Example 6.4 Let n=6. the Chebyshev abscissas are (see also Figure 6.6). k 2k-1 (2k 1) -cos 12 1 2 3 4 5 6 1 3 5 7 9 11 -0.966 -0.707 -0.259 0.259 0.707 0.966
0.52 0.26

117

1 equidistant grid

0 Chebyshev abscissae

Figure 6.6: Distribution of the given points. Example 6.5 Figure 6.7 shows a signicant reduction in the maximum norm of the error when Chebyshev interpolation is applied.
1

0.8

0.6

0.4

0.2

-1

-0.5

0.5

Figure 6.7: Chebyshev interpolation with 11 given points. Corollar 6.1.1 Theorem 6.3 can be applied easily to functions f : [a, b] R, by calculating the given points tk for k = 1, . . . , n out of the Chebyshev abscissas xk by 1 1 tk = (a + b) + (b a)xk 2 2 . Additional notes: 1. Are polynomials suitable for approximating a given function f ? Polynomials are not suitable for functions alternating between strong and weak curvature or poles. Possibly: piecewise approximation by polynomials ( spline approximation) or approximation by rational functions.

118

6 Function Approximation

2. Is a polynomial well dened by the value tables data? equidistant given points Chebyshev abscissas or choose smaller degree of the poly nomial overdetermined system of linear equations (degree(p) 2 n in which n=Number of given points).

6.1.3

The Horner scheme


n

By using the following scheme, computing time will be saved in the evaluation of polynomials: p ( x) =
k=1

ak xk1 = a1 + a2 x + . . . + an xn1

= a1 + x(a2 + x(a3 + x(. . . + x(an1 + xan ) . . .)))

Iteration:

y0 := an yk := yk1 x + ank

k = 1, . . . , n 1

p(x) = yn1 Computing time: (n-1) Additions + Multiplications naive evaluation: (xk = x x . . . x x) k-times (n-1) additions, (n-2)-times potentiate, (n-1) multiplications
n1

k=
k=0

1 n(n 1) = (n2 n) multiplications 2 2

6.1.4

Function Approximation vs. Interpolation

In interpolation n points (xk , yk ) with (k = 1, . . . , n) are given and a function p (e.g., a polynomial of degree n-1) is sought with p(xk ) = yk for (k = 1, . . . , n). In the approximation of functions, a function f : [a, b] R is given (symbolically by a formula or a value table with possibly noisy values) and the task is to nd the simplest possible function p, which approximates f as good as possible with respect to a norm (e.g. maximum norm). The function p can be a polynomial but also a linear combination of basis functions such as p(x) = a1 sin x + a2 sin 2x + a3 sin 3x + + an sin nx where a1 , . . . , an are to be determinated). Interpolation can be used as a tool for function approximation.

6.2
6.2.1

Spline interpolation
Interpolation of Functions

Given: Value table (xk , yk ) with k = 0, 1, . . . , n

6.2 Spline interpolation

119

Sought: Interpolating (function) s(x) with s(xk ) = yk , and s(x) must be two times continuously dierentiable. Ansatz: piecewise cubic polynomials

s(x) y2 y1 y0 x0 s0(x) x x1 x2 x3
Figure 6.8: natural cubic spline through 4 points. The property of s(x) to be two times continuously dierentiable implies: s (x) continuous , s (x) continuous at all inner interval limits. 2 additional conditions for each cubic polynomial the n subpolynomials uniquely determined by 2 points + 2 derivation conditions.

s1(x)

s2(x)

ansatz: s ( x) requirements: si (xi ) sn1 (xn ) si (xi+1 ) si (xi+1 ) si (xi+1 )

for (i=0,. . . ,n-1) let = si (x) = ai (x xi )3 + bi (x xi )2 + ci (x xi ) + di = = = = = yi i=0,. . . ,n-1 yn si+1 (xi+1 ) i=0,. . . ,n-2 si+1 (xi+1 ) i=0,. . . ,n-2 si+1 (xi+1 ) i=0,. . . ,n-2

(6.1) (6.2) (6.3) (6.4) (6.5) (6.6)

n + 1 + 3(n 1) = 4n 2 linear equations for 4n unknowns 2 conditions are missing Additional condition (natural spline): s (x0 ) = 0, s ( xn ) = 0 (6.7)

120 substitution: hi = xi+1 xi (6.1), (6.2) (6.1), (6.2), (6.4) (6.1) (6.1) (6.1) (6.1)

6 Function Approximation

(6.8) (6.9) (6.10) (6.11) (6.12) (6.13) (6.14)

si (xi ) = di = yi 2 si (xi+1 ) = ai h3 i + bi hi + ci hi + di = yi+1 s i ( x i ) = ci si (xi+1 ) = 3ai h2 i + 2bi hi + ci si (xi ) = 2bi =: yi si (xi+1 ) = 6ai hi + 2bi = si+1 (xi+1 ) = yi+1

(6.13), (6.14) ai = (6.13) bi (6.9), (6.10), (6.13), (6.14) ci (6.9) di

1 (y yi ) 6hi i+1 1 y (6.16) = 2 i 1 hi = (yi+1 yi ) (yi+1 + 2yi ) hi 6 = yi

if yi are known, then also ai , bi , ci , di are known. (6.16) in (6.12): si (xi+1 ) = ii1: 1 hi (yi+1 yi ) + (2yi+1 + yi ) hi 6 1 hi1 (yi yi1 ) + hi1 (2yi + yi1 ) 6 (6.17)

si1 (xi ) =

because of si1 (xi ) = si (xi ) (Requirement (6.5)) 1 hi and si (xi ) = ci = (yi+1 yi ) (yi+1 + 2yi ) hi 6 follows 1 hi1 (yi yi1 ) + hi1 1 hi (2yi + yi1 ) = (yi+1 yi ) (yi+1 + 2yi ) 6 hi 6

Sorting of the y -variables to the left results in hi1 yi1 + 2(hi1 + hi )yi + hi yi+1 = for i = 1, 2, . . . , n 1. linear system for y1 , y2 , . . . , yn1 y0 , yn arbitrarily chooseable! y0 = yn = 0: natural spline 6 6 (yi+1 yi ) (yi yi1 ) hi hi1 (6.19)

6.2 Spline interpolation Example 6.6 n = 5 2(h0 + h1 ) h1 0 0 h1 2(h1 + h2 ) h2 0 0 h2 2(h2 + h3 ) h3 0 0 h3 2(h3 + h4 ) with ri = coecient matrix is tridiagonal

121

y1 y2 y3 = r y4

6 6 (yi+1 yi ) (yi yi1 ) hi hi1

Example 6.7 We determine a natural spline interpolant through the points (0, 0), (1, 1), (2, 0), (3, 1). It holds n = 3 and h0 = h1 = 1. The coecient matrix reads 2(h0 + h1 ) h1 h1 2(h1 + h2 ) with the right hand side r1 = 6(y2 y1 ) 6(y1 y0 ) = 12 r2 = 6(y3 y2 ) 6(y2 y1 ) = 12 yielding 4 1 1 4 with the solution y1 = 4, Inserting in (6.16) gives s0 (x) = 2/3 x3 + 5/3 x s1 (x) = 4/3 x3 6 x2 + 23/3 x 2 s2 (x) = 2/3 x3 + 6 x2 49/3 x + 14. with the graph y2 = 4, y0 = y3 = 0 y1 y2 = 12 12 = 4 1 1 4

6.2.2

Correctness and Complexity

122

6 Function Approximation

Denition 6.2 A n n matrix A is called diagonally dominant, if


n

|aii | >
k=1 k=i

|aik |

for i = 1, 2, . . . , n

Theorem 6.4 A linear system A x = b is uniquely solvable, if A is diagonally dominant. In the Gaussian Elimination neither row nor column swapping is needed.

Theorem 6.5 The computation time for the Gaussian elimination method for a tridiagonal matrix A is linear in the lenght n of A. Proof: (see Exercises)

Theorem 6.6 Spline-Interpolation: Let x0 < x1 < . . . < xn . There is a unique cubic spline interpolant s(x) with y0 = yn = 0 (natural Spline). It can be calculated in linear time (O(n)) by the method described above (by using the tridiagonal matrix algorithm, see exercise). The Tridiagonal Algorithm Elimination: m := ck1 /bk1 bk := bk m ck1 k = 2, . . . , n dk := dk m dk1 Backward substitution: dn := dn /bn dk := (dk ck dk+1 )/bk k = n 1, . . . , 1 xk = d k Proof: b1 c1 0 . . . c1 .. . .. . 0 .. . .. . .. . 0 0 . . .

0 0

0 .. . 0 .. . cn1 cn1 bn

x= dn1 dn

d1 d2 . . .

6.2 Spline interpolation 1. Existence and uniqueness Let x0 < x1 < . . . < xn hi = xi+1 xi > 0 2(hi1 + hi ) > hi1 + hi matrix diagonally dominant and uniquely solvable ai , bi , ci , di uniquely determined spline interpolant uniquely determined 2. Computation time (see Exercises) Other conditions: 1. y0 = yn = 0 (natural spline) 2. y0 = s (x0 ), yn = s (xn ) (s given) 3. y0 = y1 , yn = yn1 (s constant on the border) 4. s given at the border (best choice if s (x0 ), s (xn ) is known) 5. if y0 = yn : y0 = yn , y0 = yn (periodic condition)
y

123

y0=yn x

x0

xn

Figure 6.9: periodic condition at spline interpolation.

6.2.3

Interpolation of arbitrary curves


k 1 2 . . . xk x1 x2 . . . yk y1 y2 . . . yn

Example 6.8 Airfoil:

given: value table

n xn

The curve is not a function, therefore, naive interpolation is not applicable. Parameter representation (parameter t)

124
y

6 Function Approximation

...

P4

P3

P2

P1

P0

Figure 6.10: parametric plot of the given value pairs.

(xk , yk )
@ @ R @

(tk , xk )

(tk , yk )

(tk , xk ), (tk , yk ) unique, if (tk ) for k = 1, . . . , n monotonically increasing! k 0 1 2 . . . n ideal choice of tk : arc length good choice of tk : t0 = 0, tk = tk1 + ||Pk Pk1 || = tk 1 + (xk xk1 )2 + (yk yk1 )2 k = 1, 2, . . . , n tk 0 1 2 . . . n xk x0 x1 x2 . . . xn tk 0 1 2 . . . n yk y0 y1 y2 . . . yn

Simplest choice of tk : tk = k

Calculation of the spline curve 1. Computation of the spline function for (tk , xk ) x(t) 2. Computation of the spline function for (tk , yk ) y (t) 3. spline curve dened by: x = x(t) y = y (t) for 0 t tn

6.3 Method of Least Squares and Pseudoinverse

125

6.3
6.3.1

Method of Least Squares and Pseudoinverse


Minimization according to Gauss

Given: n measurements, i.e. value pairs (x1 , y1 ), . . . , (xn , yn ) function f (x, a1 , . . . , ak ) = f (x) k n Sought: Values for a1 , . . . , ak such, that
n

E (f (x1 ) y1 , . . . , f (xn ) yn ) =
i=1

(f (xi ) yi )2

gets minimal! Simplication: f is a linear combination of functions f (x, a1 , . . . , ak ) = a1 f1 (x) + a2 f2 (x) + + ak fk (x) E extremal j = 1, . . . , k :
n

(6.20)

E =0 aj

E (. . .) =
i=1

(a1 f1 (xi ) + + ak fk (xi ) yi )2


n k

E =2 aj E =0 aj
k

al f l ( x i ) y i
i=1 n k l=1

f j ( xi )
n

al fl (xi )fj (xi ) =


i=1 l=1 n n i=1

yi fj (xi )

l=1

al
i=1

fl (xi )fj (xi ) =


i=1 Ajl

yi fj (xi )
bj

l=1

Ajl al = bj

for (j = 1, . . . , k )

(6.21)

linear system of equations for the parameters a1 , . . . ak (Normal equations!) Solving of the normal equations gives a1 , . . . , ak . Note: normal equations are usually (not always) uniquely solvable (see Theorem 6.7).

Example 6.9 With the method of least squares the coecients a1 , a2 , a3 of the function f (x) = a1 x2 + a2 x + a3 using the given points (0, 1), (2, 0), (3, 2), (4, 1) are to be determined. First, we set up the normal equations:
k

Ajl al = bj
l=1

for (j = 1, . . . , k )

126 with Ajl =


i=1 n n

6 Function Approximation

fl (xi )fj (xi ), x4 i x3 i x2 i


n i=1 n i=1 n i=1

bj =
i=1 n i=1 n i=1 n i=1

yi fj (xi ).

It follows:

A=

n i=1 n i=1 n i=1

x3 i x2 i xi

353 99 29 x2 i xi = 99 29 9 1 29 9 4

and

b=

n 2 i=1 yi xi n i=1 yi xi n i=1 yi

34 = 10 2

The solution of this linear system is a1 = 3/22, a2 = 127/110, a3 = 61/55, because 3 22 353 99 29 34 99 29 9 10 = 127 110 29 9 4 2 61 55 The resulting parabola has the following form:
2 1.5 1 0.5 1 -0.5 -1 2 3 4

6.3.2

Application: rectication of photos

In RoboCup, so-called OmniCams are used. These are digital cameras that take a 360degree picture via a parabolic mirror (see g. 6.11). The mirror distorts the image considerably. With the Formula of mirror curvature a formula for conversion of pixel coordinates into real distances on the eld can be derived. Because this formula critically depends on adjustments of the camera, the mirror, the image can not be rectied completely. Therefore, to determine the transformation of pixel distances into real distances we approximate an polynomial interpolation. White markings are pasted on the eld at a distance of 25cm (g. 6.12) and the pixels distances to the center are measured. This gives the following value table:
dist. d [mm] 0 250 500 750 1000 1250 1500 1750 2000 2250 2500 2750 3000 3250 3500 3750 4000 4250 pixel dist. x 0 50 108 149 182 209 231 248 263 276 287 297 305 313 319 325 330 334

6.3 Method of Least Squares and Pseudoinverse

127

Figure 6.11: The RoboCup robot Kunibert with upward-pointing camera and mirror (left) and a distorted picture of the eld.
Data Polynom

5000

4000

3000
mm

2000

1000

0 0 50 100 150
p ixe l

200

250

300

350

Figure 6.12: The markers for the interpolation on the eld are shown in the left and the graph of the interpolating polynomial d(x) is in the right diagram.

Figure 6.13: Modied image after the edge detection (left) and the rectied image after application of the transformation (right).

128

6 Function Approximation

Now a polynomial of degree 6 (calculated with the method of least squares) is tted to the points. We get: d(x) = 3.02 1011 x6 2.57 108 x5 + 8.36 106 x4 1.17 103 x3 + 6.85 102 x2 + 3.51 x + 6.79 101 Fig. 6.13 shows the image before and after the transformation. Theorem 6.7 The normal equations are uniquely solvable if and only if the vectors f1 (x1 ) fk (x1 ) . . . . ,..., . . f1 (xn ) fk (xn ) are linearly independent.
y

f2(x)

f 1 ( x1 ) f2 (x1 ) . . . . = 2 . . f 1 ( xn ) f2 (xn ) f1 and f2 are nicht linearly independent on the grid (x1 , . . . , xn ).
x

f1(x)

x1

x2

x3

x4

x5

x6

x7

x8

...

xn

Proof: Normal equations uniquely solvable A non-singular


n

Ajl =
i=1

fl (xi )fj (xi ) fk (x1 ) . . . fk (xn )

A = FTF

f1 (x1 ) . . mit F = . f1 (xn )

Assumption: F T F is singular z = 0 : F T F z = 0 zT F T F z = F z
k 2 2

=0

Fz = 0
i=1

a i zi = 0 (a i =i-th colunm of F)

columns of F are linearly dependent contradiction to the assumption of Theorem 6.7

6.3 Method of Least Squares and Pseudoinverse

129

Example 6.10 We now show that the method of least squares is actually applicable in the example 6.9 and that the coecients are uniquely determined. According to Theorem 6.7 the following vectors must be linearly independent: 0 1 0 f1 (x1 ) f2 (x1 ) f 3 ( x1 ) 4 . . . , v2 = 2 , v3 = 1 . . . v1 = = = = . . . 9 3 1 f1 (x4 ) f ( x ) f ( x ) 2 4 3 4 16 4 1 If v1 , v2 , v3 are linear independent, there must be real numbers a, b, c = 0, so that 0 0 1 4 2 1 a 9 + b 3 + c 1 = 0. 16 4 1 Assume there are such Numbers a, b, c. Then it follows immediately c = 0 out of which 0 0 4 2 a 9 + b 3 = 0. 16 4 follows. But this means, v1 must be a multiple of v2 . This is obviously not the case. So v1 , v2 , v3 are linear independent.

6.3.3

Special Case: Straight Line Regression


n

regression line f (x, a, b) = ax + b E=


i=1 n

(axi + b yi )2

E = 2 a E = 2 b
n

(axi + b yi )xi = 0
i=1 n

(axi + b yi ) = 0
i=1 n n

a
i=1

x2 i a

+b
i=1 n

xi =
i=1 n

xi yi yi
i=1

xi + nb =
i=1

Solution: a = b = n xi yi xi yi n x2 xi )2 i ( x2 yi xi xi yi i n x2 xi ) 2 i (

130 Remains to be shown: The solution


a b

6 Function Approximation of degree E=0 is a minimum!

6.3.4

Statistical Justication

The method of least squares can be justied well with statistical methods. Here this is done only for one special case. Let f (x) = c be the constant function and c be sought.
y

x x1 x2 ... xn

Figure 6.14: Mean over all function values.


n n

E=
i=1

(f (xi ) yi ) =
2 i=1

(c yi )2

E = 2 c

(c yi ) = 2
i=1 n i=1

c
i=1

yi

= 2 nc
i=1

yi
n

=0

nc =
i=1

yi

c=

1 n

yi arithmetic mean
i=1

Errors of the coecients ai Because of measurement errors in (xi , yi ), the coecients a1 , . . . , ak are erroneous.Calculation of the errors a1 , . . . , ak out of y1 , . . . , yn with the law of error propagation (maximum error).1 n ai ai = yj yj j =1 For many measurements, the formula for the maximum error gives a too large value. A better approximation is obtained by the formula for the mean Error
1

yi is the absolute value of the maximum expected measurement error of variable yi .

6.3 Method of Least Squares and Pseudoinverse

131

ai =
j =1

ai yj

(yj )2

Special Case Straight Line Regression: a yj b yj 1 = N 1 = N


n

nxj
i=1 n

xi
n

x2 i
i=1

i=1

xi
2

xj

with N = n

x2 i

xi

a b =

=
j =1 n b j =1 yj

a yj yj

yj

a+da

a-da

b+db b b-db x

Figure 6.15: regression line through value pairs. Nonlinear Regression (Examples): Power function: v = c ud Constants c, d sought! log v = log c + d log u y := log v, x := log u a1 = log c, a2 = d y = a1 + a2 x

132 Exponential function: v = Aebu A, b sought ln v = ln A + bu

6 Function Approximation

y := ln v,

x := u,

a1 = ln A,

a2 = b

y = a1 + a2 x

6.3.5

Multidimensional Least Squares

The method presented so far is good for the approximation of functions f : R R, i.e. for one-dimensional functions with one-dimensional argument. In the setting of Equation 6.20 we determine the coecients a1 , . . . , ak of a linear combination of one-dimensional basis functions f1 , . . . , fk : f (x) = a1 f1 (x) + + ak fk (x) = a T f (x). (6.22) Now, there is a very easy generalization of this ansatz to multidimensional input. We just replace the one-dimensional x by a vector x to obtain f (x ) = a1 f1 (x ) + + ak fk (x ) = a T f (x ). In the derivation of the normal equations, proof, etc. there are no changes other than replacing x by a vector. A dierent way to get into the multidimensional world is the ansatz f ( x ) = a1 x 1 + + ak x k = a T x . The advantage here is that we do not have to worry about the selection of the basis functions fi . But there is no free lunch. The drawback is the very limited power of the linear approximation.

6.3.6

A More General View


f (x ) = a1 f1 (x ) + + ak fk (x ) = a T f (x )

We still want to t a function

with k unknown parameters a1 , . . . , ak through the n data points (x 1 , y1 ), . . . , (x n , yn ). If we substitute all the points into the ansatz, requiring our function to hit all n points, i.e. f (x i ) = yi , we get the linear system a1 f1 (x 1 ) + . . . + ak fk (x 1 ) = . . . . . . If we dene the n k -matrix M as Mij = fj (x i ), y1 . . . (6.23)

a1 f1 (x n ) + . . . + ak fk (x n ) = yn .

6.3 Method of Least Squares and Pseudoinverse Equation 54 reads M a = y.

133

For n > k the system is overdetermined and normally has no solution. In the next section, we will show how to nd an approximate solution by using the method of least squares. For the case n = k we may get a unique solution, because here M is a square matrix. If we use for j = 0, . . . , k the basis functions fj (x) = xj , we end up with the Vandermonde matrix from Section 6.1.2.

6.3.7

Solving Overdetermined Linear Systems


x1 + x2 + x3 x1 + x2 x1 + x3 x2 + x3 = = = = 1 1 1 1

The linear System

is not solvable, because it is overdetermined. Even though we have to accept this fact, we can ask, which vektor x fullls the linear system best. This can be formalized as follows: Given, an overdetermined linear system Mx = y with n equations and k < n unknowns x1 , . . . xk . M is a n k matrix, x Rk and y Rn . Obviously, in general, there is no vector x , for which M x = y . Therfore we are looking for a vector x , which makes the left side as good as possible equal to the right side. That is, for which M x y , or for which ||M x y ||2 = (M x y )2
2

gets minimal. It also follows that (M x y )2 gets minimal. So


n n k

((M x )i yi )2 =
i=1 i=1 l=1

Mil xl yi

must be minimal. To determine the minimum we set all partial derivatives equal to zero: xj
n k 2 n k

Mil xl yi
i=1 l=1

=2
i=1 l=1

Mil xl yi

Mij = 0

and get after multiplying out


n k n

Mil Mij xl =
i=1 l=1 i=1

Mij yi
n

or

n T Mji Mil

xl =
i=1

T Mji yi

l=1

i=1

or as a vector equation MT Mx = MT y. Therewith we have derived the following theorem (6.24)

134

6 Function Approximation

Theorem 6.8 Let an overdetermined linear system M x = y with x Rk , y Rn (n > k ) with least squared error can be determined and the n k matrix M be given. The solution x by solving the linear system = MT y. MT Mx This system has a unique solution if and only if the matrix M has full rank (This proposition is equivalent to theorem 6.7.). Please note that Equation 6.24 is identical to the normal equations (Equation 6.21, proof as exercise.) This linear system can be rewritten into = (M T M )1 M T y . x If M is invertible and the system M x = y is uniquely solvable, then the solution x can be calculated by x = M 1 y . , it is clear why the square matrix Comparing this equation with the above for x (M T M )1 M T is called pseudoinverse of M . The matrix M T M is the so called Gram matrix of M . Now we apply the theorem to the example at the beginning of the section which reads 111 1 110 1 101 x = 1 . 011 1 Application of Theorem 6.8 delivers 1 1110 1 MT M = 1 1 0 1 1 1011 0 and the equation 322 3 2 3 2 x = 3 223 3
3 3 3 T = (7 with the solution x , 7, 7) .

1 1 0 1

1 322 0 232 = 1 223 1

6.3.8
Let

Solving Underdetermined Linear Systems


Mx = y

be an underdetermined linear system with n equations and k > n unknowns x1 , . . . xk . So M is a n k matrix, x Rk and y Rn . Obviously, there are in general innitely many vectors x , with M x = y . So we can choose any one of these vectors. One way for this choice is to choose out of the set of solution vectors x one with minimal square norm x 2 .

6.3 Method of Least Squares and Pseudoinverse

135

The task to determine such a vector, can also be formulated as constrained extremum problem. A minimun of x 2 under n constraints M x = y is sought. With the method of Lagrange parameters it is x 2 + T (y M x ) For this scalar function, the gradient must become zero: ( x
2

+ T (y M x )) = 2x M T = 0

Multiplying the second equation from the left with M results in 2M x M M T = 0. Insertion of M x = y leads to 2y = M M T and = 2(M M T )1 y . With 2x = M T , we get x = 1/2 M T = M T (M M T )1 y . The matrix M T (M M T )1 is now a new pseudoinverse. (6.25)

6.3.9

Application of the Pseudoinverse for Function Approximation

Let k basis functions f1 , . . . , fk and n data points (x1 , y1 ), . . . , (xn , yn ) be given. We want to determine parameters a1 . . . ak for f (x) = a1 f1 (x) + . . . ak fk (x), such that for all xi the equation f (xi ) = yi is fullled as good as possible. For the three cases n < k , n = k and n > k we present examples. First, we determine the seven coecients of the polynomial f (x) = a1 + a2 x + a3 x2 + a4 x3 + a5 x4 + a6 x5 + a7 x6 with the help of the points (1, 1), (2, 1), (3, 1), (4, 1), (5, 4). Inserting the points results the underdetermined system of equations 1 1 1 1 1 1 1 1 1 2 4 8 1 16 32 64 1 3 9 27 81 243 729 a = 1 . 1 4 16 64 256 1024 4096 1 1 5 25 125 625 3125 15625 4 Computing the pseudoinverse and solving for a yields a T = (0.82, 0.36, 0.092, 0.23, 0.19, 0.056, 0.0055). The result is shown in Figure 6.16, left. We recognize that here, despite the relatively high degree of the polynomial a very good approximation is achieved, (why?).

136

6 Function Approximation

Reducing the degree of the polynomial to four, gives a quadratic matrix. It consists of the rst ve columns of the matrix above and the system becomes uniquely solvable with a T = (4., 6.25, 4.38, 1.25, 0.125). In Figure 6.16 (middle) oscillations can be seen, which are due to signicantly larger absolute values of the coecients. After a further reduction of the polynomial to two, only the rst three columns of the matrix remain and the solution via pseudoinverse delivers the least squares parabola with the coecients a T = (2.8, 1.97, 0.429) as shown on the right in g. 6.16.
4 4 4

5 1

5 1

Figure 6.16: Polynomial tted to data points in the underdetermined (k = 7, n = 5, left), unique (k = 5, n = 5, center) and overdetermined (k = 3, n = 5, right) case. We see that the work with underdetermined problems can be quite interesting and can lead to good results. Unfortunately this is not always the case. If we try for example, like in the example of the polynomial interpolation of g. 6.7 with xed number of 11 given points, to increase the degree of the polynomial, then, unfortunately, the oscillations increase too, instead of decrease (see g. 6.17). The parametric methods usually require some manual inuencing. In the next section we describe Gaussian processes a method that works very elegantly and requires minimal manual adjustments.
1
1.0

0.8
0.8

0.6

0.6

0.4

0.4

0.2

0.2

-1

-0.5

0.5

1.0

0.5

0.5

1.0

Figure 6.17: Ordinary Chebychef interpolation (left and Figure 6.7) with 11 points leading to a Polynomial of degree 10 and the solution of the underdetermined system for a polynomial of degree 12 with the same points (right) yielding somewhat higher error.

6.4 Exercises

137

6.3.10

Summary

With the method of least error squares and minimizing the square of the solution x , we have procedures to solve over and underdetermined linear systems. But there are also other methods. For example, in the case of underdetermined systems of equations, instead of determining x 2 , we could e.g. maximize the entropy
k

i=1

xi ln xi

or determine an extremum of another function x . The methods presented here are used mainly, because the equations to be solved remain linear. The computing time for calculating the pseudoinverse can be estimated in underdetermined and in overdetermined case by O(k 2 n + k 3 ). Slightly faster than the calculation of (M M T )1 it is using the QR decomposition or the Singular Value Decomposition (SVD). Then the time complexity is reduced to O(k 2 n). The here calculated pseudoinverses are so-called MoorePenrose pseudoinverses. That is, in the case of a matrix M with real-valued coecients, the pseudoinverse M + has the following features: M M +M = M M +M M + = M + Applied on M , M M + behaves indeed like an identity matrix.

6.4

Exercises

Polynomial Interpolation
Exercise 6.1 a) Let the points (1, 1), (0, 0) and(1, 1) be given. Determine the interpolation polynomial through these three points. b) Let the points (1, 1), (0, 0), (1, 1) and (2, 0).be given. Determine the interpolation polynomial through these four points. Exercise 6.2 a) Write a Mathematica program that calculates a table of all coecients of the interpolating polynomial of degree n for any function f in any interval [a, b]. Pass the function name, the degree of the polynomial and the value table as parameters to the program. The Mathematica functions Expand and Coefficient may be useful. b) Write for the value table generation a program for the equidistant case and one for the Chebyshev abscissas. Exercise 6.3 2 a) Apply the program of exercise 6.2 to the interpolation of the function f (x) := ex in the interval [2, 10] and calculate the polynomial up to the 10th degree. The given points are to be distributed equidistant. Exercise 6.4 a) Calculate the maximum norm of the deviation between the interpolation polynomial p and f from exercise 6.3 on an equidistant grid with 100 given points.

138

6 Function Approximation

b) Compare the Equidistant interpolation with the Chebyshev interpolation and with the Taylor series of f of degree 10 (expanded around x0 = 0 and x0 = 4, use the function ( tt Series)) with respect to maximum norm of the approximation error.

Spline-Interpolation
Exercise 6.5 Given two points (1, 1) and (2, 0) for computing a cubic spline with natural constraints (y0 = yn = 0). a) How many lines and columns has the tri-diagonal matrix for computing the y -variables? b) Determine the spline by manually calculating the coecients ai , bi , ci , di Exercise 6.6 The points (1, 1), (0, 0) and(1, 1) are given. a) Determine the two cubic part splines with natural boundary conditions. b) Why s0 (x) = x2 and s1 (x) = x2 is not a cubic spline function with natural boundary conditions? Argue unrelated to the correct solution. Exercise 6.7 How does the coecient matrix for the spline interpolation change, if instead of the boundary conditions y0 = yn = 0, the boundary conditions y0 = y1 , yn = yn1 (second derivative at the border) would be demanded? Change the coecient matrix of example 7.1 accordingly. Exercise 6.8 Program the tridiagonal matrix algorithm. Exercise 6.9 table. Write a program to calculate a natural cubic spline out of a given value

Exercise 6.10 Apply the program from Exercise 6.9 on the interpolation of the function 2 f (x) := ex in the interval [2, 10] on a equidistant Grid with 11 points. Exercise 6.11 Iterated Function Systems (IFS): a) Calculate the value tables of the two sequences (xn ), (yn ) with xn+1 = a yn + b yn+1 = c xn + d x0 = y 0 = 1 to n = 20, where use the parameter values a = 0.9, b = 0.9, c = 0.9, d = 0.9. b) Connect the points (x0 , y0 ) . . . (xn , yn ) with a cubic natural spline. Select as parameter for the parametric representation the points euclidean distance.

Least Squares and Pseudoinverse


Exercise 6.12 With the method of least squares the coecients a1 , a2 of the function a1 a2 f ( x) = x 2 + (x9)2 using the given points (1, 6), (2, 1), (7, 2), (8, 4) are to be determined. a) Set up the normal equations. b) Calculate the coecients a1 , a2 . c) Draw f in the interval (0, 9) together with the points in a chart. Exercise 6.13

6.4 Exercises a) Write a Mathematica program to determine the coecients a1 . . . ak of a function f (x) = a1 f1 (x) + a2 f2 (x) + + ak fk (x)

139

with the method of least squares. Parameters of the program are a table of data points, as well as a vector with the names of the base functions f1 , . . . , fk . Try to work without for loops and use the function (LinearSolve). b) Test the program by creating a linear equation with 100 points on a line, and then use your program to determine the coecients of the line. Repeat the test with slightly noisy data (add a small random number to the data values). c) Determine the polynomial of degree 4, which minimizes the sum of the error squares of the following value table (see: http://www.hs-weingarten.de/~ertel/vorlesungen/ mathi/mathi-ueb15.txt):
8 9 10 11 12 13 14 15 16 17 -16186.1 -2810.82 773.875 7352.34 11454.5 15143.3 13976. 15137.1 10383.4 14471.9 18 19 20 21 22 23 24 25 26 27 8016.53 7922.01 4638.39 3029.29 2500.28 6543.8 3866.37 2726.68 6916.44 8166.62 28 29 30 31 32 33 34 35 36 37 10104. 15141.8 15940.5 19609.5 22738. 25090.1 29882.6 31719.7 38915.6 37402.3 38 39 40 41 42 43 44 45 46 41046.6 37451.1 37332.2 29999.8 24818.1 10571.6 1589.82 -17641.9 -37150.2

d) Calculate to c) the sum of the squares. Determine the coecients of a parabola and calculate again the sum of the error squares. What dierence do you see? e) Which method allows you to determine experimentally, at several possible sets of basis functions, the best? f ) Find a function which creates an even smaller error. Exercise 6.14 Given: (0, 2), (1, 3), (2, 6). Determine with the method of least squares the coecients c and d of the function f (x) = c edx . Note that the parameter d occurs nonlinear! Exercise 6.15 a) Change the right hand side of the rst system of equations at the beginning of Section 6.3.7, so that it gets uniquely solvable. b) Which condition must hold, such that a linear system with n unknowns and m > n equations is uniquely solvable? Exercise 6.16 Use Theorem 6.8 to solve the system of equations x1 = 1, x1 = 2, x2 = 5 , x2 = 9 , x3 = 1 , x3 = 1 by the method of least squares. Exercise 6.17 Show that for the pseudoinverse M + of the sections 6.3.7 and 6.3.8 it holds M M + M = M andM + M M + = M + . Exercise 6.18 Show that the computing time for the calculation of the pseudoinverse in sections 6.3.7 and 6.3.8 can be estimated by O(k 2 n + k 3 ). Exercise 6.19 Prove that the equation M T M x = M T y for the approximate solution of an overdetermined linear system M x = y (Equation 6.24) is equivalent to the normal equations from the least squares method (Equation 6.21). Exercise 6.20 Given M , M= 8 2 2 2 4 1 (6.26)

140

6 Function Approximation

a) Perform the SVD decomposition and write M in the form M = U Σ V^T.
b) Compute the pseudoinverse M^+ of M.
c) Show that M^+ is a valid (Moore-Penrose) pseudoinverse.
d) Show that the pseudoinverse of M, computed using the technique for the underdetermined system mentioned in Section 6.3.8, is the same as the one computed by SVD.
Exercise 6.21 Given the following matrix M,

M = ( 3  6
      2  4
      2  4 )

a) Show that the technique for the overdetermined system mentioned in Section 6.3.7 is not applicable for computing the pseudoinverse of M.
b) Perform the SVD decomposition and write M in the form M = U Σ V^T.
c) Compute the pseudoinverse M^+ of M.
d) Show that M^+ is a valid pseudoinverse.

Chapter 7 Statistics and Probability


7.1 Random Numbers

7.1.1 Applications of Random Numbers

- Randomized algorithms
- Stochastic simulation (Monte Carlo simulation)
- Cryptography (e.g., key generation, one-time pad)

Literature: Don Knuth, The Art of Computer Programming, volume 2. In [19] U. Maurer gives a good definition of randomness:

Definition 7.1 A random bit generator is a device that is designed to output a sequence of statistically independent and symmetrically distributed binary random variables, i.e., that is designed to be the implementation of a so-called binary symmetric source (BSS). In contrast, a pseudo-random bit generator is designed to deterministically generate a binary sequence that only appears as if it were generated by a BSS.

Definition 7.2 A binary variable is symmetrically distributed if the probability for both values is exactly 1/2. A sequence is random iff for any length ℓ the distribution of all strings of length ℓ has maximum entropy.

Definition 7.3 A Pseudo Random Number Generator (PRNG) is an algorithm that (after entering one or more seed numbers) deterministically generates a sequence of numbers. For cryptographic applications this is very problematic! Alternative:


Use of physical random events such as thermal noise or radioactive decay: true random numbers, produced by a True Random Number Generator (true RNG). Philosophy: until recently it was unknown whether hidden parameters describe a seemingly random process deterministically. Physicists have shown that there are real random processes.

7.1.2

Kolmogorov Complexity

If a (large) file can be compressed, then the content is not random. True random numbers cannot be compressed! Is (31415926...) random? No, because π = 3.1415926... can be compressed: a computer program can calculate any number of digits of π!

Definition 7.4 The Kolmogorov complexity of an (infinite) sequence is the length of a shortest program that can compute (enumerate) the sequence's terms [28]. π has finite Kolmogorov complexity. Any sequence of true random numbers has infinite Kolmogorov complexity! This notion is unsuitable in practice, since the Kolmogorov complexity is not computable. Each PRNG only produces sequences of finite Kolmogorov complexity. Such sequences are not random.

7.1.3

Compression of Random Number Sequences

Theorem 7.1 No program can compress all files of at least n bits (n ≥ 0) without loss.

Example 7.1
length n | bit sequences of length n              | number
0        | (empty sequence)                       | 1
1        | 0, 1                                   | 2
2        | 00, 01, 10, 11                         | 4
3        | 000, 001, 010, 011, 100, 101, 110, 111 | 8

There are 8 sequences of length 3, but only 7 shorter ones!

Proof: Suppose such a program existed. We compress with it (only!) all files of n bits. The compressed files do not exceed a size of n-1 bits. The number of possible compressed files of size 0 to n-1 bits is 1 + 2 + 4 + 8 + ... + 2^{n-1} = 2^n - 1. Because there are 2^n files of size n bits, at least two files have to be compressed to the same file. Thus, the compression is not lossless.


7.1.4

Pseudo Random Number Generators

Definition 7.5 Linear congruence generators are defined recursively by

x_n = (a x_{n-1} + b) mod m

with parameters a, b and m. [29] recommends for 32-bit integers a = 7141, b = 54773 and m = 259200. The period does not exceed m. Why? (see Exercise 7.3)

Theorem 7.2 The functional characteristics of a congruence generator lead to the following upper bounds for the period:

recursion scheme                           | period
x_n = f(x_{n-1}) mod m                     | m
x_n = f(x_{n-1}, x_{n-2}) mod m            | m^2
x_n = f(x_{n-1}, x_{n-2}, x_{n-3}) mod m   | m^3
...

Proof: With the modulus m we have only m different values for x_n. Since f is deterministic, if x_n = f(x_{n-1}) mod m, then after the first repeated value all succeeding values repeat as well. Thus the period is at most m. If f depends on two previous values, then there are m^2 combinations; thus the period is bounded by m^2, and so on. Apparently, the more predecessors x_n depends on, the longer the period can become. So it seems natural to use as many predecessors as possible. We try it with the sum of all predecessors and get
x_0 = a,   x_n = (Σ_{i=0}^{n-1} x_i) mod m,

which may even lead to a non-periodic sequence, because the number of used predecessors gets bigger with increasing n. Let us first consider the specified sequence with x_0 = 1 non-modular: 1, 1, 2, 4, 8, 16, 32, 64, 128, 256, ... Obviously this is an exponential sequence, hence

Theorem 7.3 The recursively defined formula x_0 = 1, x_n = Σ_{i=0}^{n-1} x_i for n ≥ 1 is equivalent to x_n = 2^{n-1}.

Proof: For n ≥ 2 we have

x_n = Σ_{i=0}^{n-1} x_i = x_{n-1} + Σ_{i=0}^{n-2} x_i = x_{n-1} + x_{n-1} = 2 x_{n-1}.


For n = 1, x_1 = x_0 = 1. Now it can be shown easily by induction that x_n = 2^{n-1} for n ≥ 1 (see Exercise 7.3).
For the modular sequence x_0 = 1, x_n = (Σ_{i=0}^{n-1} x_i) mod m is equivalent to x_n = 2^{n-1} mod m for n ≥ 1. Thus x_n depends only on x_{n-1}, and m is the period's upper bound. The period of the sequence is even at most m - 1, because once zero is reached, the result remains zero. Not only the period is important for the quality of a PRNG; the symmetry of the bits should be good as well.
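The following Octave sketch (not part of the original text; parameters as recommended in Definition 7.5) implements the linear congruential generator, counts how many values occur before the first repetition, and checks the symmetry of the low-order bits:

  % linear congruential generator x_n = (a*x_(n-1) + b) mod m  (Definition 7.5)
  a = 7141; b = 54773; m = 259200;
  x = 1;                      % seed
  seen = zeros(1, m);         % remembers which values occurred already
  count = 0;
  for n = 1:m+1
    x = mod(a*x + b, m);
    if seen(x+1)              % first repeated value found
      break;
    end
    seen(x+1) = 1;
    count = count + 1;
  end
  printf("distinct values before first repetition: %d (m = %d)\n", count, m);
  % symmetry of the low-order bits: the mean should be close to 1/2
  x = 1; bits = zeros(1, 10000);
  for n = 1:10000
    x = mod(a*x + b, m);
    bits(n) = mod(x, 2);
  end
  printf("mean of %d bits: %f\n", length(bits), mean(bits));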

7.1.5

The Symmetry Test

In principle, it is easy to test a bit sequence for symmetry. The mean of an n-bit sequence has to be calculated,

M(X_1, ..., X_n) = (1/n) Σ_{i=1}^{n} X_i,

and compared with the expected value E(X) = 1/2 of a true random bit sequence. If the deviation of the mean from the expected value is small enough, the sequence passes the test. Now we want to calculate a threshold for the tolerable deviation. The expected value of a true random bit X is E(X) = 1/2, and so is its standard deviation σ(X) = 1/2 (see Exercise 7.4). The mean of n true random numbers will deviate less from the expected value the larger n gets. The central limit theorem (Theorem 4.4) tells us that for n independent identically distributed random variables X_1, X_2, ..., X_n with standard deviation σ, the standard deviation of the sum S_n = X_1 + ... + X_n is equal to √n σ. Thus, the standard deviation σ_n of the mean

M(X_1, ..., X_n) = (1/n) Σ_{i=1}^{n} X_i

of n random bits is

σ_n = (1/n) √n σ(X_1) = σ(X_1)/√n.

Because σ(X_i) = 1/2 for random bits, we get

σ_n = 1/(2√n).

A normally distributed random variable has a value in [μ - 2σ, μ + 2σ] with probability 0.95. This interval is the confidence interval to the level 0.95. We define the test of randomness as passed if the mean of the bit sequence lies in the interval [1/2 - 2σ_n, 1/2 + 2σ_n].

7.1.5.1 BBS Generator (Blum Blum Shub)

Even polynomial congruential generators of the form

x_n = (a_k x_{n-1}^k + a_{k-1} x_{n-1}^{k-1} + ... + a_0) mod m

can be cracked. Therefore, it is natural to look for better generators. A PRNG that generates bits of very high quality is the so-called BBS generator (see [23]):
Choose primes p and q with p ≡ q ≡ 3 mod 4.

Calculate n = p·q and choose a random number s with gcd(s, n) = 1. Calculate the seed x_0 = s^2 mod n. The generator then repeatedly computes (starting with i = 1)

x_i = (x_{i-1})^2 mod n,   b_i = x_i mod 2,

and outputs b_i as the i-th random bit. BBS is considered very good, but: a one-time pad operated with BBS is only as safe as a cipher with a key length of |s|.
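A minimal Octave sketch of the BBS generator; the primes chosen here are far too small and serve only as an illustration:

  % BBS generator with (toy) example primes p = q = 3 mod 4
  p = 11; q = 19;              % 11 mod 4 = 3, 19 mod 4 = 3
  n = p*q;
  s = 3;                       % random s with gcd(s, n) = 1
  x = mod(s^2, n);             % seed x0 = s^2 mod n
  bits = zeros(1, 20);
  for i = 1:20
    x = mod(x^2, n);           % x_i = x_(i-1)^2 mod n
    bits(i) = mod(x, 2);       % b_i = x_i mod 2
  end
  disp(bits)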


7.1.6

Linear Feedback Shift Registers

Definition 7.6 A shift register of length n consists of a bit vector (x_n, ..., x_1). In each step, the bits are shifted one position to the right, i.e. x_n → x_{n-1}, ..., x_2 → x_1, a new bit In is inserted on the left and the last bit Out is output: In → x_n, x_1 → Out. A Linear Feedback Shift Register (LFSR) computes the new input (In) by modulo-2 addition of certain bits of the register.

Example 7.2 LFSR1:

x3 x2 x1 | Out
 1  1  1 |
 0  1  1 | 1
 1  0  1 | 1
 1  1  0 | 1
 0  1  1 | 0

(Figure: LFSR1 with register cells x3, x2, x1 and feedback.) The period is 3.
Example 7.3 LFSR2 has the period 7:

x3 x2 x1 | Out
 1  1  1 |
 0  1  1 | 1
 1  0  1 | 1
 0  1  0 | 1
 0  0  1 | 0
 1  0  0 | 1
 1  1  0 | 0
 1  1  1 | 0

The maximum period of an LFSR of length n is 2^n - 1. Why?

Example 7.4 Analysis of an LFSR of length 3. We look at the bit sequence B = (01110010) and search for the parameters a_1, a_2, a_3.


(Figure: LFSR of length 3 with feedback coefficients a3, a2, a1.)

The LFSR can be represented mathematically by repeatedly applying the mapping

(x3, x2, x1) → (a1 x1 ⊕ a2 x2 ⊕ a3 x3, x3, x2).

The first three bits of the sequence B represent the state of the LFSR at a specific time, i.e. x1 = 0, x2 = 1, x3 = 1, giving the state (x3, x2, x1) = (1, 1, 0). For each subsequent time unit we read off the states

(1, 1, 1) = (a2 ⊕ a3, 1, 1)         (7.1)
(0, 1, 1) = (a3 ⊕ a2 ⊕ a1, 1, 1)    (7.2)
(0, 0, 1) = (a2 ⊕ a1, 0, 1)         (7.3)

From (7.1), (7.2), (7.3) we obtain the equations

a2 ⊕ a3 = 1        (7.4)
a3 ⊕ a2 ⊕ a1 = 0   (7.5)
a2 ⊕ a1 = 0        (7.6)

and calculate

(7.4) in (7.5): 1 ⊕ a1 = 0, so a1 = 1   (7.7)
(7.7) in (7.6): a2 ⊕ 1 = 0, so a2 = 1   (7.8)
(7.8) in (7.4): a3 = 0                  (7.9)

Thus the shift register has the form

and the sequence of states of one period of LFSR3 is

(1,1,0), (1,1,1), (0,1,1), (0,0,1), (1,0,0), (0,1,0), (1,0,1), (1,1,0),

whose x1-components reproduce the output sequence B = 01110010.


Note that for the analysis only six bits of the output sequence were used and that LFSR3 has maximum period. In general it can be shown that for analysing a linear feedback shift register of length n at most 2n bits of the output sequence are required (Berlekamp-Massey algorithm).

Definition 7.7 The linear complexity of a sequence is the length of the shortest LFSR that can generate the sequence.

If a key sequence has finite linear complexity n, then only 2n sequence bits are required to crack the code of the corresponding stream cipher. (Compare this with the Kolmogorov complexity.)
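The analysis of Example 7.4 can be checked with a small Octave simulation of LFSR3 (our own code, not from the original text):

  % LFSR of length 3 with feedback coefficients (a1, a2, a3) = (1, 1, 0)
  a = [1 1 0];
  state = [1 1 0];                  % (x3, x2, x1) initial state
  out = zeros(1, 8);
  for t = 1:8
    out(t) = state(3);                                                 % x1 is read off
    newbit = mod(a(1)*state(3) + a(2)*state(2) + a(3)*state(1), 2);    % XOR as mod-2 sum
    state = [newbit state(1) state(2)];                                % shift right, insert new bit
  end
  disp(out)    % prints the bit sequence B = 0 1 1 1 0 0 1 0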

7.1.7

True Random Numbers


- Physical noise source, A/D converter, amplifier, filter, test(?)
- Special hardware (thermal noise) for test purposes
- Special hardware for cryptographic applications is too expensive

Special Hardware

Intel: thermal noise of a resistor in the Pentium III processor Frequency: 75000 bits per second [30]

Maxtor: noise of IDE hard drives. Frequency: 835,200 bits per second [21]

7.1.7.1 The Neumann Filter

John von Neumann, 1963, invented the following mapping for repairing asymmetric sequences:

f: 00 → ε, 11 → ε, 01 → 0, 10 → 1,

where ε denotes the empty character string.

Example 7.5  10001101011100101110 → 10011
Example 7.6  11111111111111111111 → ε
Example 7.7  10101010101010101010 → 1111111111


Theorem 7.4 If consecutive bits in a long (n → ∞) bit sequence are statistically independent, then after application of the Neumann filter they are symmetrically distributed. The length of the bit sequence is shortened by the factor p(1 - p).

Proof: If in a sequence the bits are independent and take the value 1 with probability p, then the probability for a pair 01 equals p(1 - p). The probability for the pair 10 is also p(1 - p). Thus, the probability p_n for the value 1 after the application of the Neumann filter is given by

p_n = p(1 - p) / (2 p(1 - p)) = 1/2.

For the proof of the reduction factor we refer to Exercise 7.8.
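A small Octave sketch (our own) of the Neumann filter mapping f:

  % Neumann filter: 01 -> 0, 10 -> 1, 00 and 11 -> empty string
  function out = neumann(bits)
    out = [];
    for i = 1:2:length(bits)-1
      pair = bits(i:i+1);
      if pair(1) == 0 && pair(2) == 1
        out(end+1) = 0;
      elseif pair(1) == 1 && pair(2) == 0
        out(end+1) = 1;
      end                      % 00 and 11 produce no output
    end
  end

  % Example 7.5: returns 1 0 0 1 1
  neumann([1 0 0 0 1 1 0 1 0 1 1 1 0 0 1 0 1 1 1 0])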

Figure 7.1: Influence of asymmetry on the yield of the Neumann filter (plot of p(1-p) over p).

7.2 Calculation of Means - An Application for Functional Equations

7.2.1 Derivation of a Suitable Speedup Formula

Task: Runtime comparison of 3 computers


- Computer A: SUN-SPARC classic (to be compared with:)
- Computer B: PC Pentium 90
- Computer C: HP 9000/720

Example 7.8 Running time of Program 1:

Computer | Time T_X  | Speedup T_A/T_X
C_A      | 10.4 sec  | 1
C_B      |  8.1 sec  | 1.28
C_C      |  7.9 sec  | 1.32

Problem 1 The result is not representative.

Example 7.9 Running time of Program 2:

Computer | Time T_X | Speedup T_A/T_X
C_A      | 2.7 sec  | 1
C_B      | 4.3 sec  | 0.63
C_C      | 2.6 sec  | 1.04


Solution 1 Measure running times on a representative set of benchmarks (based on statistics of the applications of a typical user).

Example 7.10 Benchmarks I1, I2, I3:

Computer | I1 | I2 | I3  | mean T
C_A      | 1  | 2  | 100 | 34.3
C_B      | 2  | 4  | 47  | 17.7

Speedup of the mean runtimes T_A/T_B = 1.93. Is C_B almost twice as fast as C_A? No, only for benchmark I3!

Problem 2 The speedup S1 = T_A/T_B (the ratio of the mean runtimes) is a relative measure, but in the previous example S1 is determined only by benchmark I3 (the largest value).

Definition 7.8 Let x_1, ..., x_n ∈ R, then A: R^n → R with

A(x_1, ..., x_n) = (1/n) Σ_{k=1}^{n} x_k

is the arithmetic mean of x_1, ..., x_n.

Definition 7.9 Let τ_1, ..., τ_n (σ_1, ..., σ_n) be the running times of computer A (computer B) on the benchmarks I_1, ..., I_n. Then the speedup S1 is defined as:

S1(C_A, C_B) = A(τ_1, ..., τ_n) / A(σ_1, ..., σ_n) = (Σ_{k=1}^{n} τ_k) / (Σ_{k=1}^{n} σ_k)

Solution 2 Calculate the sum of the ratios instead of the ratio of the sums!

Definition 7.10

S2(C_A, C_B) = A(τ_1/σ_1, ..., τ_n/σ_n) = (1/n) Σ_{k=1}^{n} τ_k/σ_k

Application of S2 to the previous example:

S2(C_A, C_B) = A(1/2, 1/2, 100/47) = (1/2 + 1/2 + 100/47) / 3 = 1.04
S2(C_B, C_A) = A(2, 2, 47/100) = (2 + 2 + 0.47) / 3 = 1.49

Is C_A faster than C_B, or C_B faster than C_A?

Problem 3 S2(C_A, C_B) ≠ 1 / S2(C_B, C_A)

Example 7.11 Calculation of the speedup:

Computer | Runtime of Benchm. I1 | Runtime of Benchm. I2
C_A      | 1                     | 10
C_B      | 10                    | 1

S2(C_A, C_B) = (1/10 + 10)/2 = 5.05
S2(C_B, C_A) = (10 + 1/10)/2 = 5.05

Expected: S2 = 1!
Conjecture: the geometric mean solves the problem.

Definition 7.11 G: (R\{0})^n → R with

G(x_1, ..., x_n) = (x_1 · ... · x_n)^{1/n}

is the geometric mean of x_1, ..., x_n.

Definition 7.12

S3(C_A, C_B) = G(τ_1/σ_1, ..., τ_n/σ_n)

is called user speedup.

Remark: S3 solves problem 3:

S3(C_A, C_B) = (Π_{k=1}^{n} τ_k/σ_k)^{1/n} = 1 / (Π_{k=1}^{n} σ_k/τ_k)^{1/n} = 1 / S3(C_B, C_A)
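The three speedup definitions can be compared on the data of Example 7.10 with a few lines of Octave (illustrative only):

  tau   = [1 2 100];            % runtimes of C_A on I1, I2, I3
  sigma = [2 4  47];            % runtimes of C_B
  S1 = mean(tau) / mean(sigma)            % ratio of the arithmetic means, approx. 1.9
  S2 = mean(tau ./ sigma)                 % arithmetic mean of the ratios, approx. 1.04
  S3 = prod(tau ./ sigma)^(1/3)           % geometric mean of the ratios
  S3 * prod(sigma ./ tau)^(1/3)           % = 1, so S3(C_A,C_B) = 1/S3(C_B,C_A)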


7.2.2 Requirements for a Speedup Function M: R^n_+ → R_+

A speedup function of relative quantities must fulfill the following functional equations:
1. M(x, ..., x) = x
2. M(x_1, ..., x_n) · M(y_1, ..., y_n) = M(x_1 y_1, ..., x_n y_n)
3. M(x_1, ..., x_k) = M(x_{π(1)}, ..., x_{π(k)}) for each permutation π on {1, ..., k}

Explanation of requirement 2: if C_B is 10 times faster than C_A and C_C is 2 times faster than C_B, then C_C is 20 times faster than C_A:

S(C_A, C_B) · S(C_B, C_C) = S(C_A, C_C)
M(τ_1/σ_1, ..., τ_n/σ_n) · M(σ_1/ρ_1, ..., σ_n/ρ_n) = M(τ_1/ρ_1, ..., τ_n/ρ_n)
M(x_1, ..., x_n) · M(y_1, ..., y_n) = M(x_1 y_1, ..., x_n y_n)

Theorem 7.5 The geometric mean G(x_1, ..., x_n) is the one and only function M: R^n_+ → R_+ which fulfills requirements 1, 2 and 3.

Proof:

M(x_1, ..., x_n)^n = M(x_1, ..., x_n) · M(x_2, ..., x_n, x_1) · ... · M(x_n, x_1, ..., x_{n-1})
                   = M(x_1 ··· x_n, ..., x_1 ··· x_n) = x_1 ··· x_n,

hence M(x_1, ..., x_n) = (x_1 ··· x_n)^{1/n} = G(x_1, ..., x_n).

7.2.3

Application / Case Study: Randomized Depth-First Search

Randomized Algorithms

Definition 7.13 An algorithm A which receives, in addition to its input I, a sequence of random numbers is called a randomized algorithm. Note: in general the runtime of A (for fixed input I) depends on the random numbers.

Search Tree

(Figure: a binary search tree whose leaves are visited by depth-first search; the labels "3 success" and "7 fail" mark two of the leaves.)

Depth-first-search(Node, Goal)
  If GoalReached(Node, Goal) Return("Solution found")
  NewNodes = Successors(Node)
  While NewNodes ≠ ∅
    Result = Depth-first-search(First(NewNodes), Goal)
    If Result = "Solution found" Return("Solution found")
    NewNodes = Rest(NewNodes)
  Return("No solution")

Figure 7.2: The algorithm for depth-first search. The function First returns the first element of a list, and Rest the rest of the list.

7.2.4

Depth-First Search

Depth-first search searches the binary tree recursively until a solution is found. Randomized depth-first search: random choice of the left/right successor, giving many different possible runtimes (a runtime distribution) for a fixed tree.

Example 7.12 Four different trees, each with a solution:
(Figure: the four trees and their runtime distributions n(t). For the smallest tree the possible runtimes are t = 2, 3, 5, 6; two of the trees are shown with runtimes t = 2 and t = 5. For the larger trees the histograms n(t) have possible runtimes 3, 4, 6, 7, 10, 11, 13, 14 and 3, 4, 5, 7, 8, 9, 11, 12, 13, 16, 17, 18, respectively.)

7.2.5

How to measure speedup for such randomized algorithms?


S3(C_1, C_p) = G(τ_1/σ_1, ..., τ_n/σ_n)

is not meaningful, since the assignment τ_i ↔ σ_i does not exist! But:

S3(C_A, C_B) = G(τ_1/σ_1, ..., τ_1/σ_m; τ_2/σ_1, ..., τ_2/σ_m; ...; τ_n/σ_1, ..., τ_n/σ_m)

All possible ratios are calculated. Proceeding as above, but one of the requirements becomes meaningless. New axioms, thus a different (more difficult) proof.

7.3

Exercises

Exercise 7.1 Define the term random number generator in analogy to the term random bit generator. Instead of bits we now allow numbers from a finite set N.
Exercise 7.2 Can the Kolmogorov complexity of a sequence S be measured in practice? Discuss this question as follows:
a) Write pseudocode of a program that finds the shortest C program that outputs the given sequence S. Based on the grammar of the language C, this program generates all C programs of length 1, 2, 3, .... Each generated C program is then executed and the produced sequence compared with S.
b) Which problems appear with this program?
c) Modify the program such that it approximates the Kolmogorov complexity of a given sequence S.
Exercise 7.3


a) Why is the period of a linear congruential generator bounded above by the value of the modulus m? How could the generator be modified (for fixed m) to increase the period significantly?
b) Experiment with generators of the form x_n = (x_{n-1} + x_{n-2}) mod m and find cases for x_0 and m where the period is longer than m.
c) Consider generators of the form x_n = (x_{n-1} + x_{n-2} + x_{n-3}) mod m and find cases for x_0 and m where the period is longer than m^2.
d) Analyse the generators of the form x_n = (x_{n-1} + x_{n-2} + ... + x_0) mod m with x_0 = 1 for periodicity.
e) Prove by induction: If x_1 = x_0 = 1 and x_n = 2 x_{n-1} for n ≥ 2, then it also holds that x_n = 2^{n-1} for n ≥ 1.
Exercise 7.4
a) Calculate the expected value and standard deviation of a true binary random variable.
b) Draw the density function of a sum of 10, 100, 1000, 10 000 good random bits. Use the built-in Mathematica function Random or the Octave function rand. Then determine for each of the sums the sample standard deviation.
Exercise 7.5
a) Implement the mentioned linear congruential generator of the form x_n = (a x_{n-1} + b) mod m with a = 7141, b = 54773 and m = 259200 in a programming language of your choice.
b) Test this generator for symmetry and periodicity.
c) Repeat the test after applying the Neumann filter.
Exercise 7.6
a) Show that the bit sequence 110110110101010101010 passes the symmetry test.
b) Would you accept this sequence as a random bit sequence? Why?
c) Why is the symmetry test not sufficient to test the quality of a random number generator?
d) Suggest different randomness tests and apply them to the sequence.
Exercise 7.7 What can you say theoretically about the period of the BBS generator?
Exercise 7.8 Show that the length of a finite bit sequence (a_n) with a_n ∈ {0, 1} and independent bits is shortened by applying the Neumann filter by approximately the factor p(1 - p), if the relative proportion of ones is equal to p (Theorem 7.4).


7.4

Principal Component Analysis (PCA)


In multidimensional data sets quite often some variables are correlated or even redundant, as shown in the two-dimensional scatterplot beside (figure: scatter plot of correlated data points in the unit square). We may then, for example, reduce the dimensionality of the data. We follow chapter 12 in [7].

Given is a set of data points (x_1, ..., x_N), each x_n being a vector of D dimensions. We want to project the points into a lower dimensional space with M < D dimensions. We start by looking for the direction in D-dimensional space with the highest variance of the data. Let u_1 be a unit vector in this direction, i.e. u_1^T u_1 = 1. We project the data points x_n onto this direction, yielding the scalar values u_1^T x_n. The mean of the projected data is

(1/N) Σ_{n=1}^{N} u_1^T x_n = u_1^T (1/N) Σ_{n=1}^{N} x_n = u_1^T x̄

and their variance is

(1/(N-1)) Σ_{n=1}^{N} (u_1^T x_n - u_1^T x̄)² = u_1^T S u_1.

To see this, recall that the covariance of two scalar variables x_i and x_j is defined as

S_ij = (1/(N-1)) Σ_{n=1}^{N} (x_{ni} - x̄_i)(x_{nj} - x̄_j),

where x_{ni} is the i-th component of the n-th data sample. The covariance matrix is

S = (1/(N-1)) Σ_{n=1}^{N} (x_n - x̄)(x_n - x̄)^T.

Thus

u_1^T S u_1 = (1/(N-1)) Σ_{n=1}^{N} u_1^T (x_n - x̄)(x_n - x̄)^T u_1
            = (1/(N-1)) Σ_{n=1}^{N} u_1^T (x_n - x̄) · u_1^T (x_n - x̄)
            = (1/(N-1)) Σ_{n=1}^{N} (u_1^T x_n - u_1^T x̄)².

In order to find the vector u_1 which produces maximum variance u_1^T S u_1, we maximize this quantity by deriving it w.r.t. u_1. To prevent ||u_1|| → ∞ we have to use the normalization condition u_1^T u_1 = 1 as a constraint, which yields the Lagrangian

L = u_1^T S u_1 + λ_1 (1 - u_1^T u_1),

and the necessary condition for a maximum is

∂L/∂u_1 = 2 S u_1 - 2 λ_1 u_1 = 0,

yielding

S u_1 = λ_1 u_1,

which is the eigenvalue equation for the covariance matrix S. Obviously, if we choose λ_1 as the largest eigenvalue, we obtain the highest variance, i.e.

u_1^T S u_1 = u_1^T λ_1 u_1 = λ_1.

From this we now can conclude Theorem 7.6 The variance of the data points is maximal in the direction of the eigenvector u 1 to the largest eigenvalue of the covariance matrix S . This maximal eigenvector is called the principal component.

Application to the above data points yields the two eigenvectors u_1 = (0.788, 0.615)^T and u_2 = (-0.615, 0.788)^T

with the corresponding eigenvalues λ_1 = 0.128 and λ_2 = 0.011. The graph shows that the principal component u_1 points in the direction of highest variance. After finding the direction with highest variance, we partition the D-dimensional space into u_1 and its orthogonal complement. In the resulting (D-1)-dimensional space we again determine the principal component. This procedure is repeated until we have M principal components. The simple result is

Theorem 7.7 The eigenvectors u_1, ..., u_M to the M largest eigenvalues of S determine the M orthogonal directions of highest variance of the data set (x_1, ..., x_N).

Proof by induction: For M = 1 we refer to Theorem 7.6. Now assume the M directions with highest variance are already determined. Since u_{M+1} has to be orthogonal to u_1, ..., u_M, we require the constraints

u_{M+1}^T u_1 = u_{M+1}^T u_2 = ... = u_{M+1}^T u_M = 0.

Similarly to the above procedure we determine u_{M+1} by maximizing the variance of the data in the remaining space. As above, the variance of the data in the direction u_{M+1} is u_{M+1}^T S u_{M+1}. Together with the above M orthogonality constraints and the normality constraint u_{M+1}^T u_{M+1} = 1 we have to find a maximum of the new Lagrangian

L = u_{M+1}^T S u_{M+1} + λ_{M+1} (1 - u_{M+1}^T u_{M+1}) + Σ_{i=1}^{M} η_i u_{M+1}^T u_i

with respect to u_{M+1}. It turns out (Exercise 7.11) that the solution u_{M+1} has to fulfill

S u_{M+1} = λ_{M+1} u_{M+1},


i.e. it is again an eigenvector of S . Obviously we have to select among the D M not yet selected eigenvectors the one with the largest eigenvalue. We now apply PCA to the Lexmed data from example 4.4 in section 4.3. Some raw data samples are: 19 13 18 73 36 18 19 62 1 1 2 2 1 2 2 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 0 0 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 1 1 0 1 1 0 0 1 0 1 0 1 0 0 1 0 362 383 362 376 372 366 372 376 378 385 370 380 382 378 378 390 13400 18100 9300 13600 11300 13000 6400 22000 0 0 0 1 0 0 0 0

After normalization of the data to the interval [0, 1] we obtain the eigenvalues:

0.47, 0.24, 0.19, 0.16, 0.16, 0.11, 0.10, 0.039, 0.036, 0.023, 0.023, 0.016, 0.016, 0.01, 0.004. Due to the step after the 7th largest eigenvalue, a transformation of the data into the 7-dimensional space spanned by the eigenvectors of the 7 largest eigenvalues may be considered. If for visualization we project the data onto the two principal components (the eigenvectors of the two largest eigenvalues), we get for the raw data the left and for the normalized data the right diagram (figure).

The corresponding two eigenvectors for the raw data are:


(1, 0.2, 0.03, 0.02, 0.003, 0.06, 0.3, 0.04, 0.2, 0.2, 0.1, 3, 4, 10000, 0.004) · 10^{-4}
(100, 0.10, 0.16, 0.05, 0.04, 0.17, 0.27, 0.06, 0.09, 0.08, 0.03, 3.34, 5.66, 0.02, 0.17) · 10^{-2}

The first vector projects on the leukocyte value and the second on a combination of the age and the fever values. Why?
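A compact Octave sketch of PCA (our own code, not from the lecture): it centers the data, computes the covariance matrix with its eigen-decomposition, and projects onto the M largest principal components as in Theorem 7.7:

  % PCA: rows of X are data points x_n, columns are the D variables
  function [T, U, lambda] = pca_project(X, M)
    Xc = X - mean(X);                % center the data
    S = (Xc' * Xc) / (rows(X) - 1);  % covariance matrix
    [U, L] = eig(S);
    [lambda, idx] = sort(diag(L), 'descend');
    U = U(:, idx);                   % eigenvectors sorted by eigenvalue
    T = Xc * U(:, 1:M);              % projection onto the first M principal components
  end

  % usage: project 2-dimensional points onto their first principal component
  X = [0.1 0.2; 0.4 0.5; 0.8 0.7; 0.9 1.0];
  [T, U, lambda] = pca_project(X, 1);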

7.4.1

Applications of PCA

- Dimensionality reduction
- Data compression
- Extraction of features from pixel images
- Data visualization

An image compression example: 5000 gray-scale images with 32 × 32 = 1024 pixels each. Application of PCA with 100 principal components, i.e. projection onto a 100-dimensional subspace, then transformation of the compressed images back into the original space.

(Figures, from Andrew Ng's lecture "Machine Learning", ml-class.org: 100 of the gray-scale images, original and recovered images after PCA compression, and a portrait of Bill Clinton reconstructed from 36 principal components.)

Scalability
Would this work with 1-megapixel images as well? No! Why?

D = 10^6-dimensional space! 5000 images = 5000 data points in 10^6-dimensional space. N = 5000 data points define a 4999-dimensional hyperplane. Thus we need M ≤ N - 1 = 4999. Otherwise: underdetermined problem! Compression by a factor of 10^6/5000 = 200.

Back to Andrew Ng's example: D = 1024. 5000 images = 5000 data points in 1024-dimensional space, projected onto an M = 100-dimensional space. M = 100 ≤ 4999 = N - 1.

Structure of data can be conserved.

7.5

Estimators

Estimators & Properties
This chapter covers the estimation of unknown parameters. Most often a parameterized distribution is given, but with unknown true parameters. The goal is to estimate these parameters with the help of samples x (from the true distribution). We collect all parameters of interest in the variable θ. First we start with the definition of an estimator, followed by some easy examples, and come back to this later when we talk about maximum likelihood estimators.

An estimator T is used to infer the value of an unknown parameter θ in a statistical model. It is a function defined as

T: X → Θ,

where X is a sample space with elements x := {x_1, ..., x_n} ∈ X. Normally we will not be able to estimate the true parameter exactly, and so we have to define some properties that assure a certain quality of the estimates found with the help of T. The true parameter is unknown, and so we have to look for other reasonable criteria. For example, the expected value of the estimator should be the parameter to estimate. Desirable properties are:

- unbiasedness: E[T] = θ
- minimum variance: an unbiased estimator T* has minimum variance if var[T*] ≤ var[T] for all unbiased estimators T.


Sample Mean & Sample Variance
We can formulate the calculation of the sample mean and variance in terms of estimators. Let the x_j be samples from a distribution with mean μ and variance σ².

The function x̄: R^n → R,

x̄ = (1/n) Σ_{j=1}^{n} x_j,

is called the sample mean.

The function s²: R^n → R,

s² = (1/(n-1)) Σ_{j=1}^{n} (x_j - x̄)²,

is called the sample variance.

Example: Sample Mean & Sample Variance. Sampling from a Gaussian distribution with mean μ = 5 and variance σ² = 2. The black line is a plot of the true Gaussian, and the green line is a Gaussian whose mean and variance are calculated with x̄ and s², respectively.
(Figure: the true Gaussian N(5, 2) and the estimated Gaussians for 2, 20 and 1000 samples; the estimates are μ = 3.29, σ² = 3.92 for 2 samples, μ = 4.92, σ² = 2.28 for 20 samples, and μ = 5.05, σ² = 2.02 for 1000 samples.)

As expected, the estimation becomes better the more samples are used.

Unbiasedness of the Sample Mean
As mentioned before, there are some properties we want an estimator to satisfy. We are going to prove the unbiasedness and leave the proof of the minimum variance criterion as an exercise to the reader.

Proof: E[x̄] = (1/n) Σ_{j=1}^{n} E[x_j] = μ

Unbiasedness of the Sample Variance

Proof: We can rewrite s² as

s² = (1/(n-1)) Σ_{j=1}^{n} (x_j - μ)² - (n/(n-1)) (x̄ - μ)²,

then

E[s²] = (1/(n-1)) Σ_{j=1}^{n} E[(x_j - μ)²] - (n/(n-1)) E[(x̄ - μ)²]
      = (1/(n-1)) Σ_{j=1}^{n} var[x_j] - (n/(n-1)) var[x̄]
      = (n/(n-1)) σ² - (n/(n-1)) σ²/n = ((n-1)/(n-1)) σ² = σ².

Sample Mean & Sample Variance (variances)
We can not only calculate the expected value of estimators, but also their variance. It is an exercise to prove the following: the variance of the estimator x̄ is given by

var[x̄] = σ²/n,   var[s²] = 2σ⁴/(n-1).
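These statements can be checked empirically with a short Octave simulation (a sketch assuming Gaussian samples with μ = 5 and σ² = 2, as in the example above):

  mu = 5; sigma2 = 2; n = 20; runs = 10000;
  xbar = zeros(1, runs); s2 = zeros(1, runs);
  for r = 1:runs
    x = mu + sqrt(sigma2) * randn(1, n);   % n Gaussian samples
    xbar(r) = mean(x);                     % sample mean
    s2(r)   = var(x);                      % sample variance (1/(n-1) normalization)
  end
  mean(xbar)           % approx. mu      (unbiasedness of the sample mean)
  mean(s2)             % approx. sigma2  (unbiasedness of the sample variance)
  var(xbar)            % approx. sigma2 / n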

Expectations and Covariances
The expectation of some function f(x) under a probability distribution p(x) is given by

E[f] = ∫ p(x) f(x) dx

and the variance of f(x) is defined by

var[f] = E[(f(x) - E[f(x)])²] = E[f(x)²] - E[f(x)]².

For two random variables x and y, the covariance is defined by

cov[x, y] = E[(x - E[x])(y - E[y])] = E[xy] - E[x]E[y].

Covariance and Independence
Remember that for two independent variables x and y we have p(x, y) = p(x)·p(y). Thus

E[xy] = ∫∫ p(x, y) x y dx dy = ∫∫ p(x) p(y) x y dx dy = ∫ p(x) x dx · ∫ p(y) y dy = E[x]E[y],

and we get for independent variables cov[x, y] = 0.


7.6

Gaussian Distributions

Definition 7.14 A Gaussian distribution is fully specified by a D-dimensional mean vector μ and a D × D covariance matrix Σ with the density function

p(x; μ, Σ) = N(x | μ, Σ) = 1 / ((2π)^{D/2} |Σ|^{1/2}) · exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) ).

That μ is the mean and Σ the covariance matrix of the normal distribution has to be proven! If the variables x_1, ..., x_D are all independent, then Σ is diagonal! Why?

Examples: Mean Vector and Covariance Matrix

(Figures: surface plots of two-dimensional Gaussian densities. The mean-vector examples show shifted means such as μ = (0, 0), (3, 0) and (2, 2). The covariance-matrix examples show Σ = I, 0.5·I and 2·I, covariance matrices with off-diagonal entries 0.5 and 0.8, and the matrix (3 0.5; 0.5 1).)

Covariance Matrix Properties
The covariance matrix Σ is symmetric:
- Σ is invertible and Σ^{-1} is symmetric
- all eigenvalues are real
- all eigenvectors are orthogonal
- the eigenvectors point in the direction of the principal axes of the ellipsoid.

The covariance matrix Σ is positive definite, i.e. x^T Σ x > 0 for all x ∈ R^n \ {0}:
- all eigenvalues are positive
- Σ is invertible and Σ^{-1} is positive definite.

Diagonalization of the Covariance Matrix
Let u_1, ..., u_D be the eigenvectors of Σ. Then the transformation x → y with y_i = u_i^T (x - μ) makes all variables y_i pairwise independent with diagonal covariance matrix and zero mean.

Product of Gaussian Distributions
The product of two Gaussian distributions is given by

N(μ_a, Σ_a) · N(μ_b, Σ_b) = z_c · N(μ_c, Σ_c)

where

μ_c = Σ_c (Σ_a^{-1} μ_a + Σ_b^{-1} μ_b)   and   Σ_c = (Σ_a^{-1} + Σ_b^{-1})^{-1}.

Marginal Gaussian Distribution
Recall that, in general, the marginal distribution for a joint random variable p(x, y) is given by

p(x) = ∫ p(x, y) dy.

Given a joint Gaussian distribution

p(x, y) = N( (a, b), (A  C; C^T  B) ),

the marginal Gaussian distribution is simply given by

p(x) = N(a, A).

Conditional Gaussian Distribution
The conditional distribution, in general, is given by

p(x | y) = p(x, y) / p(y).

Given a joint distribution

p(x, y) = N( (a, b), (A  C; C^T  B) ),

the conditional Gaussian distribution is given by

p(y | x) = N( b + C A^{-1} (x - a), B - C A^{-1} C^T ).

Marginal & Conditional Gaussian Distribution


(Figure: contours of a joint Gaussian p(x_a, x_b) with the slice x_b = 0.7, and the resulting marginal p(x_a) and conditional p(x_a | x_b = 0.7).)


7.7

Maximum Likelihood
(Figure: several normal density curves p(x) drawn over data points on the x-axis.) Which one of these normal distributions maximizes the probability of independently observing the given data points?

Maximum Likelihood for Gaussian Distributions
Let x_1, ..., x_n be i.i.d. (independently and identically distributed) according to N(μ, σ²) and x := {x_1, ..., x_n}; then the joint density is:

L_x(μ, σ²) = p(x | μ, σ²) = Π_{j=1}^{n} p(x_j | μ, σ²) = Π_{j=1}^{n} (1/√(2πσ²)) exp( -(x_j - μ)²/(2σ²) )

The log likelihood function is given by

ln L_x(μ, σ²) = -(1/(2σ²)) Σ_{j=1}^{n} (x_j - μ)² - (n/2) ln σ² - (n/2) ln(2π)

Maximizing ln L_x(μ, σ²) with respect to μ, we obtain the maximum likelihood solution

μ_ML = (1/n) Σ_{j=1}^{n} x_j,

which we recognize as the sample mean. Maximizing ln L_x(μ, σ²) with respect to σ² leads to

σ²_ML = (1/n) Σ_{j=1}^{n} (x_j - μ_ML)²,

which is different from the sample variance and therefore biased.

The Likelihood Function
The maximum likelihood estimator is a mapping from samples to parameter values for which the likelihood function becomes a maximum. The formal definition of a likelihood function is: Let Θ be the parameter space and p the joint density w.r.t. θ; then the likelihood function L_x is defined as:

L_x: Θ → R_+,   L_x(θ) := p(x | θ),   x := {x_1, ..., x_n} ∈ X

The likelihood function is a function of the parameters θ, whereas the joint density is a function of x! The difference is that normally we have a probability distribution p_θ(x) with given parameters and we evaluate this function at various inputs x. We now assume that we do not know the parameters, but that we are given some samples x from the true underlying distribution, and our goal is to estimate these parameters. We do this by searching for parameter values that maximize the likelihood function (and so also maximize the probability density).

Maximum Likelihood
We call the estimator T a maximum likelihood estimator (ML estimator) if T: X → Θ with

L_x(T(x)) = sup_{θ ∈ Θ} L_x(θ),   x := {x_1, ..., x_n} ∈ X

In many cases it is possible to derive the likelihood function and set its derivative with respect to the parameters to zero. Sometimes it is also easier to maximize the so-called log likelihood l_x(θ) := ln L_x(θ).

Bernoulli Distribution
The outcome is either a success or a failure (e.g. coin flipping with heads = 1 and tails = 0):

p_θ(x = 1) = θ,   p_θ(x = 0) = 1 - θ

Bernoulli distribution: Bern_θ(x) = θ^x (1 - θ)^{1-x},   E[x] = θ,   var[x] = θ(1 - θ)

Example: ML for Bernoulli distributions
Let x_j, j ∈ N_n, be i.i.d. according to Bern_θ(x_j) with p(x_j = 1) = θ and p(x_j = 0) = 1 - θ. Then the joint probability is given by

p(x | θ) = θ^{Σ x_j} (1 - θ)^{n - Σ x_j},   x = (x_1, ..., x_n) ∈ {0, 1}^n

Solving the equation

∂/∂θ ln L_x(θ) = (1/θ) Σ_{j=1}^{n} x_j - (1/(1 - θ)) (n - Σ_{j=1}^{n} x_j) = 0

leads to

θ_ML = (1/n) Σ_{j=1}^{n} x_j


7.8

Linear Regression

Maximum Likelihood Linear Regression
Assumption: y_i = a^T f(x_i) + ε_i, where the ε_i are i.i.d. according to ε_i ~ N(0, σ²).
Figure 7.3: Example of sample points (x_i, y_i) drawn from a function a_1 sin x + a_2 cos x with added Gaussian noise ε_i.

p(ε_i; σ²) = (1/√(2πσ²)) exp( -ε_i²/(2σ²) )

This implies that

p(y_i | x_i; a, σ²) = (1/√(2πσ²)) exp( -(y_i - a^T f(x_i))²/(2σ²) )

ML Linear Regression | Likelihood Function
Given X (the design matrix, which contains all the x_i) and y (containing all the y_i):

p(y | X; a, σ²) = Π_{j=1}^{n} (1/√(2πσ²)) exp( -(y_j - a^T f(x_j))²/(2σ²) )

ln p(y | X; a, σ²) = -(1/(2σ²)) Σ_{j=1}^{n} (y_j - a^T f(x_j))² - (n/2) ln σ² - (n/2) ln(2π)

Note: maximizing ln p(y | X; a, σ²) w.r.t. a is the same as minimizing

(1/2) Σ_{i=1}^{n} (y_i - a^T f(x_i))²,

i.e. ML linear regression = least squares solution!

ML Linear Regression | Determining a_ML

∂/∂a ln p(y | X; a, σ²) = (1/σ²) Σ_{j=1}^{n} (y_j - a^T f(x_j)) f(x_j)

Setting this to zero:

0 = Σ_{j=1}^{n} y_j f(x_j) - Σ_{j=1}^{n} (a^T f(x_j)) f(x_j)
  = Σ_{j=1}^{n} y_j f(x_j) - Σ_{j=1}^{n} (f(x_j) f(x_j)^T) a
  = F^T y - F^T F a

with the matrix

F = ( f(x_1)^T ; f(x_2)^T ; ... ; f(x_n)^T ).

Remember: the matrix F is equal to the matrix M we know from Section 6.3 on least squares. We see that a_ML is given by

a_ML = (F^T F)^{-1} F^T y

Furthermore we notice that maximizing the likelihood (under the Gaussian noise assumption) is equivalent to solving least squares!
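A short Octave sketch of the ML solution a_ML = (F^T F)^{-1} F^T y, using the basis functions f(x) = (sin x, cos x)^T of Figure 7.3; the true coefficients and the noise level below are invented for illustration:

  a_true = [2; 1];                           % assumed example coefficients
  x = linspace(0, 2.5, 30)';
  y = a_true(1)*sin(x) + a_true(2)*cos(x) + 0.2*randn(size(x));   % noisy samples
  F = [sin(x) cos(x)];                       % design matrix, rows f(x_i)^T
  a_ml = (F' * F) \ (F' * y)                 % ML / least squares solution
  sigma2_ml = mean((y - F*a_ml).^2)          % ML estimate of the noise variance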

ML Linear Regression | Determining σ²
Approach:
- first determine the ML solution of the weights, denoted by a_ML
- subsequently use a_ML to find σ²_ML by

σ²_ML = (1/n) Σ_{i=1}^{n} (y_i - a_ML^T f(x_i))²

ML Linear Regression | Predictive Distribution
The probabilistic model we have now leads us to the predictive distribution. For a new prediction input value x*, the prediction output y* is distributed according to

p(y* | x*; a_ML, σ²_ML) = N( a_ML^T f(x*), σ²_ML )

Fitting a 9th order polynomial to samples of the function sin(2πx) with Gaussian noise. (Figure: the fitted polynomial for several different random data sets.)
Bayesian Inference
Towards a more Bayesian treatment: posterior ∝ likelihood × prior. We have to do the following steps:
- define a prior distribution over the parameters a as p(a)
- obtain the likelihood p(y | X, a)
- calculate the posterior p(a | X, y) ∝ p(y | X, a) p(a)

Example: Bayesian Inference


The likelihood function was given by

p(y | X, a; σ²) = Π_{i=1}^{n} N(y_i | a^T f(x_i), σ²)

For simplicity we assume a zero-mean isotropic Gaussian prior over a with parameter α:

p(a; α) = N(a | 0, α I)

The corresponding posterior distribution over a is then given by

p(a | y, X, α; σ²) = N(a | m_n, S_n)

where

m_n = (1/σ²) S_n F^T y   and   S_n^{-1} = (1/α) I + (1/σ²) F^T F

Example: Bayesian Inference
The log of the posterior distribution is given by the sum of the log likelihood and the log of the prior:

ln p(a | X, y) = -(1/(2σ²)) Σ_{i=1}^{n} (y_i - a^T f(x_i))² - (1/(2α)) a^T a + const

Maximizing the posterior (MAP) distribution w.r.t. a leads to

a_MAP = (λ I + F^T F)^{-1} F^T y   with   λ = σ²/α


Example: MAP
In this example we fit a straight line to data coming from y = 0.5x - 0.3 with N(0, 0.04) noise. We can directly plot the parameter space. With α = 0.5, the parameter prior is

(Figure: the Gaussian prior over the parameters (w_1, w_2).)

Example: MAP. Now we sequentially receive some data. (Figure: after each new data point, the likelihood in parameter space, the updated prior/posterior over (w_1, w_2), and the data space with lines drawn from the current posterior are shown.)

Reminder: Conditional Probabilities
Discrete variables: p(A) = Σ_B p(A, B)
Continuous variables: p(x, y) = ∫ p(x, a, y) da
Conditioning:

p(x | y) = p(x, y)/p(y) = ∫ p(x, a, y) da / p(y) = ∫ (p(x, a, y)/p(a, y)) · (p(a, y)/p(y)) da = ∫ p(x | a, y) p(a | y) da

Bayesian Linear Regression
In practice, we want to make predictions of y* for new values of x*. This requires evaluating the predictive distribution defined by

p(y* | x*, y, X, α; σ²) = ∫ p(y* | x*, y, X, a; σ²) p(a | y, X, α; σ²) da

The convolution is a Gaussian with

p(y* | x*, y, X, α; σ²) = N( m_n^T f(x*), σ_n²(x*) )

where

σ_n²(x*) = σ² + f(x*)^T S_n f(x*)
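With the quantities m_n, S_n and σ² from the sketch above, the predictive distribution can be evaluated pointwise (illustrative only):

  xs = 1.7;                                   % new input x*
  fs = [sin(xs); cos(xs)];                    % f(x*)
  y_mean = mn' * fs;                          % predictive mean  m_n^T f(x*)
  y_var  = sigma2 + fs' * Sn * fs;            % predictive variance sigma_n^2(x*)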

Example: Comparison between the ML and the Bayesian approach
Fitting a 9th order polynomial to samples of the function sin(2πx) with Gaussian noise. (Figure: for several random data sets, the fit obtained with the maximum likelihood approach (left) and with the Bayes approach (right).)
Final Comments
In a fully Bayesian setting we should introduce priors over both α and σ², but this is analytically intractable:

p(y* | y) = ∫∫∫ p(y* | a, σ²) p(a | y, α, σ²) p(α, σ²) da dα dσ²

Have a look at:
- Empirical Bayes: maximizing the marginal likelihood
- Laplace approximation: local Gaussian approximation of the posterior
- Expectation maximization (EM)


7.9

Exercises

Exercise 7.9 Calculate the probability distribution of the mean of n independent identically distributed discrete random variables X_1, ..., X_n with p(X_i = 0) = p(X_i = 1) = p(X_i = 2) = p(X_i = 3) = p(X_i = 4) = 1/5 for n = 1, 2, 3, 4.
Exercise 7.10 Prove the following identities for derivatives w.r.t. vectors:
a) ∂(a^T x)/∂x = a.
b) ∂(x^T A x)/∂x = (A + A^T) x.
Exercise 7.11 To complete the proof of Theorem 7.7, find a maximum of the variance u_{M+1}^T S u_{M+1} with respect to u_{M+1} under the constraints u_{M+1}^T u_{M+1} = 1 and u_{M+1}^T u_1 = u_{M+1}^T u_2 = ... = u_{M+1}^T u_M = 0.
Exercise 7.12 Apply PCA to the Lexmed data. The data file appraw1-15.m with the variables number 1 to 15 (variable number 16 removed) can be downloaded from the course website.
a) Determine the eigenvalues and eigenvectors for the raw data.
b) Normalize the data to the interval [0, 1] and repeat PCA.
c) Explain the differences.
d) Select the largest eigenvalues and give the transformation matrix for transforming the data into a lower dimensional space.
Exercise 7.13 Plot various two-dimensional normal distributions N(μ, Σ) and validate empirically the following propositions. You may use, for example, N((1, 1), (10 0; 0 1)) and N((1, 1), (1 0; 0 10)).

a) The sum of two normal distributions is not a normal distribution.
b) The maximum of two normal distributions is not a normal distribution.
c) The product of two normal distributions is a normal distribution.
Exercise 7.14 Show that cov[x, y] = E[xy] - E[x]E[y].
Exercise 7.15 Give an example of an estimator with zero variance.
Exercise 7.16 Show that
a) E[x̄] = μ.
b) var[x̄] = σ²/n.

c) for the sample variance it holds:

s² = (1/(n-1)) Σ_{j=1}^{n} (x_j - μ)² - (n/(n-1)) (x̄ - μ)².


Exercise 7.17 Give an example of an unbiased estimator for the mean with higher variance than the sample mean.
Exercise 7.18 Let U(a, b) be the uniform distribution over the interval [a; b] ⊂ R with a < b. Further, let x := (x_1, ..., x_n) ∈ R^n be ordered samples from an unknown U(a, b) s.t. x_1 ≤ ... ≤ x_n. The parameter space is denoted by Θ = {(a, b) ∈ R² | a < b}. Define a density function u_{a,b} of U(a, b) and the likelihood function. Determine a maximum likelihood estimator for (a, b).
Exercise 7.19 Show that the expectation of a variable x that is Gaussian distributed with N(μ, σ²) is μ, or in other words: E[x] = ∫ N(μ, σ²) x dx = μ. You can use the fact that a Gaussian is a probability distribution and therefore integrates to 1, and that for an odd function f it holds that ∫_{-a}^{a} f(x) dx = 0.
Exercise 7.20 Show that estimating the maximum posterior (MAP) with Gaussian likelihood and Gaussian prior (as in the lecture) with a_MAP = (λI + F^T F)^{-1} F^T y is equal to regularized least squares, which is the original least squares formulation plus a penalty term for large parameter values:

E(a) = (1/2) Σ_{i=1}^{n} (a^T f(x_i) - y_i)² + (λ/2) ||a||²

Hint: Calculate the derivative of E with respect to a and set it to zero. Exercise 7.21 Prove that the expected value is linear, i.e. that E [ax + b] = aE [x] + b for a) discrete variables. b) continuous variables.

Chapter 8 Function Approximation


8.1 Linear Regression Summary

We want to fit a function

f(x) = a_1 f_1(x) + ... + a_k f_k(x) = a^T f(x)

with k unknown parameters a_1, ..., a_k through the n data points (x_1, y_1), ..., (x_n, y_n). If we substitute all the points into the ansatz, requiring our function to hit all n points, i.e. f(x_i) = y_i, we get the linear system

a_1 f_1(x_1) + ... + a_k f_k(x_1) = y_1
...
a_1 f_1(x_n) + ... + a_k f_k(x_n) = y_n.

In matrix notation we get M a = y with M_ij = f_j(x_i).
If n > k, the system is overdetermined and normally has no solution. If n < k, the system is underdetermined and normally has infinitely many solutions. We examined different solutions for the linear regression problem:
Overdetermined case: least squares / pseudoinverse, maximum likelihood, Bayesian linear regression.

Underdetermined case: Pseudoinverse

Methods for solving M a = y, overdetermined case (design matrix X = (x_1, ..., x_n)):
- Least squares / pseudoinverse: minimize ||M a - y||², solution a = (M^T M)^{-1} M^T y
- Maximum likelihood: maximize p(X | a), solution a = (M^T M)^{-1} M^T y
- Bayesian linear regression (MAP = maximum posterior probability): maximize p(a | X), solution a = (λI + M^T M)^{-1} M^T y
- Regularized least squares: minimize ||M a - y||² + λ||a||², solution a = (λI + M^T M)^{-1} M^T y

Methods for solving M a = y, underdetermined case:
- minimize ||a||² under the constraint M a - y = 0; solution a = M^T (M M^T)^{-1} y
- compare (AI lecture) [?]: maximize the entropy of a probability distribution under given constraints

8.2 Radial Basis Function Networks

8.2.1 Introduction

Radial basis function networks (RBFs) are a form of supervised learning technique used to model or estimate an unknown function from a set of input-output pairs. The idea of RBFs was presented as a solution for non-linear classification problems. The theory of RBFs originated with Cover, whose theorem proved that a classification problem is more likely to be linearly separable in a high-dimensional space than in a low-dimensional space. Further discussion of Cover's theorem, accompanied by a detailed example, will be presented in the next section. Radial basis function networks are considered to be linear models with non-linear activation functions. Linear approximation models have been studied in statistics for about 200 years, and the theory is applicable to radial basis function networks (RBF), which are just one particular type of linear model. The idea of radial basis function networks is similar to that of multi-layer perceptron neural networks, with differences such as:
- a radial basis function as activation function, rather than a sigmoid function;
- a three-layer network with an input, one hidden and one output layer;
- no back propagation is involved in solving for the output weights.


There are two main applications for radial basis functions. The first is the solution of classification problems, which will be briefly mentioned in the next section in order to explain Cover's theorem. The other idea of interest, utilizing RBFs as a solution for an approximation problem (i.e. estimating a function that maps sets of input-output pairs), will be further discussed and detailed.

8.2.2

RBFs for function approximation

Now the focus is shifted to the form of RBFs used for function approximation, in other words answering the supervised learning problem, which can be stated as: given a set of input-output pairs, find the unknown function mapping them. To give a detailed idea of the subject, a brief introduction to supervised learning follows.

8.2.2.1 Supervised Learning

A problem in statistics with applications in many areas is to guess or to estimate a function from a sample of input-output pairs with little or no knowledge of the form of the function. So common is the problem that it has different names in different disciplines (e.g. nonparametric regression, function approximation, system identification, inductive learning). In machine learning terminology, the problem is called supervised learning. The function is learned from examples, which a teacher supplies. The set of examples, or training set, contains elements which consist of paired values of the independent (input) variable x and the dependent (output) variable y. Mathematically, given n patterns of a p-dimensional input vector x, the training set (pairs of inputs and outputs) is given as:

T = {(x_i, y_i)}_{i=1}^{n}   (8.1)

This training set reflects that the outputs y are corrupted by noise. In other words, the correct value for the input x_i, namely y_i, is unknown. The training set only specifies ŷ_i, which is y_i plus a small amount of noise:

ŷ_i = f(x_i) + ε   (8.2)

where ε is some form of Gaussian noise with zero mean and some covariance.

In real applications the independent variable values in the training set are often also affected by noise. This type of noise is more difficult to model and we shall not attempt it. In any case, taking account of noise in the inputs is approximately equivalent to assuming noiseless inputs but an increased amount of noise in the outputs.

8.2.2.2 Nonparametric Regression

In regression problems there are two approaches, the parametric and the nonparametric one. Parametric regression is a form of regression whereby the functional relation of the input-output pairs is assumed to be known, but may contain unknown parameters. This case is not of interest here, because it has the main disadvantage that the functional form should be known in advance. This prior knowledge is difficult to obtain, especially in the case of complicated and highly nonlinear systems. Therefore the focus is shifted to the nonparametric approach, where no prior knowledge of the functional mapping is required. Radial basis function networks are a form of nonparametric regression


that aims to find an underlying relation between inputs and outputs [24]. In other words, the goal of the radial basis function network is to fit the best values of some weights in order to minimize a certain error defined by an error function.

8.2.2.3 Linear Models

A linear model of a function f(x) takes the form:

f(x) = Σ_{j=1}^{m} a_j f_j(x)   (8.3)

The model f is expressed as a linear combination of a set of m fixed functions (often called basis functions, by analogy with the concept of a vector being composed of a linear combination of basis vectors). The aim of any network is to find the best possible weights a_j so as to minimize the sum of squared errors defined by the error function.

Activation Functions
Before going into the details of how to solve for the weights, we discuss the activation functions f_j. There are several types of activation functions used in neural networks, but the functions of interest here are the radial functions. Radial functions are a special class of functions: their characteristic feature is that their response decreases (or increases) monotonically with the distance from a central point.
- Gaussian, which is the most commonly used:

  f_j(x) = exp( -||x_i - c_j||² / (2σ²) )   (8.4)

  The Gaussian function decreases monotonically with the distance from the center, as shown in Figure 8.1.

- Multiquadric:

  f_j(||x - c||) = sqrt( ||x_i - c_j||² + b² )   (8.5)

- Inverse multiquadric:

  f_j(||x - c||) = 1 / sqrt( ||x_i - c_j||² + b² )   (8.6)

8.2.2.4

Radial Basis Function Networks

Radial functions are simply a class of functions. In principle, they could be employed in any sort of model (linear or nonlinear) and any sort of network (single-layer or multilayer). However, since Broomhead and Lowe [25], radial basis function networks (RBF networks) have traditionally been associated with three layers as follows (see Figure 8.2):
Input layer of dimensions representing the n patterns of the p-dimensional input vector x Hidden layer containing the activation radial functions (such as Gaussian) with number m

Figure 8.1: Three Gaussian functions with σ = 0.3, 0.5 and 0.8 centered at different inputs, and the net output; the weights were set to a1 = 0.4, a2 = 0.7 and a3 = 0.9.

Figure 8.2: Structure of an RBF network.


- Linear output unit

The unknowns in the case of a linear RBF model are the weights a_j that need to be solved for. In order to solve for the weights, the problem should be reformulated in a sum-of-squared-errors form.

Least Squares Problem
An error function of the weights should be defined, and then an optimization procedure is used to attain them. Let us consider the overall picture again: given a data set T = {(x_i, y_i)}_{i=1}^{n}, we have to estimate a function between these input and output pairs. From Figure 8.2 it can be seen that the output function is:

f(x) = Σ_{j=1}^{m} a_j f_j(x)   (8.7)


Then we define the error function as the sum of the squared errors between the real values y_i and the ones predicted by the RBF network:

E(y, f(x)) = (1/2) Σ_{i=1}^{n} (y_i - f(x_i))²

E(y, f(x)) = (1/2) Σ_{i=1}^{n} (y_i - Σ_{j=1}^{m} a_j f_j(x_i))²   (8.8)

The objective now is to find the best set of a_j that minimizes the error function E of equation (8.8). Mathematically formulated, the above idea can be described as follows:

(a_1, ..., a_m) = arg min_{(a_1, ..., a_j, ..., a_m)} E(y, f(x))   (8.9)

Several algorithms had been suggested for such an evaluation[6], and maybe the most common is the gradient descent algorithm. This algorithm might have some problems like convergence, getting stuck in a local minimum and so on. Therefore, it would be better if there was a way to represent the above equation in a matrix form, and then a single step to solve for the weights would be utilized[2]. For this formulation consider the following:
- Let y = (y_1, y_2, ..., y_n)^T represent the desired outputs.
- Let a = (a_1, a_2, ..., a_m)^T represent the weights that have to be determined.
- Let the matrix

  M = ( f_1(x_1)  f_2(x_1)  ...  f_m(x_1)
        f_1(x_2)  f_2(x_2)  ...  f_m(x_2)
        ...
        f_1(x_n)  f_2(x_n)  ...  f_m(x_n) )

  be the matrix of the RBFs operating at the input points.

Therefore the above system can be transformed into the following form:

M a = y   (8.10)

Solving for the weights in this formulation is straightforward and requires only the inversion of the matrix M. Assuming M is nonsingular and M^{-1} exists, the weights can be calculated as:

a = M^{-1} y   (8.11)

A special case of this solution is when the number of hidden layer units (i.e. Gaussian functions) is equal to the number of samples in the training set T. In other words, the matrix M is an n by n matrix, and its normal inverse exists if the matrix is non-singular. On the other hand, if this matrix is not square, which is the most general case, whereby the number of hidden units m is less than the number of training samples n, then M^{-1} cannot be obtained normally; rather, the pseudo-inverse has to be calculated. To do this there are different methods, some of which are:
- QR decomposition
- Singular Value Decomposition (SVD)


Concrete Example
Consider the following three points (1, 3), (2, 2.1), (3, 2.5) to be approximated by a function. The RBFs used are Gaussians centered at each input point. The objective of this example is to illustrate the effect of the choice of σ on the underlying function being approximated. It is clear from Figure 8.3 that a small σ causes overfitting, while the choice of a big σ causes very high and low overshoots. The latter case can be explained by the fact that choosing a high value for σ leads to very high positive and negative values of the weights fitting the required points, so that the approximated function can pass through all the points presented.

Figure 8.3: The effect of σ (Gaussian RBF fits with σ = 0.1, 0.5 and 8 through the three labeled points).
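The concrete example can be reproduced with the following Octave sketch (our own code); it places one Gaussian on each data point and solves M a = y exactly, so that varying sigma reproduces the effect shown in Figure 8.3:

  X = [1 2 3]';  Y = [3 2.1 2.5]';       % the three data points
  c = X;                                 % centers = input points
  sigma = 0.5;                           % width of the Gaussians
  M = exp(-(X - c').^2 / (2*sigma^2));   % M(i,j) = f_j(x_i), here 3 x 3
  a = M \ Y;                             % weights, equation (8.11)
  % evaluate the fitted function on a fine grid
  xg = linspace(-5, 10, 300)';
  yg = exp(-(xg - c').^2 / (2*sigma^2)) * a;
  plot(xg, yg, X, Y, 'o');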

8.2.3

Overfitting Problem

Consider that we have chosen the number of basis functions to be the same as the number of training examples in T, and moreover we have chosen the centers of the radial basis functions to be the input points. This leads to the so-called problem of overfitting.

Figure 8.4: Overfitting effect.

As is clear from Figure 8.4, the function which was supposed to be approximated is the one represented


by the dashed line, but due to this configuration of the RBF it rather tended to approximate the bold line, which is not the intended mapping. The network described in this example is a specific type of RBF used solely for interpolation. The problems of such a scheme are: 1. Poor performance on noisy data:
As already known, we do not usually want the network's outputs to pass through all the data points when the data is noisy, because that would yield a highly oscillatory function that does not provide good generalization.

2. Computationally inefficient:
The network requires one hidden unit (i.e. one basis function) for each training data pattern, and so for large data sets the network becomes very costly to evaluate. The matrix inversion cost is typically O(n^3) for n data points.

8.2.3.1 Improving RBFs

In order to improve RBF networks such that they do not perform solely exact interpolation, the following points can be taken into account:

1. The number m of basis functions (hidden units) should be less than n.
2. The centers of the basis functions do not need to be defined as the training data input vectors. They can instead be determined by a training algorithm.
3. The basis functions need not all have the same width parameter sigma. These can also be determined by a training algorithm.
4. We can introduce bias parameters into the linear sum of activations at the output layer. These compensate for the difference between the average value of the basis function activations over the data set and the corresponding average value of the targets.

The most general approach to overcome overfitting is to assume that the centers and the widths of the Gaussian functions are unknown and to apply a supervised learning algorithm to solve for all the variables. This approach also includes a regularization term, which leads to the so-called regularization network. The idea is that adding a regularization term based on the gradient of the function to be approximated yields a network that does not rely on interpolation alone, but on both interpolation and smoothing:

    E_new = E_normal + E_reg

    E_new = 1/2 sum_{i=1}^{n} ( y_i - sum_{j=1}^{m} a_j f_j(x_i) )^2  +  1/2 ||grad F||^2        (8.12)
This approach will not be discussed here; instead, a clustering algorithm for choosing the centers is presented. As mentioned above, the correct choice of the centers critically affects the performance of the network and of the approximated function. The upcoming section describes a specific clustering algorithm for the choice of the centers and the widths.

8.2.3.2 Autonomous determination of centers

The choice of the centers of the radial basis functions can be done using K-means clustering, which can be described as follows. The algorithm partitions the data points into K disjoint subsets (K is predefined). The clustering criteria are:

- the cluster centers are set in the high density regions of the data,
- a data point is assigned to the cluster whose center is nearest.

Mathematically this is equivalent to minimizing the sum of squares clustering function defined as

    E = sum_{j=1}^{K} sum_{n in S_j} ||x_n - c_j||^2    with    c_j = (1/N_j) sum_{n in S_j} x_n        (8.13)

where S_j is the j-th cluster with N_j points. After the centers have been determined, the values of sigma can be set according to the diameters of the clusters previously attained. For further information about K-means clustering please refer to [?].


8.3

Clustering

If we search in a search engine for the term "mars", we will get results like the planet Mars and the chocolate, confectionery and beverage conglomerate, which are semantically quite different. In the set of discovered documents there are two noticeably different clusters. Google, for example, still lists the results in an unstructured way. It would be better if the search engine separated the clusters and presented them to the user accordingly, because the user is usually interested in only one of the clusters. The distinction of clustering in contrast to supervised learning is that the training data are unlabeled. Thus the pre-structuring of the data by the supervisor is missing. Rather, finding structures is the whole point of clustering. In the space of training data, accumulations of data such as those in Figure 8.5 are to be found. In a cluster, the distance of neighboring points is typically smaller than the distance between points of different clusters. Therefore the choice of a suitable distance metric for points, that is, for objects to be grouped and for clusters, is of fundamental importance. As before, we assume in the following that every data object is described by a vector of numerical attributes.

Figure 8.5: Simple two-dimensional example with four clearly separated clusters.

8.3.1

Distance Metrics

Depending on the application, various distance metrics are defined for the distance d between two vectors x and y in R^n. The most common is the Euclidean distance

    d_e(x, y) = sqrt( sum_{i=1}^{n} (x_i - y_i)^2 ).

Somewhat simpler is the sum of squared distances

    d_q(x, y) = sum_{i=1}^{n} (x_i - y_i)^2,

which, for algorithms in which only distances are compared, is equivalent to the Euclidean distance. Also used are the aforementioned Manhattan distance

    d_m(x, y) = sum_{i=1}^{n} |x_i - y_i|

as well as the distance of the maximum component

    d_inf(x, y) = max_{i=1,...,n} |x_i - y_i|,

which is based on the maximum norm. During text classification, the normalized projection of the two vectors onto each other, that is, the normalized scalar product

    (x . y) / (|x| |y|),

is frequently calculated, where |x| is the Euclidean norm of x. Because this formula is a measure of the similarity of the two vectors, as a distance metric the inverse

    d_s(x, y) = (|x| |y|) / (x . y)

can be used, or > and < can be swapped in all comparisons. In the search for a text, the attributes x_1, ..., x_n are calculated similarly to naive Bayes as components of the vector x as follows. For a dictionary with 50,000 words, the value x_i equals the frequency of the i-th dictionary word in the text. Since normally almost all components of such a vector are zero, nearly all terms of the summation are zero during the calculation of the scalar product. By exploiting this, the implementation can be sped up significantly.

8.3.2

k-Means and the EM Algorithm

Whenever the number of clusters is known in advance, the k-means algorithm can be used. As its name suggests, k clusters are defined by their average value. First the k cluster midpoints mu_1, ..., mu_k are randomly or manually initialized. Then the following two steps are repeatedly carried out:

- classification of all data points to their nearest cluster midpoint,
- recomputation of the cluster midpoints.

The following scheme results as an algorithm:

    k-means(x_1, ..., x_n, k)
      initialize mu_1, ..., mu_k (e.g. randomly)
      Repeat
        classify x_1, ..., x_n each to its nearest mu_i
        recalculate mu_1, ..., mu_k
      Until no change in mu_1, ..., mu_k
      Return(mu_1, ..., mu_k)

The calculation of the cluster midpoint mu for points x_1, ..., x_l is done by

    mu = (1/l) sum_{i=1}^{l} x_i.
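A minimal Octave sketch of this scheme; the artificial data, the value of k and the cap on the number of iterations are illustrative assumptions (and for simplicity we assume no cluster ever becomes empty):

    X  = [randn(15,2); randn(15,2) + 5];   % 30 points forming two artificial clusters
    k  = 2;
    mu = X(randperm(size(X,1), k), :);     % initialize mu_1,...,mu_k with random data points
    for t = 1:100                          % usually converges after very few iterations
      D = zeros(size(X,1), k);
      for j = 1:k                          % squared distances of every point to every center
        D(:,j) = sum((X - mu(j,:)).^2, 2);
      end
      [~, idx] = min(D, [], 2);            % classify every x_i to its nearest mu_j
      mu_new = zeros(k, size(X,2));
      for j = 1:k                          % recompute the cluster midpoints
        mu_new(j,:) = mean(X(idx == j, :), 1);
      end
      if isequal(mu_new, mu), break; end   % until no change in mu_1,...,mu_k
      mu = mu_new;
    end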


The execution on an example is shown in Figure 8.6 for the case of two classes. We see how the class centers, which were first chosen at random, stabilize after three iterations. While this algorithm does not guarantee convergence, it usually converges very quickly. This means that the number of iteration steps is typically much smaller than the number of data points. Its complexity is O(ndkt), where n is the total number of points, d the dimensionality of the feature space, and t the number of iteration steps.

Figure 8.6: k-means with two classes (k = 2) applied to 30 data points. Far left is the data set with the initial centers (random initialization), and to the right the clusters after each iteration (t = 1, 2, 3). After three iterations convergence is reached.

In many cases, the necessity of giving the number of classes in advance poses an inconvenient limitation. Therefore we will next introduce an algorithm which is more flexible. Before that, however, we mention the EM algorithm, which is a continuous variant of k-means: it does not make a firm assignment of the data to classes but rather, for each point, returns the probability of it belonging to the various classes. Here we must assume that the type of probability distribution is known. Often the normal distribution is used. The task of the EM algorithm is to determine the parameters (mean mu_i and covariance matrix Sigma_i of the k multidimensional normal distributions) for each cluster. Similarly to k-means, the following two steps are repeatedly executed:

- Expectation: For each data point the probability P(C_j | x_i) that it belongs to each cluster is calculated.
- Maximization: Using the newly calculated probabilities, the parameters of the distribution are recalculated.

Thereby a softer clustering is achieved, which in many cases leads to better results. This alternation between expectation and maximization gives the algorithm its name. In addition to clustering, the EM algorithm is used, for example, to learn Bayesian networks [?].

8.3.3

Hierarchical Clustering

In hierarchical clustering we begin with n clusters consisting of one point each. Then the nearest neighbor clusters are combined until all points have been combined into a single cluster, or until a termination criterion has been reached. We obtain the scheme


    HierarchicalClustering(x_1, ..., x_n)
      initialize C_1 = {x_1}, ..., C_n = {x_n}
      Repeat
        Find two clusters C_i and C_j with the smallest distance
        Combine C_i and C_j
      Until Termination condition reached
      Return(tree with clusters)

The termination condition could be chosen as, for example, a desired number of clusters or a maximum distance between clusters. In Figure 8.7 this algorithm is represented schematically as a binary tree, in which from bottom to top in each step, that is, at each level, two subtrees are connected. At the top level all points are unified into one large cluster.

Figure 8.7: In hierarchical clustering, the two clusters with the smallest distance are combined in each step; the vertical axis shows the level, i.e. the cluster distance.

It is so far unclear how the distances between the clusters are calculated. Indeed, in the previous section we defined various distance metrics for points, but these cannot be used on clusters. A convenient and often used metric is the distance between the two closest points in the two clusters C_i and C_j:

    d_min(C_i, C_j) = min_{x in C_i, y in C_j} d(x, y).

Thus we obtain the nearest neighbor algorithm, whose application is shown in Figure 8.8. We see that this algorithm generates a minimum spanning tree.1 The example furthermore shows that the two described algorithms generate quite different clusters. This tells us that for graphs with clusters which are not clearly separated, the result depends heavily on the algorithm or the chosen distance metric. For an efficient implementation of this algorithm, we first create an adjacency matrix in which the distances between all points are saved, which requires O(n^2) time and memory. If the number of clusters does not have an upper limit, the loop will iterate n - 1 times and the asymptotic computation time becomes O(n^3). To calculate the distance between two clusters, we can also use the distance between the two farthest points

    d_max(C_i, C_j) = max_{x in C_i, y in C_j} d(x, y)
1

A minimum spanning tree is an acyclic, undirected graph with the minimum sum of edge lengths.



and obtain the farthest neighbor algorithm. Alternatively, the distance of the cluster midpoints

    d_mu(C_i, C_j) = d(mu_i, mu_j)

is used. Besides the clustering algorithms presented here, there are many others, for which we direct the reader to [?] for further study.

Figure 8.8: The nearest neighbor algorithm applied to the data from Figure 8.6 at different levels with 12, 6, 3 and 1 clusters (panels for d_min about 1.2, 1.6, 2, and the minimum spanning tree).
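A compact Octave sketch of the nearest neighbor (agglomerative) scheme above; the artificial data and the termination criterion (stop at 3 clusters) are illustrative assumptions, and the quadratic pairwise search mirrors the O(n^3) bound discussed above rather than an optimized implementation:

    X = [randn(10,2); randn(10,2)+4; randn(10,2)-4];
    clusters = num2cell(1:size(X,1));            % C_1 = {x_1}, ..., C_n = {x_n}
    while numel(clusters) > 3
      bi = 1; bj = 2; bestd = inf;
      for i = 1:numel(clusters)-1
        for j = i+1:numel(clusters)
          A = X(clusters{i},:);  B = X(clusters{j},:);
          d = inf;                               % d_min(C_i, C_j): closest pair of points
          for p = 1:size(A,1)
            d = min(d, min(sum((B - A(p,:)).^2, 2)));
          end
          if d < bestd, bestd = d; bi = i; bj = j; end
        end
      end
      clusters{bi} = [clusters{bi} clusters{bj}];   % combine C_i and C_j
      clusters(bj) = [];
    end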

8.4

Singular Value Decomposition and the Pseudo-Inverse

In Theorem 6.8 we have seen that for the computation of the pseudoinverse of an overdetermined matrix M the square matrix M^T M must be invertible. Analogously, due to Equation 6.25, for an underdetermined matrix M the square matrix M M^T has to be invertible. In both cases, the resulting square matrix is invertible if the matrix M has full rank. We will now present an even more general method for determining a pseudoinverse even if M does not have full rank.

Reminder: Linear Algebra

Recommended preparation: Gilbert Strang Video Lectures
- Lecture 21: Eigenvalues and eigenvectors
- Lecture 25: Symmetric matrices and positive definiteness

on http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010

Definition 8.1 Two vectors x_i, x_j are called orthonormal if

    x_i^T x_j = delta_ij = 1 if i = j, 0 else.

A matrix A is called orthogonal, if its columns are orthonormal.

Some basic facts:
- For any orthogonal matrix A we have A^T A = I.

- No eigenvalues of an invertible n x n matrix are zero.


- If all eigenvalues of an n x n matrix are pairwise different, then the eigenvectors are linearly independent.
- A symmetric matrix has only real eigenvalues.
- The eigenvectors of a symmetric matrix are orthogonal. They can be chosen to be orthonormal.

Diagonalization of symmetric matrices

Eigenvalue equations: A x_1 = lambda_1 x_1, ..., A x_n = lambda_n x_n. Combining all n equations yields

    A (x_1, ..., x_n) = (lambda_1 x_1, ..., lambda_n x_n) = (x_1, ..., x_n) diag(lambda_1, ..., lambda_n).

With Q = (x_1, ..., x_n) and Lambda = diag(lambda_1, ..., lambda_n) we get

    A Q = Q Lambda   and   A = Q Lambda Q^T.

Theorem 8.1 (Spectral theorem) Every symmetric matrix A in R^(n x n) has the factorization A = Q Lambda Q^T. The columns of Q are the eigenvectors. The eigenvectors are orthogonal. Lambda is diagonal with the eigenvalues as elements.

Singular Value Decomposition

Gilbert Strang writes in [1]: "I give you my opinion directly. The SVD is the climax of this linear algebra course. I think of it as the final step in the Fundamental Theorem. First come the dimensions of the four subspaces. Then their orthogonality. Then the orthonormal bases which diagonalize A. It is all in the formula M = U Sigma V^T. You have made it to the top."

Now let M in R^(m x n) not have full rank. Then M^T M is symmetric, but not invertible.

Eigenvalue equation:

    M^T M v_i = sigma_i^2 v_i
    v_i^T M^T M v_i = ||M v_i||^2 = sigma_i^2 v_i^T v_i = sigma_i^2 >= 0.

Thus M^T M is positive semidefinite and ||M v_i|| = sigma_i >= 0. Now

    M M^T (M v_i) = sigma_i^2 (M v_i)

shows that M M^T has the same eigenvalues sigma_i^2, with the unit eigenvectors

    u_i = M v_i / sigma_i.

This leads to (r = rank of M)

    M v_1 = sigma_1 u_1, ..., M v_r = sigma_r u_r   and   M (v_1 ... v_r) = (u_1 ... u_r) diag(sigma_1, ..., sigma_r).

Adding orthonormal vectors v_i from the nullspace of M and orthonormal vectors u_i from the nullspace of M^T gives

    M (v_1 ... v_r ... v_n) = (u_1 ... u_r ... u_m) Sigma,

where Sigma in R^(m x n) carries sigma_1, ..., sigma_r on its diagonal and zeros elsewhere. The dimensions of these matrices are (m x n)(n x n) = (m x m)(m x n). Written in matrix notation, we get M V = U Sigma with the orthogonal matrices V and U, and

    M = U Sigma V^T = u_1 sigma_1 v_1^T + ... + u_r sigma_r v_r^T.

The pseudoinverse of M can now easily be computed by

    M^+ = V Sigma^+ U^T = v_1 (1/sigma_1) u_1^T + ... + v_r (1/sigma_r) u_r^T        (8.14)

with the n x m matrix Sigma^+ that carries 1/sigma_1, ..., 1/sigma_r on its diagonal and zeros elsewhere.

Summary

The simplest way to compute the SVD is:
- U in R^(m x m): eigenvector matrix of M M^T.
- Sigma in R^(m x n): the sigma_i are the positive square roots of the eigenvalues of either M M^T or M^T M.
- V in R^(n x n): eigenvector matrix of M^T M.

Substitute U, V and Sigma in equation 8.14 to get M^+.

Regularized Version of SVD

After applying SVD we get M^+ = V Sigma^+ U^T. To solve M a = y for a we approximate a by a_hat = M^+ y.

With regularization term: choose a parameter lambda > 0 and solve

    a_hat = (lambda I + M^+ M)^(-1) M^+ y.

Example

Find the SVD decomposition of the matrix

    M = ( 3  2   2 )
        ( 2  3  -2 ).

    M M^T = ( 3  2   2 ) ( 3   2 )   =   ( 17   8 )
            ( 2  3  -2 ) ( 2   3 )       (  8  17 )        (8.15)
                         ( 2  -2 )

The characteristic polynomial is the determinant |M M^T - lambda I|. Thus we first have to calculate M M^T - lambda I:

    M M^T - lambda I = ( 17-lambda    8       )
                       (  8         17-lambda )        (8.16)

The determinant is

    |M M^T - lambda I| = lambda^2 - 34 lambda + 225 = (lambda - 25)(lambda - 9)        (8.17)

The eigenvalues of M M^T are lambda_1 = 25 and lambda_2 = 9. This means in Sigma we have sigma_1 = sqrt(25) = 5 and sigma_2 = sqrt(9) = 3. To obtain the eigenvector of M M^T for lambda_1 = 25 we solve (M M^T - lambda_1 I) u_1 = 0:

    (M M^T - lambda_1 I) u_1 = ( -8   8 ) u_1 = 0        (8.18)
                               (  8  -8 )

An obvious eigenvector of this matrix is (1 1)^T. Normalizing this vector we obtain u_1 = (1/sqrt(2), 1/sqrt(2))^T. For the second eigenvalue lambda_2 = 9 we proceed in the same way and find that u_2 = (1/sqrt(2), -1/sqrt(2))^T is the second eigenvector of M M^T. So far we have found the matrices U and Sigma of equation 8.14. To solve for V we use M^T M. The eigenvalues of M^T M are 25, 9 and 0, and since M^T M is symmetric we know that the eigenvectors will be orthogonal. For lambda = 25 we have

    M^T M - 25 I = ( -12   12    2 )
                   (  12  -12   -2 )
                   (   2   -2  -17 )        (8.19)

which row-reduces to

    ( 1  -1  0 )
    ( 0   0  1 )
    ( 0   0  0 ).

An eigenvector is v_1 = (1/sqrt(2), 1/sqrt(2), 0)^T. For lambda = 9 we have

    M^T M - 9 I = (  4   12    2 )
                  ( 12    4   -2 )
                  (  2   -2   -1 )        (8.20)

which row-reduces to

    ( 1  0  -1/4 )
    ( 0  1   1/4 )
    ( 0  0    0  ).

An eigenvector is v_2 = (1/sqrt(18), -1/sqrt(18), 4/sqrt(18))^T. For the last eigenvector lambda_3 = 0 we can find a unit vector perpendicular to v_1 and v_2, or solve (M^T M - lambda_3 I) v_3 = 0; we obtain v_3 = (2/3, -2/3, -1/3)^T. So the full SVD of our matrix M can now be written as

    M = U Sigma V^T = ( 1/sqrt(2)   1/sqrt(2) ) ( 5  0  0 ) ( 1/sqrt(2)    1/sqrt(2)   0         )
                      ( 1/sqrt(2)  -1/sqrt(2) ) ( 0  3  0 ) ( 1/sqrt(18)  -1/sqrt(18)  4/sqrt(18))
                                                            ( 2/3         -2/3        -1/3      ).

The pseudoinverse of M is

    M^+ = V Sigma^+ U^T = ( 1/sqrt(2)   1/sqrt(18)   2/3 ) ( 1/5   0  ) ( 1/sqrt(2)   1/sqrt(2) )
                          ( 1/sqrt(2)  -1/sqrt(18)  -2/3 ) (  0   1/3 ) ( 1/sqrt(2)  -1/sqrt(2) ).
                          ( 0           4/sqrt(18)  -1/3 ) (  0    0  )
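An Octave check of this worked example; the sign of the lower-right entry of M is taken as -2, as in the reconstruction above:

    M = [3 2 2; 2 3 -2];
    [U, S, V] = svd(M);      % M = U*S*V'
    disp(diag(S)');          % singular values: 5 3
    Mplus = pinv(M);         % same result as V*pinv(S)*U'
    disp(Mplus);
    disp(M * Mplus);         % the 2x2 identity, since M has full row rank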

Linear Regression

Linear regression: estimate parameters for f(x) = a_1 f_1(x) + ... + a_k f_k(x) = a^T f(x).
Constraints: f(x_i) = a^T f(x_i) = y_i, i.e. M a = y with M_ij = f_j(x_i).

Overdetermined! No exact solution. Minimize E = ||M a - y||^2.
The error E on the data must become a minimum: grad_a E = 0.
Solution: a = (M^T M)^(-1) M^T y.

Nonlinear Regression

The error E on the data must become a minimum: grad_a E = 0. But grad_a E = 0 is nonlinear! Solution: gradient descent. Adjust a in the direction of steepest descent!


8.5

Exercises
Exercise 8.1 Given the matrix

    M = ( 8  2  2 )
        ( 2  4  1 ),

a) Perform the SVD decomposition and write M in the form M = U Sigma V^T.
b) Compute the pseudoinverse M^+ of M.
c) Show that M^+ is a valid (Moore-Penrose) pseudoinverse.
d) Show that the pseudoinverse of M, using the technique for the underdetermined system mentioned in Section 6.3.8, is the same as the one computed by SVD.

Exercise 8.2 Given the following matrix M,

    M = ( 3  6 )
        ( 2  4 )
        ( 2  4 ),

a) Show that the technique for the overdetermined system mentioned in Section 6.3.7 is not applicable for computing the pseudoinverse of M.
b) Perform the SVD decomposition and write M in the form M = U Sigma V^T.
c) Compute the pseudoinverse M^+ of M.
d) Show that M^+ is a valid pseudoinverse.

Exercise 8.3 Prove:
a) M^+ = V Sigma^+ U^T is a Moore-Penrose pseudoinverse of M.
b) Sigma^+ is the pseudoinverse of Sigma, i.e. that Sigma^+ = (Sigma^T Sigma)^(-1) Sigma^T.

Exercise 8.4 Repeat your function approximation experiments from exercise ?? using SVD. Report about your results.

Chapter 9 Numerical Integration and Solution of Ordinary Differential Equations


9.1 Numerical Integration

Numerical integration is very important in applications, but analytical (symbolic) integration is always preferable, if possible.

The Trapezoidal Rule

[Figure: a curve y = f(x) over an equidistant partition a = x_0, x_1, ..., x_{i-1}, x_i, ..., x_n = b with step size h.]

Equidistant partition of [a, b] by x_0 = a, x_1 = a + h, x_2 = a + 2h, ..., x_n = a + nh = b.
Step size: h = (b - a)/n.
Approximation:

    Integral from x_{i-1} to x_i of f(x) dx  ~  area of a trapezoid  =  h (f(x_{i-1}) + f(x_i)) / 2

Theorem 9.1 (Trapezoidal Rule) Let f : [a, b] -> R be twice continuously differentiable. Then it holds that

    Integral from a to b of f(x) dx = h ( f(x_0)/2 + f(x_1) + ... + f(x_{n-1}) + f(x_n)/2 ) + DT(h),

where the bracketed sum is the trapezoidal sum T(h) and the error DT(h) satisfies

    |DT(h)| <= (b - a) h^2 / 12 * max_{x in [a,b]} |f''(x)|.
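A short Octave sketch of the composite trapezoidal rule; the test integrand exp(x) on [0, 1] (exact value e - 1) and n = 100 are illustrative choices:

    f = @(x) exp(x);
    a = 0;  b = 1;  n = 100;
    h = (b - a) / n;
    x = a + (0:n) * h;                              % x_0, ..., x_n
    y = f(x);
    T = h * (y(1)/2 + sum(y(2:end-1)) + y(end)/2);  % trapezoidal sum T(h)
    err = abs(T - (e - 1))                          % observed error, of order h^2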


Proof: From Theorem 6.2 we know that the approximation error for polynomial interpolation of the function f on the n+1 points x_0, ..., x_n by a polynomial p of degree n is given by

    f(x) - p(x) = f^(n+1)(z) / (n+1)! * (x - x_0)(x - x_1)...(x - x_n)

for a point z in [a, b]. For linear interpolation of f with the two points x_{i-1}, x_i this yields

    f(x) = p(x) + f''(z_i)/2 * (x - x_{i-1})(x - x_i)

for z_i in [x_{i-1}, x_i]. Applying this to the error of the trapezoidal rule on one sub-interval [x_{i-1}, x_i] only, we get

    D_i = T_i(h) - Integral_{x_{i-1}}^{x_i} f(x) dx
        = T_i(h) - Integral_{x_{i-1}}^{x_i} p(x) dx - f''(z_i)/2 Integral_{x_{i-1}}^{x_i} (x - x_{i-1})(x - x_i) dx
        = - f''(z_i)/2 Integral_{x_{i-1}}^{x_i} (x - x_{i-1})(x - x_i) dx.

Substituting x = x_{i-1} + ht we evaluate

    Integral_{x_{i-1}}^{x_i} (x - x_{i-1})(x - x_i) dx = h^3 Integral_0^1 t(t - 1) dt = -h^3/6

and get

    D_i = f''(z_i) h^3 / 12.

For the trapezoidal rule on the whole interval [a, b] we get

    |DT(h)| = | sum_{i=1}^n D_i | <= sum_{i=1}^n |D_i| = sum_{i=1}^n |f''(z_i)| h^3 / 12
            <= n h^3 / 12 * max_{x in [a,b]} |f''(x)| = (b - a) h^2 / 12 * max_{x in [a,b]} |f''(x)|

and the proof is complete.

Richardson Extrapolation

Note: Halving of h (2h -> h) doubles the computational effort (2n function evaluations). The error is reduced by a factor of 4: DT(2h) ~ 4 DT(h).



y n=2 n=4

2h h a b x

T (h)
a b

f (x)dx + ch2 =
a

f (x)dx + T (h)
b

T (2h)
a

f (x)dx + 4ch =
a

f (x)dx + 4T (h) 1 (T (2h) T (h)) 3 1 (T (2h) T (h)) 3

T (2h) T (h) 3T (h)


b

T (h)

f (x)dx = T (h) T (h) T (h)


b a

4 1 f (x)dx T (h) T (2h) 3 3

This formula gives a better approximation than T(h) and is called Richardson Extrapolation. Repeated Richardson Extrapolation We can generalize the Richardson Extrapolation to any calculation where we know the asymptotic behaviour of some function F to be calculated for h 0 as F (h) = a0 + a1 hp + O(hr ), where a0 = F (0) is the desired value, a1 is unknown and p < r. Suppose we know F for h and qh: F (h) = a0 + a1 hp + O(hr ), F (qh) = a0 + a1 (qh)p + O(hr ), Solving for a0 yields F (0) = a0 = F (h) + F (h) F (qh) + O(hr ) p q 1

This formula leads to a reduction of the error from O(hp ) to O(hr ).


Theorem 9.2 If we know the complete expansion of F as

    F(h) = a_0 + a_1 h^{p_1} + a_2 h^{p_2} + a_3 h^{p_3} + ...,

we recursively compute F_1(h) = F(h) and

    F_{k+1}(h) = F_k(h) + (F_k(h) - F_k(qh)) / (q^{p_k} - 1).

Then F_n(h) = a_0 + a_n^(n) h^{p_n} + a_{n+1}^(n) h^{p_{n+1}} + ....

An inductive proof can be found e.g. in [27].

The Romberg Method

It can be shown [27] that for the trapezoidal rule we have

    T(h) = Integral_a^b f(x) dx + a_1 h^2 + a_2 h^4 + a_3 h^6 + ...

We apply repeated Richardson extrapolation with q = 2:

    T_1(h) = T(h),    T_{k+1}(h) = T_k(h) + D_k / (2^{2k} - 1)    with    D_k = T_k(h) - T_k(2h).

Example 9.1 We want to approximate

    Integral from 0 to 0.8 of sin(x)/x dx

and get

    h     T1(h)      D1/3       T2(h)        D2/15         T3(h)          D3/63        T4(h)
    0.8   0.758678
    0.4   0.768757   0.003360   0.77211714
    0.2   0.771262   0.000835   0.77209711   -0.00000133   0.772095771
    0.1   0.771887   0.000208   0.77209587   -0.00000008   0.7720957853   2.26e-10     0.772095785485

The exact solution is Integral_0^0.8 sin(x)/x dx ~ 0.7720957854820. We see that T4(0.1) is a much better approximation than T1(0.1).
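An Octave sketch of the Romberg scheme applied to this integral; the continuous continuation sin(x)/x -> 1 for x -> 0 is handled explicitly, and four levels (h = 0.8, 0.4, 0.2, 0.1) are used as in the table above:

    f = @(x) (x == 0) + (x ~= 0) .* sin(x + (x == 0)) ./ (x + (x == 0));  % sin(x)/x with f(0) = 1
    a = 0;  b = 0.8;  m = 4;
    R = zeros(m);
    for i = 1:m
      n = 2^(i-1);  h = (b - a)/n;  x = a + (0:n)*h;  y = f(x);
      R(i,1) = h*(y(1)/2 + sum(y(2:end-1)) + y(end)/2);       % T_1(h_i): trapezoidal rule
    end
    for k = 2:m                       % T_{k+1}(h) = T_k(h) + (T_k(h) - T_k(2h)) / (2^(2k) - 1)
      for i = k:m
        R(i,k) = R(i,k-1) + (R(i,k-1) - R(i-1,k-1)) / (2^(2*(k-1)) - 1);
      end
    end
    disp(R(m,m));                     % about 0.772095785, the T4(0.1) value of the table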

Alternative Methods

We briefly sketch two alternative methods for approximating definite integrals. They are examples of the so-called Monte Carlo methods (they work with random numbers). For many complex applications, e.g. modeling by differential equations is either not possible or too computationally intensive. A solution is the direct simulation of each process using a stochastic model. Such models are used in the areas:


- Statistical Physics (many-particle physics)
- Hydrodynamics
- Meteorology
- Road traffic
- Waiting queue systems

We give two simple examples of randomized methods for approximating integrals.

Method 1: Calculating the area under a curve (see Figure 9.1). Random points are shot ("cannon") uniformly into a rectangle of width B and height H enclosing the graph of f:

    Integral_a^b f(x) dx  ~  B H * (number of hits under the curve) / (number of hits inside the rectangle)

Figure 9.1: Area calculation using the Monte Carlo method.

Method 2: Following the mean value theorem of integration it holds that

    Integral_a^b f(x) dx = (b - a) M,        (9.1)

where M is the mean of f in the interval [a, b]. Now we discretize the interval with the given points x_1, ..., x_n and calculate the mean of f on the given points according to

    A = (1/n) sum_{i=1}^n f(x_i).

Due to the definition of the Riemann integral, A ~ M holds only for a fine discretization. Therewith M in (9.1) can be replaced by A, yielding

    Integral_a^b f(x) dx ~ (b - a)/n * sum_{i=1}^n f(x_i).

The points x_i should be chosen randomly (why?). For one-dimensional integrals both presented methods are clearly inferior to the trapezoidal rule. However, in higher dimensions their advantages show up in the form of much shorter computing times.
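An Octave sketch of both Monte Carlo variants; the integrand x^2 on [0, 1] (exact value 1/3), the rectangle height H and the number of samples N are illustrative assumptions:

    f = @(x) x.^2;  a = 0;  b = 1;  H = 1;  N = 1e5;   % H: height of the enclosing rectangle
    % Method 1: hit-or-miss estimate
    x  = a + (b - a) * rand(N, 1);
    y  = H * rand(N, 1);
    I1 = (b - a) * H * sum(y <= f(x)) / N;
    % Method 2: mean of function values at random points
    x  = a + (b - a) * rand(N, 1);
    I2 = (b - a) * mean(f(x));
    disp([I1 I2]);          % both approach 1/3 for large N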


9.2

Numerical Differentiation
First Derivative

Goal: compute f'(a) numerically at some point x = a.
Idea: approximate the derivative by a finite difference quotient (see Figure 9.2):

    f'(x) = lim_{h -> 0} (f(x + h) - f(x)) / h  ~  (f(x + h) - f(x)) / h    for h != 0.

Figure 9.2: Central difference (symmetric interval [a - h, a + h] versus asymmetric interval).

First Derivative: Approximation Error

How does the approximation error depend on h? Taylor expansion of f in x_0 = a:

    f(a + h) = f(a) + f'(a) h + f''(a) h^2 / 2! + f'''(a) h^3 / 3! + ...

Division by h gives

    (f(a + h) - f(a)) / h = f'(a) + f''(a) h / 2! + f'''(a) h^2 / 3! + ... = f'(a) + O(h),

thus proving

Theorem 9.3 Let f : R -> R be two times continuously differentiable. Then the error of the asymmetric difference decreases linearly with h, i.e.

    (f(a + h) - f(a)) / h = f'(a) + O(h).


Central Difference

    f'(x) = lim_{h -> 0} (f(x + h) - f(x - h)) / (2h)  ~  (f(x + h) - f(x - h)) / (2h)    for h != 0.

Is the central difference asymptotically better? Taylor expansion of f in x_0 = a:

    f(a + h) = f(a) + f'(a) h + f''(a) h^2 / 2! + f'''(a) h^3 / 3! + ...        (9.2)
    f(a - h) = f(a) - f'(a) h + f''(a) h^2 / 2! - f'''(a) h^3 / 3! + ...        (9.3)

Subtracting (9.3) from (9.2) and dividing by 2h leads to

    (f(a + h) - f(a - h)) / (2h) = f'(a) + f'''(a) h^2 / 3! + f^(5)(a) h^4 / 5! + f^(7)(a) h^6 / 7! + ... = f'(a) + O(h^2),

thus proving

Theorem 9.4 Let f : R -> R be three times continuously differentiable. Then the error of the symmetric difference decreases quadratically with h, i.e.

    (f(a + h) - f(a - h)) / (2h) = f'(a) + O(h^2).

Example 9.2 We compute the central difference with repeated Richardson extrapolation on the function f(x) = 1/x in x = 1 with h = 0.8, 0.4, 0.2, 0.1, 0.05, 0.025:

    h      F1(h)       D1/3      F2(h)      D2/15        F3(h)         D3/63        F4(h)           D4/255         F5(h)          D5/1023       F6(h)
    0.8    -2.777778
    0.4    -1.190476   0.529101  -0.661376
    0.2    -1.041667   0.049603  -0.992063  -0.0220459   -1.01410935
    0.1    -1.010101   0.010522  -0.999579  -0.0005010   -1.00008017   0.000222685  -0.999857481
    0.05   -1.002506   0.002532  -0.999975  -0.0000264   -1.00000105   0.000001256  -0.999999799    -0.0000005581  -1.00000036
    0.025  -1.000625   0.000627  -0.999998  -0.0000016   -1.000000016  0.000000016  -0.99999999934  -0.0000000008  -1.0000000001  0.0000000003  -0.9999999998
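An Octave sketch reproducing this tableau: central differences F_1(h) for f(x) = 1/x at x = 1, followed by repeated Richardson extrapolation with q = 2 and p_k = 2k (the exact derivative is f'(1) = -1):

    f = @(x) 1 ./ x;  a = 1;
    h = 0.8 ./ 2.^(0:5);                         % h = 0.8, 0.4, ..., 0.025
    F = zeros(6);
    F(:,1) = (f(a + h) - f(a - h))' ./ (2*h');   % F_1(h): central difference
    for k = 2:6                                  % divisors 3, 15, 63, 255, 1023
      for i = k:6
        F(i,k) = F(i,k-1) + (F(i,k-1) - F(i-1,k-1)) / (2^(2*(k-1)) - 1);
      end
    end
    disp(F(6,6));                                % very close to -1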

Second Derivative

    f''(x) = lim_{h -> 0} (f'(x + h/2) - f'(x - h/2)) / h
           ~ lim_{h -> 0} ( (f(x + h) - f(x))/h - (f(x) - f(x - h))/h ) / h
           = (f(x + h) - 2 f(x) + f(x - h)) / h^2

The approximation error can easily be shown to decrease quadratically with h by adding (9.3) to (9.2):

    (f(a + h) - 2 f(a) + f(a - h)) / h^2 = f''(a) + 2 f^(4)(a) h^2 / 4! + 2 f^(6)(a) h^4 / 6! + ...

It can be shown ([27], chapter 7) that, if we (recursively) use symmetric formulas for higher derivatives, the approximation error contains only even powers of h. As a consequence, the same Richardson extrapolation scheme can be applied.

9.3

Numerical Solution of Ordinary Differential Equations

We will use the common shorthand ODE for ordinary differential equation.

Initial Value Problems for Systems of ODEs

Given a function f(x, y), we want to find a function y(x) on an interval [a, b] which is an approximate solution of the first order ODE

    dy/dx = f(x, y)    with the initial condition    y(a) = c.

The order of a differential equation is the degree of the highest derivative occurring in the equation. If f is linear, then there are symbolic solutions. Many applications can be modelled by systems of first order ODEs

    d(phi_i)/dx = f_i(x, phi_1, ..., phi_s)    (i = 1, ..., s)

for the unknown functions phi_1(x), ..., phi_s(x) with the initial conditions phi_i(a) = c_i (i = 1, ..., s).

Such a system can be written in vector form. With

    y = (phi_1(x), ..., phi_s(x))^T,    c = (c_1, ..., c_s)^T,    f = (f_1, ..., f_s)^T

the system reads

    dy/dx = f(x, y),    y(a) = c.


Example 9.3 ODEs of higher order can be transformed into a system of first order ODEs. For the third order ODE d^3y/dx^3 = g(x, y, dy/dx, d^2y/dx^2) with the initial conditions y(0) = c_1, y'(0) = c_2, y''(0) = c_3, we substitute phi_1 = y, phi_2 = dy/dx, phi_3 = d^2y/dx^2 and get

    d(phi_1)/dx = phi_2,                      phi_1(0) = c_1
    d(phi_2)/dx = phi_3,                      phi_2(0) = c_2
    d(phi_3)/dx = g(x, phi_1, phi_2, phi_3),  phi_3(0) = c_3

Theorem 9.5 Any system of ODEs can be transformed into an equivalent system of ODEs with derivatives of order one only.

The Euler Method

We discretize the interval [a, b] into subintervals of width h by x_i = a + ih (i = 0, 1, ...) and y_0 = y(a) = c, and we want to compute the values y_1, y_2, ... as approximations of the exact values y(x_1), y(x_2), .... We approximate the system of ODEs by

    dy/dx ~ (y_{n+1} - y_n) / h = f(x_n, y_n),

yielding the recursion

    y_0 = c,    y_{n+1} = y_n + h f(x_n, y_n)    (n = 0, 1, 2, ...).

The approximation error of the Euler method can be estimated using the Taylor expansion

    y(x_{n+1}) = y(x_n) + y'(x_n) h + y'' h^2 / 2! + y''' h^3 / 3! + ...

The error then is

    (y(x_{n+1}) - y(x_n)) / h - y'(x_n) = y'' h / 2! + y''' h^2 / 3! + ....

One can thus apply Richardson extrapolation with p_k = k.
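A minimal Octave sketch of the Euler recursion, applied to the test problem y' = y, y(0) = 1 used in Figure 9.4; the step size h = 0.1 matches the table there:

    f = @(x, y) y;                 % right hand side f(x, y)
    a = 0;  b = 0.6;  h = 0.1;
    x = a:h:b;
    y = zeros(size(x));  y(1) = 1; % y_0 = c
    for n = 1:numel(x)-1
      y(n+1) = y(n) + h * f(x(n), y(n));   % y_{n+1} = y_n + h f(x_n, y_n)
    end
    disp([x' y' exp(x')]);         % numeric solution next to the exact solution e^x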

Figure 9.3: Solution polygon of the Euler method.


    x_n   y(x_n)   y_n (h=0.1)   error   y_n (h=0.2)   error
    0     1.00     1.00          0       1.00          0
    0.1   1.105    1.1           0.005
    0.2   1.221    1.21          0.011   1.2           0.021
    0.3   1.350    1.331         0.019
    0.4   1.492    1.464         0.028   1.44          0.052
    0.5   1.649    1.611         0.038
    0.6   1.822    1.772         0.050   1.728         0.094

Figure 9.4: Results of the Euler method applied to the ODE y' = y with y(0) = 1 for h = 0.1 and h = 0.2.

Runge-Kutta Methods

The error of the Euler method is due to the linear approximation of y(x) in x_n, as can be seen in Figure 9.3. This can be improved by averaging over an appropriately chosen combination of values of the function f(x, y). The simplest formula of this type, the Heun method, uses a symmetric average of f(x_n) and f(x_{n+1}), with the consequence that (y_{n+1} - y_n)/h is effectively used as a symmetric approximation of dy/dx in x_n + h/2:

    dy/dx ~ (y_{n+1} - y_n) / h = 1/2 ( f(x_n, y_n) + f(x_{n+1}, y_n + h f(x_n, y_n)) )

Solving this for y_{n+1} leads to the recursion scheme

    k_1 = h f(x_n, y_n)
    k_2 = h f(x_n + h, y_n + k_1)
    y_{n+1} = y_n + 1/2 (k_1 + k_2)

We use the notation y(x, h) for the numeric result with step width h obtained from applying the recursion scheme. We get a quadratic approximation error

    y(x, h) = y(x) + c_2(x) h^2 + c_3(x) h^3 + c_4(x) h^4 + ...


with the exponents p_k = 2, 3, 4, 5, ... for Richardson extrapolation. An even better scheme, known as fourth order Runge-Kutta or classical Runge-Kutta, is

    k_1 = h f(x_n, y_n)
    k_2 = h f(x_n + h/2, y_n + k_1/2)
    k_3 = h f(x_n + h/2, y_n + k_2/2)
    k_4 = h f(x_n + h, y_n + k_3)
    y_{n+1} = y_n + 1/6 (k_1 + 2 k_2 + 2 k_3 + k_4)

with the approximation error

    y(x, h) = y(x) + c_4(x) h^4 + c_5(x) h^5 + ...

and p_k = 4, 5, 6, .... Figure 9.5 shows a comparison between the three methods presented so far for solving first order initial value problems. It clearly confirms the theoretical results w.r.t. the approximation error, which are: Euler method O(h), Heun method O(h^2), Runge-Kutta O(h^4).

    x_n   y(x_n)    Euler y_n   error   Heun y_n   error     Runge-Kutta y_n   error
    0     1.00      1.00        0       1.00       0         1.00              0
    0.1   1.10517   1.1         0.005   1.105      0.00017   1.10517           8.5e-8
    0.2   1.22140   1.21        0.011   1.22103    0.00038   1.22140           1.9e-7
    0.3   1.34986   1.33        0.019   1.34923    0.00063   1.34986           3.1e-7
    0.4   1.49182   1.46        0.028   1.4909     0.00092   1.49182           4.6e-7
    0.5   1.64872   1.61        0.038   1.64745    0.00127   1.64872           6.3e-7
    0.6   1.82212   1.77        0.051   1.82043    0.00169   1.82212           8.4e-7
Figure 9.5: Comparison of the Euler method, Heun method and Runge-Kutta applied to the ODE y' = y with y(0) = 1 and h = 0.1.

Often the selection of an appropriately small step size h is critical for good results of all described methods. This can be automatized with methods that adapt the step size (see [12]).

Example 9.4 We want to solve a classical predator-prey system from biology. y_1(t) may be a population of sheep and y_2(t) a population of wolves. With no wolves the sheep breed nicely. Breeding of the wolves increases monotonically with the number of wolves and sheep. But with no sheep, the wolves will die out. The ODEs from Lotka-Volterra are [12]:

    y_1'(t) = y_1(t) (1 - y_2(t))
    y_2'(t) = alpha y_2(t) (y_1(t) - 1)

With the Runge Kutta method we can easily compute the population dynamics for this system. A sample plot is shown in Figure 9.6.
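An Octave sketch of the classical fourth order Runge-Kutta scheme applied to this system with alpha = 10, t in [0, 5] and h = 0.05 as in Figure 9.6; the placement of alpha in the predator equation follows the reconstruction above, and the initial populations are assumptions chosen only for illustration:

    alpha = 10;
    f = @(t, y) [ y(1) * (1 - y(2));
                  alpha * y(2) * (y(1) - 1) ];
    h = 0.05;  t = 0:h:5;
    Y = zeros(2, numel(t));  Y(:,1) = [2; 2];      % assumed initial populations y1(0), y2(0)
    for n = 1:numel(t)-1
      k1 = h * f(t(n),       Y(:,n));
      k2 = h * f(t(n) + h/2, Y(:,n) + k1/2);
      k3 = h * f(t(n) + h/2, Y(:,n) + k2/2);
      k4 = h * f(t(n) + h,   Y(:,n) + k3);
      Y(:,n+1) = Y(:,n) + (k1 + 2*k2 + 2*k3 + k4) / 6;
    end
    plot(t, Y(1,:), t, Y(2,:));                    % population dynamics y1(t), y2(t)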


Figure 9.6: Population dynamics (y1(t), y2(t)) for alpha = 10, t = 0, ..., 5, h = 0.05.

Boundary Value Problems for Second Order ODEs

As already mentioned in Example 9.3, whenever a second order ODE can be written as y'' = f(x, y, y'), it can be transformed into a system of two first order ODEs and then be solved with the methods already described. We will now sketch ideas for a direct solution of scalar second order boundary value problems of the form y'' = f(x, y, y') with the boundary conditions y(a) = alpha, y(b) = beta. We discretize the derivatives by

    y'(x_n) ~ (y_{n+1} - y_{n-1}) / (2h)    and    y''(x_n) ~ (y_{n+1} - 2 y_n + y_{n-1}) / h^2

on the interval [a, b] with b - a = mh and x_i = a + ih. Here y_i is the approximation of y(x_i). We obtain the (typically nonlinear) system of equations

    y_0 = alpha,
    y_{n+1} - 2 y_n + y_{n-1} = h^2 f(x_n, y_n, (y_{n+1} - y_{n-1}) / (2h))    (n = 1, 2, 3, ..., m-1),
    y_m = beta.

With f = (f_1, ..., f_{m-1})^T and f_n = f(x_n, y_n, (y_{n+1} - y_{n-1}) / (2h)) we can write the system in matrix form

    A y = h^2 f(y) - r        (9.4)

with

    A = ( -2   1                 )
        (  1  -2   1             )
        (      1  -2   1         )
        (          ...  ...  ... )
        (            1  -2    1  )
        (                1   -2  ),

y = (y_1, y_2, ..., y_{m-1})^T, f(y) = (f_1, f_2, ..., f_{m-1})^T and r = (alpha, 0, ..., 0, beta)^T.

If the differential equation is linear, this is a linear system that can be solved in linear time with the tridiagonal algorithm described in Section 6.2.2. Since we used symmetric approximation formulas for the derivatives, the approximation error is

    y(x, h) = y(x) + c_1(x) h^2 + c_2(x) h^4 + c_3(x) h^6 + ...

In the nonlinear case one can use the iterative approach

    A y^{k+1} = h^2 f(y^k) - r        (9.5)

where y^k stands for the value of y after k iterations. As initial values one can use a linear interpolation between the two boundary values y_0 = y(a) = alpha, y_m = y(b) = beta:

    y_i^0 = alpha + (beta - alpha) i/m.

Multiplication of Equation 9.5 with A^(-1) gives

    y^{k+1} = h^2 A^(-1) f(y^k) - A^(-1) r.

This is a fixed point iteration y^{k+1} = F(y^k) for solving the fixed point equation

    y = F(y)        (9.6)

with F(y) = h^2 A^(-1) f(y) - A^(-1) r. A generalization of the Banach fixed point theorem from Section 5.3.2 can be applied here if F is a contraction. This means, if for any vectors x, y there is a nonnegative real number L < 1 with ||F(x) - F(y)|| <= L ||x - y||, the iteration converges to the unique solution of Equation 9.6 (or equivalently Equation 9.4).
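An Octave sketch of the finite difference scheme (9.4) for a linear boundary value problem; the test problem y'' = 6x, y(0) = 0, y(1) = 1 (exact solution y = x^3) is an illustrative assumption, and for brevity a generic solve replaces the tridiagonal algorithm of Section 6.2.2:

    m = 50;  h = 1/m;  x = (1:m-1)' * h;
    alpha = 0;  beta = 1;
    A = diag(-2*ones(m-1,1)) + diag(ones(m-2,1), 1) + diag(ones(m-2,1), -1);
    r = zeros(m-1, 1);  r(1) = alpha;  r(end) = beta;
    fvec = 6 * x;                    % f(x_n, y_n, y'_n) = 6 x_n, independent of y here
    y = A \ (h^2 * fvec - r);        % solve A y = h^2 f - r
    max(abs(y - x.^3))               % agrees with x^3 up to rounding, since y'''' = 0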


The Cart-Pole Problem

    (M + m) x'' - m l theta'' cos(theta) + m l (theta')^2 sin(theta) = 0
    m l (g sin(theta) - x'' cos(theta) + l theta'') = 0

9.4

Linear Differential Equations with Constant Coefficients


To solve the one-dimensional first order ODE1

    dy/dx = lambda y

with the initial value y(0), we try y(x) = a e^(lambda x) and get

    y(x) = y(0) e^(lambda x).

Systems of Linear Differential Equations with Constant Coefficients

To solve

    dy/dx = A y

with the initial value y(0), we try y(x) = u e^(lambda x). Substitution leads to the eigenvalue problem

    A u = lambda u        (9.7)

1 We follow section 6.3 in [1].

Example

To solve

    dy/dx = ( 1  2 ) y    with    y(0) = ( 5 )
            ( 2  1 )                     ( 4 )        (9.8)

we have to solve A u = lambda u and get the characteristic equation

    (1 - lambda)(1 - lambda) - 4 = 0

with the solutions lambda_1 = 3 and lambda_2 = -1 and the eigenvectors

    u_1 = ( 1 ),    u_2 = (  1 ).
          ( 1 )           ( -1 )

The particular solutions are

    y_1(x) = u_1 e^(lambda_1 x)    and    y_2(x) = u_2 e^(lambda_2 x).

The linear combinations

    y(x) = a_1 u_1 e^(lambda_1 x) + a_2 u_2 e^(lambda_2 x)

represent the subspace of all solutions of equation 9.7. For x = 0 we get

    y(0) = a_1 u_1 + a_2 u_2 = (u_1 u_2) ( a_1 )
                                         ( a_2 ).

For the example (equation 9.8) this gives

    ( 1   1 ) ( a_1 )   ( 5 )
    ( 1  -1 ) ( a_2 ) = ( 4 )    or    a_1 + a_2 = 5,  a_1 - a_2 = 4,

yielding a_1 = 9/2 and a_2 = 1/2, and the solution to our initial value problem is

    y(x) = ( 9/2 ) e^(3x) + (  1/2 ) e^(-x).
           ( 9/2 )          ( -1/2 )
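An Octave check of this example; the eigen-decomposition of A reproduces the closed-form solution, and the matrix exponential expm(A*x)*y(0) gives the same values (x = 0.5 is an arbitrary evaluation point chosen for the check):

    A  = [1 2; 2 1];
    y0 = [5; 4];
    [U, L] = eig(A);                 % columns of U: eigenvectors, diag(L): eigenvalues -1, 3
    a = U \ y0;                      % solve (u_1 u_2)(a_1; a_2) = y(0)
    x = 0.5;
    y_eig    = U * (a .* exp(diag(L) * x));
    y_closed = 9/2*[1;1]*exp(3*x) + 1/2*[1;-1]*exp(-x);
    disp([y_eig  y_closed  expm(A*x)*y0]);   % all three columns agree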

Second Order Linear ODEs with Constant Coefficients

Many mechanical systems can be described by the second order linear ODE2

    m x'' + b x' + k x = 0        (9.9)

with x' = dx/dt, the derivative w.r.t. time t, and

- m x'' = resulting force on point mass m (Newton's law)
- b x'  = friction proportional to speed (damping)
- k x   = elastic restoring force (linear spring)

2 Figure from http://en.wikipedia.org/wiki/File:Mass-Spring-Damper.png

Transformation to a system of first order ODEs

    m x'' + b x' + k x = 0

We substitute x' = v and thus x'' = v' and get the first order system

    x' = v,    m v' + b v + k x = 0    or    x' = v,    m v' = -k x - b v.

In matrix form:

    ( x' )   (  0       1     ) ( x )
    ( v' ) = ( -alpha  -beta  ) ( v )        (9.10)

with alpha = k/m and beta = b/m. Eigenvalue problem:

    | -lambda        1          |
    | -alpha    -beta - lambda  | = 0

Characteristic equation:

    (-lambda)(-beta - lambda) + alpha = lambda^2 + beta lambda + alpha = 0

with the solutions

    lambda_{1,2} = -beta/2 +- sqrt(beta^2/4 - alpha).

The corresponding eigenvectors are

    u_1 = (    1     )    and    u_2 = (    1     ).
          ( lambda_1 )               ( lambda_2 )

The solutions of the ODE system (9.10) are

    ( x )                                                (    1     )                   (    1     )
    ( v ) = a_1 u_1 e^(lambda_1 t) + a_2 u_2 e^(lambda_2 t) = a_1 ( lambda_1 ) e^(lambda_1 t) + a_2 ( lambda_2 ) e^(lambda_2 t).

We only look at the x-component:

    x(t) = a_1 e^(lambda_1 t) + a_2 e^(lambda_2 t)

Eigenvalues may be complex: lambda = r + i omega. Then

    e^(lambda t) = e^(rt + i omega t) = e^(rt) e^(i omega t) = e^(rt) (cos omega t + i sin omega t).

Since |e^(i omega t)| = sqrt(cos^2 omega t + sin^2 omega t) = 1, the real factor e^(rt) determines whether the solution is stable.

Definition 9.1 We call a matrix A stable if all eigenvalues have negative real parts.


The complex part cos omega t + i sin omega t produces oscillations. The solution is purely exponential only if the eigenvalues are real, i.e. if beta^2/4 - alpha > 0. For alpha > 0 and beta > 0 this means beta > 2 sqrt(alpha), or b > 2 sqrt(km). With the damping ratio zeta = b / (2 sqrt(km)) we get the solution diagram3. In the 2-dimensional (x, v)-space we get the solutions shown in the following plots.


Plot of x(t), v(t) (left) and the x, v phase diagram for alpha = 1, beta = 0 (right).
Plot of x(t), v(t) (left) and the x, v phase diagram for alpha = 0.5, beta = 0.1 (right).

Back to nonlinear ODEs

We consider the following system of two nonlinear ODEs:

    y_1' = alpha y_1 - y_2 - y_1 (y_1^2 + y_2^2)
    y_2' = y_1 + alpha y_2 - y_2 (y_1^2 + y_2^2)
3 Figure from http://en.wikipedia.org/wiki/Harmonic_oscillator

Plot of y1(t), y2(t) (left) and the y1, y2 phase diagram for alpha = -0.1 (right).

Hopf Bifurcation
Plot of y1(t), y2(t) (left) and the y1, y2 phase diagram for alpha = 0.2 (right).

Hopf Bifurcation
Plot of y1(t), y2(t) (left) and the y1, y2 phase diagram for the same setting (alpha = 0.2), but different initial values.


Hopf Bifurcation, Properties45

- The limit cycle is a stable attractor.
- Supercritical Hopf bifurcation.
- alpha < 0: stable dynamics (converges to a steady point).
- alpha >= 0: unstable dynamics.
- The first Lyapunov coefficient is negative.

Definition 9.2 The appearance or the disappearance of a periodic orbit through a local change in the stability properties of a steady point is known as Hopf bifurcation.

Unstable Attractor

We slightly modify the system of ODEs:

    y_1' = alpha y_1 - y_2 + y_1 (y_1^2 + y_2^2)
    y_2' = y_1 + alpha y_2 + y_2 (y_1^2 + y_2^2)

Plot of y1(t), y2(t) (left) and the y1, y2 phase diagram (right) for alpha = -0.2 and y^T(0) = (0, 0.447).


Plot of y1(t), y2(t) (left) and the y1, y2 phase diagram (right) for alpha = -0.2 and y^T(0) = (0, 0.448).


4 www.scholarpedia.org/article/Andronov-Hopf_bifurcation
5 en.wikipedia.org/wiki/Hopf_bifurcation

Unstable Attractor, Properties

- The limit cycle is an unstable attractor.
- Subcritical Hopf bifurcation.
- alpha < 0: the origin is a stable steady point.
- alpha >= 0: unstable dynamics (divergence).
- The first Lyapunov coefficient is positive.

The Lorenz Attractor6

    x' = sigma (y - x)
    y' = x (rho - z) - y
    z' = x y - beta z

A simple model of atmospheric convection. Chaotic attractor.
The Logistic Equation

Similar chaotic dynamics as in the Lorenz attractor can be observed in the following discrete population model:

- Reproduction proportional to q_r q_v X_n.
- Animals die proportional to q_d (C - X_n).

6 en.wikipedia.org/wiki/Lorenz_attractor

- C = capacity of the habitat.
- X_{n+1} = q_r q_v X_n (C - X_n).

Simplification (C = 1): x_{n+1} = r x_n (1 - x_n).
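A tiny Octave sketch iterating this recursion; with x_0 = 0.1 and r = 2.2, 3.2 or 3.5 it reproduces the value sequences listed below (r and the number of iterations are set here only for illustration):

    r = 3.5;  x = 0.1;
    for n = 1:40
      printf("%.5f ", x);
      x = r * x * (1 - x);     % x_{n+1} = r x_n (1 - x_n)
    end
    printf("\n");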

The Logistic Equation, Values

r = 2.2000: 0.10000 0.19800 0.34935 0.50007 0.55000 0.54450 0.54564 0.54542 0.54546 0.54545

r = 3.2000: 0.10000 0.28800 0.65618 0.72195 0.64237 0.73514 0.62307 0.75153 0.59754 0.76955 0.56749 ... 0.79945 0.51305 0.79946 0.51304 0.79946 0.51304

r = 3.5000: 0.10000 0.31500 0.75521 0.64703 0.79933 0.56140 0.86181 0.41684 0.85079 0.44431 0.86414 0.41090 0.84721 ... 0.50089 0.87500 0.38282 0.82694 0.50088 0.87500 0.38282 0.82694

The Feigenbaum Diagram7

In the following bifurcation diagram we see the limit values drawn over the parameter value r:

7 de.wikipedia.org/wiki/Logistische_Gleichung


The End Thank you for attending the lectures! Thank you for working hard on the exercises! I wish you fun with Mathematics, with the exercises and with ... I wish you all the best for the exam!!!

9.5 Exercises

9.5.1 Numerical Integration and Differentiation
Exercise 9.1 Let h = x_i - x_{i-1}. Calculate the integral

    Integral_{x_{i-1}}^{x_i} (x - x_{i-1})(x - x_i) dx

using the substitution x = x_{i-1} + ht with the new variable t.

Exercise 9.2 Write a program for the numerical approximate computation of the integral of a function f in the interval [a, b].
a) Write a function T for the computation of the integral with the trapezoidal rule on an equidistant grid with n equal sub-intervals.
b) Apply the function T with n and 2n sub-intervals to increase the accuracy with Richardson extrapolation.



c) Apply your functions to Integral_0^1 e^x dx and produce a table of the approximation error depending on the step size h (1/20 <= h <= 1).
d) Show using the above table that the error decreases quadratically for h -> 0.

Exercise 9.3
a) Compute the area of a unit circle using both presented Monte Carlo methods (naive, and mean of function values) to an accuracy of at least 10^-3.
b) Produce for both methods a table of the deviations of the estimated value depending on the number of trials (random number pairs) and draw this function. What can you say about the convergence of this method?
c) Compute the volume of a four-dimensional unit sphere to a relative accuracy of 10^-3. How much more running time do you need?

Exercise 9.4
a) Compute the first derivative of the function cos(x)/x in x = 2 with the symmetric difference formula and h = 0.1.
b) Apply Richardson extrapolation to compute F_4(h).
c) Compare the error of F_4(h) with the theoretical estimate given in Theorem 9.2.
d) Use the table of function values of the function f given below to approximate the derivative f'(x). Apply repeated Richardson extrapolation to get F_2(h), F_3(h) and F_4(h). Plot the resulting functions.

    x    | 0.5   | 0.75     | 1. | 1.25     | 1.5  | 1.75    | 2. | 2.25     | 2.5      | 2.75   | 3.
    f(x) | -3.75 | -1.36607 | 0. | 0.729167 | 1.05 | 1.10795 | 1. | 0.793269 | 0.535714 | 0.2625 | 0.

9.5.2 Differential Equations

Exercise 9.5
a) Write programs that implement the Euler, Heun and Runge-Kutta methods for solving first order initial value problems.
b) Implement the Richardson extrapolation scheme for these methods.

Exercise 9.6 The initial value problem

    dy/dx = sin(xy),    y_0 = y(0) = 1

is to be solved numerically for x in [0, 10].
a) Compare the Euler, Heun and Runge-Kutta methods on this example. Use h = 0.1.
b) Apply Richardson extrapolation to improve the results in x = 5 for all methods. (Attention: use the correct p_k for each method.)

Exercise 9.7 Apply the Runge-Kutta method to the predator-prey example 9.4 and experiment with the parameter alpha and the initial values. Try to explain the population results biologically.

Exercise 9.8 Use Runge-Kutta to solve the initial value problem

    dy/dx = x sin(xy),    y_0 = y(0) = 1

for x in [0, 20]. Report about problems and possible solutions.

Exercise 9.9 The following table shows the differences between the approximations computed with Richardson extrapolation for some numeric algorithm. Determine from the table the convergence order of the algorithm for h -> 0 and all the exponents p_i in the Taylor expansion for F(h). (Hint: these differences are an approximation of the error on the respective approximation level.)

    h       D1         D2           D3            D4            D5
    1       0.075433
    0.5     0.018304   0.0001479
    0.25    0.004542   9.106e-06    3.492e-08
    0.125   0.001133   5.670e-07    5.409e-10     1.208e-12
    0.0625  0.000283   3.540e-08    8.433e-12     4.691e-15     6.847e-18

Exercise 9.10 (challenging) The dynamics of the inverted pendulum, also called cart-pole system, as shown beside, can be described by the following two differential equations of second order. Here x', x'', etc. are the first and second derivatives w.r.t. the time t. A derivation of these equations can be found on Wikipedia (not required here).

    (M + m) x'' - m l theta'' cos(theta) + m l (theta')^2 sin(theta) = 0        (9.11)
    m l (g sin(theta) - x'' cos(theta) + l theta'') = 0                         (9.12)

a) Use the substitution y_1 = x, y_2 = x', y_3 = theta, y_4 = theta' to obtain a system of 4 first order ODEs of the form y' = f(y). (Hint: make sure the right hand sides of the differential equations contain no derivatives!)
b) Apply the Runge-Kutta method to solve the system for g = 9.81, M = 1, m = 1 with the initial condition y_1(0) = 0, y_2(0) = 0, y_3(0) = 0.01, y_4(0) = 0.
c) Plot the functions y_1(t), y_2(t), y_3(t), y_4(t) and try to understand them.
d) Experiment with other initial conditions and other masses, e.g. m = 1, M = 100000 or M = 1, m = 100000.

Exercise 9.11 Prove that, if y_1 and y_2 are solutions of the ODE y' = A y, then any linear combination of y_1 and y_2 is also a solution.

Exercise 9.12 Prove that the eigenvectors of the matrix

    (  0       1    )
    ( -alpha  -beta )

from equation 9.10 with the eigenvalues lambda_1 and lambda_2 are (1, lambda_1)^T and (1, lambda_2)^T.

Exercise 9.13
a) Solve the initial value problem m x'' + b x' + k x = 0 with x(0) = 0 and x'(0) = 10 m/s for the parameters m = 10 kg, b = 2 kg/s, k = 1 kg/s^2. Plot the resulting function x(t).
b) The general solution involves a complex component i sin(omega t). Does it make sense to have a complex sine wave as solution for an ODE with real coefficients and real initial conditions? What is the natural solution for this problem?


Exercise 9.14 Linearize the Lotka-Volterra ODEs and show that this is no good model for a predator-prey system. To do this:
a) Calculate the Jacobian matrix of the right hand side of the ODEs at y(0) and set up the linearized ODEs.
b) Calculate the eigenvalues of the Jacobian and describe the solutions of the linearized system.

Exercise 9.15 Download the Octave/Matlab code for the Lorenz attractor from http://en.wikipedia.org/wiki/Lorenz_attractor. Modify the code to dynamically follow a trajectory and observe the chaotic dynamics of the system.

Bibliography
[1] G. Strang. Introduction to linear algebra. Wellesley Cambridge Press, 3rd edition, 2003. 1.1, 5.2, 63, 1 [2] Gilbert Strang. Linear Algebra and its applications. Harcourt Brace Jovanovich College Publishers, 1988. 8.2.2.4 [3] R. Hamming. Numerical Methods for Scientists and Engineers. Dover Publications, 1987. [4] W. Cheney and D. Kincaid. Numerical mathematics and computing. Thomson Brooks/Cole, 2007. [5] S.M. Ross. Introduction to probability and statistics for engineers and scientists. Academic Press, 2009. [6] J. Nocedal and S.J. Wright. Numerical optimization. Springer Verlag, 1999. 8.2.2.4 [7] C.M. Bishop. Pattern recognition and machine learning. Springer New York:, 2006. 7.4 [8] M. Brill. Mathematik f ur Informatiker. Hanser Verlag, 2001. Sehr gutes Buch, das auch diskrete Mathematik beinhaltet. [9] M. Knorrenschild. Numerische Mathematik. Hanser Verlag, 2005. [10] F. Reinhardt and H. Soeder. dtvAtlas zur Mathematik, Band 1 und Band 2: Algebra und Grundlagen. Deutscher Taschenbuchverlag, M unchen, 1977. [11] H. Sp ath. Numerik. Vieweg, 1994. Leicht verst andlich, voraussichtlich werden gr o sere Teile der Vorlesung aus diesem Buch entnommen. [12] H. R. Schwarz. Numerische Mathematik. Teubner Verlag, 1988. Gutes Buch, sehr ausf uhrlich. 5.3.2, 105, 9.4 [13] S. Wolfram. Mathematica, A System for Doing Mathematics by Computer. Addison Wesley, 1991. Das Standardwerk des Mathematica-Entwicklers. Daneben gibt es viele andere B ucher u ber Mathematica. [14] P. J. Fleming and J. J. Wallace. How not to Lie with Statistics: The Correct Way to Summarize Benchmark Results. Comm. of the ACM, 29(3):218221, 1986. [15] J.E. Smith. Characterizing Computer Performance with a Single Number. Communications of the ACM, 31(10):12021206, 1988. [16] J. Acz el. Lectures on Functional Equations and Their Applications, pages 148151, 240244, 291. Academic Press, New York/London, 1966.


[17] W. Ertel. On the Denition of Speedup. In PARLE94, Parallel Architectures and Languages Europe, Lect. Notes in Comp. Sci. 817, pages 289300. Springer, Berlin/New York, 1994. [18] D.E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming. Addison-Wesley, 3rd edition, 1997. [19] U. Maurer. A universal statistical test for random bit generators. Journal of Cryptography, 5(2):89105, 1992. 7.1.1 [20] G. Marsaglia. A current view of random number generators. In Computer Science and Statistics: The Interface., pages 310. Elsevier Science, 1985. [21] W. Ertel and E. Schreck. Real random numbers produced by a maxtor disk drive. http://www.hs-weingarten.de/~ertel/rrng/maxtor.html, 2000. 7.1.7 [22] J. von Neumann. Various techniques used in connection with random digits. In von Neumanns Collected Works, volume 5. Pergamon Press, 1963. [23] L. Blum, M. Blum, and M. Shub. A simple unpredictable pseudo-random number generator. SIAM Journal of Computing, 15(2):364383, 1986. 7.1.5.1 [24] M.J.D Powell. Radial basis functions for multivariable interpolation: a review. IMA conference on Algorithms for the Approximation of Function and Data, 1985. 8.2.2.2 [25] Broomhead D.S and Lowe D. Multivariable functional interpolation and adaptive networks. Complex Systems 2, 1988. 8.2.2.4 [26] Wolfgang Ertel. Grundkurs K unstlische Intelligenz. Vieweg and Teubner, 2009. [27] T. Tierney, G. Dahlquist, and A. Bj orck. Numerical Methods. Dover Publication Inc., 2003. 83, 84, 94 [28] M. Li and P. Vitanyi. Two decades of applied kolmogorov complexity. In 3rd IEEE Conference on Structure in Complexity theory, pages 80101, 1988. 7.4 [29] B. Schneier. Angewandte Kryptogrphie. Addison-Wesley, 1996. Deutsche Ubersetzung. 7.1.4 [30] B. Jun and P. Kocher. The intel random number generator (white paper). http: //developer.intel.com/design/security/rng/rngppr.htm, 1999. 7.1.7 [31] Carl Edward Rasmussen and Christopher Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [32] J. Shawe Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. [33] David J. C. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1991.
