Numerical Analysis


CS450/ECE 491: Introduction to Scientific Computing

❑ Course webpage:
❑ https://relate.cs.illinois.edu/course/cs450-s22/

❑ Quizzes, homework, lecture notes, and links to recorded lectures
  will be found on the Relate page.

❑ Homework and quizzes will also be submitted on the Relate page.

❑ Exams (2 midterms and a final) will be at the CBTF.


CS450/ECE 491: Introduction to Scientific Computing
❑ Paul Fischer: 4320 Siebel
[email protected]
❑ Off. Hours: 11:30-1 WF (online, for now)

❑ TAs:

❑ Contact us by email if you have any questions!
Text Book: Scientific Computing, M. Heath

❑ To the extent possible, notation and slides will be


drawn from the text

❑ The lectures will be designed to expand on and


complement the text.

❑ We will use a combination of the text slides plus


additional notes.

❑ Slides will be posted on the Relate page.

❑ Lectures will be recorded.


Scientific Computing

• What is scientific computing?


– Design and analysis of algorithms for numerically solving
mathematical problems in science and engineering
– Traditionally called numerical analysis
• Distinguishing features of scientific computing
– Deals with continuous quantities
– Considers effects of approximations
• Why scientific computing?
– Simulation of natural phenomena
– Virtual prototyping of engineering designs
Simulation Example: Convective Transport

    ∂u/∂t = −c ∂u/∂x    (+ initial conditions, boundary conditions)

(See Fig. 11.1 in text.)    <example: convect demo>

❑ Examples:
  ❑ Ocean currents:
    ❑ Pollution
    ❑ Saline
    ❑ Thermal transport
  ❑ Atmosphere
    ❑ Climate
    ❑ Weather
  ❑ Industrial processes
    ❑ Combustion
    ❑ Automotive engines
    ❑ Gas turbines

❑ Problem Characteristics:
  ❑ Large (sparse) linear systems: n = 10^6 – 10^12

❑ Demands:
  ❑ Accurate approximations
  ❑ Fast (low-cost) algorithms
  ❑ Stable algorithms
Simulation Example: Convective Transport

o Temperature distribution in hot + cold mixing at T-junction


o 100 million dofs: 20 hours runtime on 16384 cores
o Solve several 10^8 × 10^8 linear systems every second
Numerical Simulation

❑ Related course material


❑ Linear systems - chapter 2
❑ Eigenvalues / eigenvectors - chapter 4
❑ Interpolation - chapter 7
❑ Numerical integration/differentiation - chapter 8
❑ Initial value problems - chapter 9
❑ Boundary value problems - chapter 10
❑ Numerical PDEs (simplified) - chapter 11
Data Fitting / Optimization

❑ Examples
❑ Weather prediction (data assimilation)
❑ Image recognition
❑ Etc.

❑ Related course material


❑ Linear / Nonlinear Least Squares - chapter 3
❑ Singular Value Decomposition - chapter 3/4
❑ Nonlinear systems - chapter 5
❑ Optimization - chapter 6
❑ Interpolation - chapter 7
Main Topics / Take-Aways for Chapter 1 (1/2)

• Conditioning of a problem
• Condition number
• Stability of an algorithm
• Errors
– Relative / absolute error
– Total error = computational error + propagated-data error
– Truncation errors
– Rounding errors
• Floating point numbers: IEEE 64
• Floating point arithmetic
– Rounding errors
– Cancellation
Main Topics / Take-Aways for Chapter 1 (2/2)

• Floating point numbers: IEEE 64


• Floating point arithmetic
– Rounding errors
– The Standard Model: fl(a op b) = (a op b)(1 + ε)
– Commutativity and associativity
– Cancellation

Well-Posed Problems

Problem is well-posed if solution


exists
is unique
depends continuously on problem data

Otherwise, problem is ill-posed

Even if problem is well posed, solution may still be


sensitive to input data

Computational algorithm should not make sensitivity worse


General Strategy

Replace difficult problem by easier one having same or


closely related solution
infinite → finite
differential → algebraic
nonlinear → linear
complicated → simple

Solution obtained may only approximate that of original


problem


Sources of Approximation

Before computation
modeling
empirical measurements
previous computations

During computation
truncation or discretization
rounding

Accuracy of final result reflects all of these

Uncertainty in input may be amplified by problem → Conditioning!

Perturbations during computation may be amplified by algorithm → Stability!


Example: Approximations

Computing surface area of Earth using formula A = 4πr²
involves several approximations

Earth is modeled as sphere, idealizing its true shape
Value for radius is based on empirical measurements and
previous computations
Value for π requires truncating infinite process
Values for input data and results of arithmetic operations
are rounded in computer


Absolute Error and Relative Error

Absolute error:  approximate value − true value

Relative error:  absolute error / true value

Equivalently, approx value = (true value) × (1 + rel error)

True value usually unknown, so we estimate or bound


error rather than compute it exactly

Relative error often taken relative to approximate value,


rather than (unknown) true value


Data Error and Computational Error

Typical problem: compute value of function f : R → R for
given argument

    x    = true value of input
    f(x) = desired result
    x̂    = approximate (inexact) input
    f̂    = approximate function actually computed

Total error:  f̂(x̂) − f(x)
           =  [f̂(x̂) − f(x̂)]  +  [f(x̂) − f(x)]
           =  computational error + propagated data error

Algorithm has no effect on propagated data error


Data Error and Computational Error, continued

Measuring the same decomposition with norms (triangle inequality):

    ‖f̂(x̂) − f(x)‖  ≤  ‖f̂(x̂) − f(x̂)‖  +  ‖f(x̂) − f(x)‖
                      computational error + propagated data error

Algorithm has no effect on propagated data error



Truncation Error and Rounding Error


Truncation error : difference between true result (for actual
input) and result produced by given algorithm using exact
arithmetic
Due to approximations such as truncating infinite series or
terminating iterative sequence before convergence
Rounding error : difference between result produced by
given algorithm using exact arithmetic and result produced
by same algorithm using limited precision arithmetic
Due to inexact representation of real numbers and
arithmetic operations upon them
Computational error is sum of truncation error and
rounding error, but one of these usually dominates

< interactive example >

Taylor Series (Very important for SciComp!)

• Taylor series are fundamental to numerical methods and analysis.

• Newton's method, optimization algorithms, and numerical solution of
  differential equations all rely on understanding the behavior of functions
  in the neighborhood of a specific point or set of points.

• In essence, numerical methods convert calculus from the continuous back
  to the discrete.

• ( A way of avoiding calculus. :) )


Truncation Error Example

❑ Recall Taylor series:

• If f^(k) exists (is bounded) on [x, x + h], k = 0, . . . , m, then there exists
  a ξ ∈ [x, x + h] such that

      f(x + h) = f(x) + h f'(x) + (h²/2) f''(x) + · · · + (h^m/m!) f^(m)(ξ).

• Take m = 2:

      [f(x + h) − f(x)] / h   =   f'(x)   +   (h/2) f''(ξ)
          computable            desired result   truncation error

• Truncation error: |(h/2) f''(ξ)| ≈ (h/2) |f''(x)| as h → 0.

  – To be precise, (h/2) f''(ξ) = (h/2) f''(x) + O(h²).
    This simply says that, as we zoom in (h → 0), f(x) looks like a line.

• Can use the Taylor series to generate approximations to f'(x), f''(x),
  etc., by evaluating f at x, x ± h, x ± 2h.

• We then solve for the desired derivative and consider the limit h → 0.

Q: Suppose |f''(x)| ≈ 1.
   Can we take h = 10^-30 and expect

      | [f(x + h) − f(x)] / h  −  f'(x) |  ≤  10^-30 / 2 ?

   (taylor_demo.m)

A: Only if we can compute every term in the finite-difference formula
(our algorithm) with sufficient accuracy.
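The question above is exactly what taylor_demo.m explores. Below is a rough Python sketch of the same experiment (the course demo is a MATLAB script, so this stand-in is an assumption, not the original): for f = sin and x = 1 the measured error of the one-sided difference first shrinks like h/2, then grows again once round-off takes over.

```python
# Forward-difference error vs. h for f = sin, x = 1 (sketch; assumed stand-in
# for the MATLAB taylor_demo.m, not the original script).
import numpy as np

x = 1.0
exact = np.cos(x)                          # true value of f'(x)

for k in range(1, 17):
    h = 10.0**(-k)
    fd = (np.sin(x + h) - np.sin(x)) / h   # computable finite-difference quotient
    print(f"h = 1e-{k:02d}   |error| = {abs(fd - exact):.3e}")
# The error decreases roughly like h/2 down to h ~ 1e-8, then grows like eps/h.
```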

Example: Finite Difference Approximation

[Figure: log-log plot of total error in the forward-difference approximation
versus step size h; round-off error dominates for small h, truncation error
for large h.]

Example: Finite Difference Approximation

Error in finite difference approximation


    f'(x) ≈ [f(x + h) − f(x)] / h

exhibits tradeoff between rounding error and truncation error

Truncation error bounded by M h/2, where M bounds |f''(t)| for t near x

Rounding error bounded by 2ε/h, where error in function values bounded by ε

Total error minimized when h ≈ 2 √(ε/M)

Error increases for smaller h because of rounding error
and increases for larger h because of truncation error
❑ Round-Off

The computed approximation to the derivative of a C² (twice differentiable) function f has
an absolute error satisfying

    e_abs := | Δf/Δx − f' |  ≲  (h/2)|f''| + (2ε_M/h)|f|,                    (1)

where

    Δf/Δx := [ ⟨f(x + h)⟩ − ⟨f(x)⟩ ] / h                                     (2)

is the one-sided finite difference approximation to f'(x) and ⟨·⟩ indicates the truncated
floating point approximation to its argument.

Round-off error:
    ⟨f(x + h)⟩ = f(x + h)(1 + ε),
where ε is a random variable with |ε| ≤ ε_M.

The error in (1) consists of two parts, the truncation error, TE := (h/2)|f''|, and the
round-off error, RE := (2ε_M/h)|f|. For relatively large h, TE dominates, while RE dominates
for smaller h.

In the class notes and in the text we plotted e_abs, TE, and RE as a function of h for the case
f(x) = sin(x) and x = 1. We found a minimal realizable error (∼ √ε_M) and the corresponding
value of h.

Consider now the computed approximation to the second derivative, given by the central-
difference formula,

    Δ²f/Δx² := [ ⟨f(x + h)⟩ − 2⟨f(x)⟩ + ⟨f(x − h)⟩ ] / h².                   (3)
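The same experiment can be repeated for the central-difference second derivative in (3). A sketch (an assumption, not from these notes), again with f(x) = sin(x) and x = 1:

```python
# Central-difference second derivative, eq. (3), for f = sin at x = 1 (sketch).
import numpy as np

x = 1.0
exact = -np.sin(x)                     # f''(x) for f = sin

for k in range(1, 9):
    h = 10.0**(-k)
    d2 = (np.sin(x + h) - 2*np.sin(x) + np.sin(x - h)) / h**2
    print(f"h = 1e-{k}   |error| = {abs(d2 - exact):.3e}")
# Truncation error ~ (h^2/12)|f''''|, round-off error ~ 4 eps |f| / h^2, so the
# minimum total error (roughly 1e-8 here) now occurs near h ~ 1e-4, not 1e-8.
```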

Example: Finite Difference Approximation

[Figure: the same error-vs-h plot, annotated with the truncation error model
(h/2) f''(x) and the round-off error model (2 ε_M / h) f(x); their crossover
marks the minimum total error.]
Round-Off Error
❑ In general, round-off error will prevent us from representing f(x)
and f(x+h) with sufficient accuracy to reach such a result.

❑ Round-off is a principal concern in scientific computing.


(Though once you’re aware of it, you generally know how to avoid
it as an issue.)

❑ Round-off results from having finite-precision arithmetic and finite-


precision storage in the computer. (e.g., how would you store 1/3
in a computer?)

❑ Most scientific computing is done either with 32-bit or 64-bit


arithmetic, with 64-bit being predominant. (IEEE 754-2008)

❑ Round-off error is effectively a source of noise (i.e. perturbation)


at every step of a calculation.
Round-Off Error
❑ We are of course all very familiar with round-off error.

❑ When we perform computation with pencil and paper, we inevitably
truncate infinite-decimal results, and propagate these forward.

[Figure: worked pencil-and-paper example with the truncated digits labeled
as round-off error.]

Forward and Backward Error

Suppose we want to compute y = f(x), where f : R → R,
but obtain approximate value ŷ

Forward error:   Δy = ŷ − y

Backward error:  Δx = x̂ − x,  where f(x̂) = ŷ

Backward Error Analysis

• User wants:  y = f(x).

• Algorithm produces:  ŷ = f̂(x).

• Backward error analysis asks

  – Is there an x̂ near x such that f(x̂) = ŷ?

• Why is this useful?

  – Maybe original data known only to tolerance ε ≥ |x − x̂|.

  – In other words,
    We can't distinguish small errors induced by the
    algorithm from acceptably small errors in the input.

Example: Forward and Backward Error

As approximation to y = √2, ŷ = 1.4 has absolute forward error

    |Δy| = |ŷ − y| = |1.4 − 1.41421…| ≈ 0.0142

or relative forward error of about 1 percent

Since √1.96 = 1.4, absolute backward error is

    |Δx| = |x̂ − x| = |1.96 − 2| = 0.04

or relative backward error of 2 percent
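A quick check of these numbers (a sketch, not part of the slides):

```python
# Forward and backward error of y_hat = 1.4 as an approximation to sqrt(2).
import math

y, y_hat = math.sqrt(2.0), 1.4
x, x_hat = 2.0, 1.4**2                 # f(x_hat) = y_hat exactly, so x_hat = 1.96

print(abs(y_hat - y))                  # absolute forward error ≈ 0.0142
print(abs(y_hat - y) / y)              # relative forward error ≈ 1%
print(abs(x_hat - x))                  # absolute backward error = 0.04
print(abs(x_hat - x) / x)              # relative backward error = 2%
```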


Backward Error Analysis

Idea: approximate solution is exact solution to modified


problem

How much must original problem change to give result


actually obtained?

How much data error in input would explain all error in


computed result?

Approximate solution is good if it is exact solution to nearby


problem

Backward error is often easier to estimate than forward


error


Example: Backward Error Analysis

Approximating cosine function f(x) = cos(x) by truncating
Taylor series after two terms gives

    ŷ = f̂(x) = 1 − x²/2

Forward error is given by

    Δy = ŷ − y = f̂(x) − f(x) = 1 − x²/2 − cos(x)

To determine backward error, need value x̂ such that f(x̂) = f̂(x)

For cosine function, x̂ = arccos(f̂(x)) = arccos(ŷ)


Example, continued

For x = 1,

    y = f(1) = cos(1) ≈ 0.5403
    ŷ = f̂(1) = 1 − 1²/2 = 0.5
    x̂ = arccos(ŷ) = arccos(0.5) ≈ 1.0472

Forward error:   Δy = ŷ − y ≈ 0.5 − 0.5403 = −0.0403

Backward error:  Δx = x̂ − x ≈ 1.0472 − 1 = 0.0472

Relative forward error ≈ 8%
Relative backward error ≈ 5%
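These values are easy to reproduce (a sketch, not from the slides):

```python
# Forward/backward error of the two-term Taylor approximation to cos at x = 1.
import math

x = 1.0
y = math.cos(x)                        # ≈ 0.5403
y_hat = 1 - x**2 / 2                   # = 0.5
x_hat = math.acos(y_hat)               # ≈ 1.0472

print(y_hat - y, (y_hat - y) / y)      # forward error ≈ -0.0403, relative ≈ -7.5%
print(x_hat - x, (x_hat - x) / x)      # backward error ≈ 0.0472, relative ≈ 4.7%
```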

Backward Error Analysis for Finite Difference Example

• User wants:  y = f(x).

• Algorithm produces:  ŷ = f̂(x).

• Backward error analysis asks

  – Is there an x̂ near x such that f(x̂) = ŷ?

[Figure: the points x, x̂, and x + h on an axis.]

• By Mean Value Theorem, there exists an x̂ ∈ [x, x + h] such that

      f'(x̂) = [f(x + h) − f(x)] / h.

  (Assuming f is differentiable on [x, x + h].)

• Backward error for the finite difference approximation is bounded by |h|,
assuming no round-off error.

Sensitivity and Conditioning

Problem is insensitive, or well-conditioned, if relative


change in input causes similar relative change in solution

Problem is sensitive, or ill-conditioned, if relative change in


solution can be much larger than that in input data

Condition number:

    cond = |relative change in solution| / |relative change in input data|

         = |[f(x̂) − f(x)] / f(x)|  /  |(x̂ − x) / x|   =   |Δy/y| / |Δx/x|

Problem is sensitive, or ill-conditioned, if cond ≫ 1

Note About Condition Number

❑ It’s tempting to say that a large condition number indicates that a


small change in the input implies a large change in the output.

❑ However, to be dimensionally correct, we need to be more precise.

❑ A large condition number indicates that a small relative change in
input implies a large relative change in the output:

    |Δy/y| ≈ cond × |Δx/x|

Condition Number

Condition number is amplification factor relating relative
forward error to relative backward error

    relative forward error  =  cond × relative backward error

Condition number usually is not known exactly and may
vary with input, so rough estimate or upper bound is used
for cond, yielding

    relative forward error  ≲  cond × relative backward error


Example: Evaluating Function

Evaluating function f for approximate input x̂ = x + Δx
instead of true input x gives

Absolute forward error:  f(x + Δx) − f(x) ≈ f'(x) Δx

Relative forward error:  [f(x + Δx) − f(x)] / f(x) ≈ f'(x) Δx / f(x)

Condition number:  cond ≈ | f'(x) Δx / f(x) | / | Δx / x |  =  | x f'(x) / f(x) |

Relative error in function value can be much larger or
smaller than that in input, depending on particular f and x


Example: Sensitivity

Tangent function is sensitive for arguments near π/2

    tan(1.57079) ≈ 1.58058 × 10^5
    tan(1.57078) ≈ 6.12490 × 10^4

Relative change in output is a quarter million times greater
than relative change in input

For x = 1.57079, cond ≈ 2.48275 × 10^5
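A sketch (not from the slides) recovering both numbers, using the formula cond ≈ |x f'(x)/f(x)| from the preceding slide and the finite perturbation quoted above:

```python
# Conditioning of tan(x) near pi/2 (sketch).
import math

x = 1.57079
cond_formula = abs(x * (1 + math.tan(x)**2) / math.tan(x))   # f'(x) = 1 + tan(x)^2

x0, x1 = 1.57078, 1.57079
rel_out = abs(math.tan(x1) - math.tan(x0)) / abs(math.tan(x0))
rel_in = abs(x1 - x0) / abs(x0)

print(f"{cond_formula:.5e}")           # ≈ 2.48e5
print(f"{rel_out / rel_in:.5e}")       # ≈ 2.5e5: about a quarter million
```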

Condition Number Examples

❑ Q: In our finite difference example, where did things go wrong?

Using the formula  cond = | x f'(x) / f(x) |,  what is
the condition number of the following?

    f(x) = a x

    f(x) = a / x

    f(x) = a + x
Condition Number Examples

    cond = | x f'(x) / f(x) |

For f(x) = a x:    f' = a,         cond = | x a / (a x) | = 1.

For f(x) = a / x:  f' = −a x⁻²,    cond = | x (−a x⁻²) / (a/x) | = 1.

For f(x) = a + x:  f' = 1,         cond = | x · 1 / (a + x) | = |x| / |a + x|.

• The condition number for (a + x) is < 1 if a and x are of the same sign,
but it is > 1 if they are of opposite sign, and potentially ≫ 1 if they are
of opposite sign but close to the same magnitude.
This ill-conditioning is often referred to as cancellation.

• Subtraction of two positive (or negative) values of nearly the same
magnitude is ill-conditioned.
• Multiplication and division are benign.

• Addition of two positive (or negative) values is also OK.

• In our finite difference example, the culprit is the subtraction, more
than the division by a small number:

    [f(x + h) − f(x)] / h  =  f'(x) + (h/2) f''(ξ).
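A small illustration of that subtraction losing digits (an assumption, not from the slides): with h = 1e-12, f(x + h) and f(x) agree in roughly 12 leading digits, so their computed difference retains only a few significant digits.

```python
# Catastrophic cancellation inside the forward-difference numerator (sketch).
import math

x, h = 1.0, 1e-12
a, b = math.sin(x + h), math.sin(x)

print(f"{a:.17f}")                      # the two operands agree in ~12 digits
print(f"{b:.17f}")
print(f"{a - b:.6e}")                   # only a few digits of this are meaningful
print(f"{math.cos(x) * h:.6e}")         # leading-order exact difference, ≈ h cos(x)
```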

Stability

Algorithm is stable if result produced is relatively


insensitive to perturbations during computation

Stability of algorithms is analogous to conditioning of


problems

From point of view of backward error analysis, algorithm is


stable if result produced is exact solution to nearby
problem

For stable algorithm, effect of computational error is no


worse than effect of small data error in input


Accuracy

Accuracy : closeness of computed solution to true solution


of problem

Stability alone does not guarantee accurate results

Accuracy depends on conditioning of problem as well as


stability of algorithm

Inaccuracy can result from applying stable algorithm to


ill-conditioned problem or unstable algorithm to
well-conditioned problem

Applying stable algorithm to well-conditioned problem


yields accurate solution

Examples of Potentially Unstable Algorithms

❑ Examples of potentially unstable algorithms include

❑ Gaussian elimination without pivoting

❑ Using the normal equations to solve linear least squares problems

❑ High-order polynomial interpolation with unstable bases (e.g.,


monomials or Lagrange polynomials on uniformly spaced nodes)
Unavoidable Source of Noise in the Input

❑ Numbers in the computer are represented in finite precision.

❑ Therefore, unless our set of input numbers, x, are perfectly
representable in the given mantissa, we already have an error
and our actual input is thus

    x̂ = x + Δx

❑ The next topic discusses the set of representable numbers.

❑ We’ll sometimes refer to this set of “floating point numbers” as F.

❑ We’ll primarily be concerned with two things –

  ❑ the relative precision,
  ❑ the maximum absolute value representable.
Floating-Point Numbers

• Floating-point number system is characterized by four integers

    β        base or radix
    p        precision
    (L, U)   exponent range

• Number is represented as

    x = ± ( d₀ + d₁/β + d₂/β² + · · · + d_{p−1}/β^{p−1} ) β^E

  – Here, we have p digits, d₀, d₁, . . . , d_{p−1}.
  – Each digit is in the range 0 ≤ dᵢ ≤ β − 1.
  – Representations are typically normalized, meaning that d₀ ≠ 0.
  – The exponent is in the interval L ≤ E ≤ U.

• Modern computers use base β = 2 (i.e., binary representation).


Floating-Point Number Examples

• Base 10:  x = ±1.2345678901234567 × 10^(±9)

• Base 2:   x = ±1.0101010101010101 × 2^(±1001)   (i.e., × 2^(±9))

• Base 16:  x = ±1.23456789abcdef01 × 16^(±9)


Floating-Point Numbers, continued
• Portions of floating-point numbers are designated as

  – exponent: E
  – mantissa: d₀ d₁ · · · d_{p−1}
  – fraction: d₁ d₂ · · · d_{p−1}

• Sign, exponent, and mantissa are stored in separate


fixed-width fields of each floating-point word.

Typical Floating-Point Systems

Parameters for typical floating-point systems

    system           β    p     L        U
    IEEE SP          2    24    −126     127
    IEEE DP          2    53    −1022    1023
    Cray             2    48    −16383   16384
    HP calculator    10   12    −499     499
    IBM mainframe    16   6     −64      63

Most modern computers use binary (β = 2) arithmetic

IEEE floating-point systems are now almost universal in
digital computers


Normalization

Floating-point system is normalized if leading digit d0 is


always nonzero unless number represented is zero

In normalized systems, mantissa m of nonzero
floating-point number always satisfies 1 ≤ m < β

Reasons for normalization
  representation of each number unique
  no digits wasted on leading zeros
  leading bit need not be stored (in binary system)

Normalized Mantissa Examples

❑ Decimal:
  ❑ 1.2345
  ❑ 9.814 × 10⁻²

❑ Binary:
  ❑ 1.010 × 2⁻³   (the normalized form of the not-normalized .00101)
  ❑ 1.111 × 2⁴
Binary Representation of π

• In 64-bit floating point,

    π₆₄ ≈ 1.1001001000011111101101010100010001000010110100011 × 2¹

• In reality,

    π = 1.10010010000111111011010101000100010000101101000110000100011010 · · · × 2¹

• They will (potentially) differ in the 53rd bit...

    π − π₆₄ = 0.00000000000000000000000000000000000000000000000000000100011010 · · · × 2¹

Rounding Error

Properties of Floating-Point Systems

Floating-point number system is finite and discrete

Total number of normalized floating-point numbers is

    2 (β − 1) β^(p−1) (U − L + 1) + 1   ≈ 2^w,  where w = number of bits in a word

Smallest positive normalized number:  UFL = β^L

Largest floating-point number:  OFL = β^(U+1) (1 − β^(−p))

Floating-point numbers equally spaced only between
successive powers of β

Not all real numbers exactly representable; those that are
are called machine numbers

Example: Floating-Point System

Tick marks indicate all 25 numbers in floating-point system
having β = 2, p = 3, L = −1, and U = 1

    OFL = (1.11)₂ × 2¹  = (3.5)₁₀
    UFL = (1.00)₂ × 2⁻¹ = (0.5)₁₀

At sufficiently high magnification, all normalized
floating-point systems look grainy and unequally spaced

< interactive example >

Example: numbers.m
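In the spirit of the numbers.m demo (a MATLAB script; the Python below is an assumed stand-in, not the original), the 25 numbers of this toy system can be enumerated directly:

```python
# Enumerate the toy floating-point system: beta = 2, p = 3, L = -1, U = 1 (sketch).
vals = {0.0}
for E in range(-1, 2):                 # exponents L..U
    for frac in range(4):              # fraction bits d1 d2 in {00, 01, 10, 11}
        m = 1.0 + frac / 4.0           # normalized mantissa 1.d1d2
        for s in (-1.0, 1.0):
            vals.add(s * m * 2.0**E)

print(sorted(vals))                    # unequally spaced; gaps double at each power of 2
print(len(vals))                       # 25 numbers, counting zero
```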


Rounding Rules
If real number x is not exactly representable, then it is
approximated by “nearby” floating-point number fl(x)
This process is called rounding, and error introduced is
called rounding error
Two commonly used rounding rules
  chop: truncate base-β expansion of x after (p − 1)st digit;
  also called round toward zero
  round to nearest: fl(x) is nearest floating-point number to
  x, using floating-point number whose last stored digit is
  even in case of tie; also called round to even
Round to nearest is most accurate, and is default rounding
rule in IEEE systems
< interactive example >

Machine Precision

Accuracy of floating-point system characterized by unit
roundoff (or machine precision or machine epsilon)
denoted by ε_mach

With rounding by chopping,   ε_mach = β^(1−p)

With rounding to nearest,    ε_mach = (1/2) β^(1−p)

Alternative definition is smallest number ε such that
fl(1 + ε) > 1

Maximum relative error in representing real number x
within range of floating-point system is given by

    | (fl(x) − x) / x |  ≤  ε_mach

Rounded Numbers in Floating Point Representation

❑ The relationship on the preceding slide can be conveniently thought of as:

    fl(x) = x (1 + εₓ),    |εₓ| ≤ ε_mach

❑ The nice thing is the expression above has an equality, which is


easier to work with.
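Two quick ways to look at ε_mach for IEEE double precision (a sketch, not from the slides):

```python
# Machine epsilon, two views (sketch).
import numpy as np

print(np.finfo(np.float64).eps)        # 2**-52 ≈ 2.22e-16: gap between 1 and the next double

eps = 1.0                              # halve until 1 + eps is indistinguishable from 1
while 1.0 + eps > 1.0:
    eps /= 2
print(eps)                             # 2**-53 ≈ 1.11e-16: unit roundoff with round-to-nearest
```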

Machine Precision, continued

For toy system illustrated earlier

    ε_mach = (0.01)₂  = (0.25)₁₀    with rounding by chopping
    ε_mach = (0.001)₂ = (0.125)₁₀   with rounding to nearest

For IEEE floating-point systems

    ε_mach = 2⁻²⁴ ≈ 10⁻⁷    in single precision
    ε_mach = 2⁻⁵³ ≈ 10⁻¹⁶   in double precision

So IEEE single and double precision systems have about 7
and 16 decimal digits of precision, respectively

Advantage of Floating Point

❑ By sacrificing a few bits to the exponent, floating point allows us


to represent a Huge range of numbers….

❑ All numbers have same relative precision.


❑ The numbers are not uniformly spaced.
❑ Many more between 0 and 10 than between 10 and 100!
Relative Precision Example

Let’s look at the highlighted entry from the preceding slide.

x  = 3141592653589793238462643383279502884197169399375105820974944.9230781... = π × 10⁶⁰
x̂  = 3141592653589793000000000000000000000000000000000000000000000.0000000... ≈ π × 10⁶⁰

x − x̂ = 238462643383279502884197169399375105820974944.9230781... = 2.3846... × 10⁴⁴
      ≈ .7590501687441757 × 10⁻¹⁶ × x
      < 1.110223024625157e−16 × x
      ≈ ε_mach × x.

• The difference between x := π × 10⁶⁰ and x̂ := fl(π × 10⁶⁰) is large:

    x − x̂ ≈ 2.4 × 10⁴⁴.

• The relative error, however, is

    (x − x̂)/x ≈ 2.4 × 10⁴⁴ / (π × 10⁶⁰) ≈ 0.8 × 10⁻¹⁶ < ε_mach
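The same point can be made with NumPy's spacing function (a sketch, not from the slides): near 10⁶⁰ the gap between adjacent doubles is astronomically large in absolute terms, yet tiny relative to x.

```python
# Absolute vs. relative spacing of doubles near pi * 10^60 (sketch).
import numpy as np

x = np.pi * 1e60
gap = np.spacing(x)                    # distance from x to the next representable double
print(f"{gap:.3e}")                    # a few times 10^44
print(f"{gap / x:.3e}")                # ~2e-16, on the order of eps_mach
```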

Machine Precision, continued

Though both are “small,” unit roundoff ✏mach should not be


confused with underflow level UFL

Unit roundoff ✏mach is determined by number of digits in


mantissa of floating-point system, whereas underflow level
UFL is determined by number of digits in exponent field

In all practical floating-point systems,

    0 < UFL ≪ ε_mach ≪ OFL


Machine Precision, continued

    0 < UFL ≪ ε_mach ≪ OFL

• Despite the remarkably small magnitude of UFL, there is a
(superfluous) mechanism to stuff more numbers between UFL and 0.

• They are called denormalized numbers (leading bit ≠ 1).

• Most (fast) floating-point hardware does not support them and
operations with them are thousands of times slower than without.

• Practically, any denormalized number ≈ 0, so best not used.

Summary of Ranges for IEEE Double Precision

    p = 53       ε_mach = 2^(−p) = 2⁻⁵³   ≈ 10⁻¹⁶
    L = −1022    UFL    = 2^L    = 2⁻¹⁰²² ≈ 10⁻³⁰⁸
    U = 1023     OFL    ≈ 2^U    = 2¹⁰²³  ≈ 10³⁰⁸

Q: How many atoms in the Universe?

Q: How many positive fp64 numbers < 1?

Subnormals and Gradual Underflow


Normalization causes gap around zero in floating-point
system
If leading digits are allowed to be zero, but only when
exponent is at its minimum value, then gap is “filled in” by
additional subnormal or denormalized floating-point
numbers

Subnormals extend range of magnitudes representable,


but have less precision than normalized numbers, and unit
roundoff is no smaller
Augmented system exhibits gradual underflow
Denormalizing: normal(ized) and subnormal numbers

❑ With normalization, the smallest (positive) number you can represent is:

  ❑ UFL = 1.00000… × 2^L = 1. × 2⁻¹⁰²² ≈ 10⁻³⁰⁸

❑ With subnormal numbers you can represent:

  ❑ x = 0.00000…01 × 2^L = 1. × 2⁻¹⁰²²⁻⁵² ≈ 10⁻³²⁴

❑ Q: Would you want to denormalize??

  ❑ Cost: Often, subnormal arithmetic handled in software → slooooow (1000X).

  ❑ Number of atoms in universe: ~ 10⁸⁰
  ❑ Probably, UFL is small enough.

❑ Similarly, for IEEE DP, OFL ~ 10³⁰⁸ ≫ number of atoms in universe.
  → Overflow will never be an issue (unless your solution goes unstable).
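For reference, ε_mach, UFL, and OFL for IEEE double precision can be read straight from NumPy (a sketch, not from the slides):

```python
# IEEE double-precision parameters as reported by NumPy (sketch).
import numpy as np

info = np.finfo(np.float64)
print(info.eps)                        # ≈ 2.22e-16
print(info.tiny)                       # ≈ 2.23e-308  (UFL, smallest positive normalized number)
print(info.max)                        # ≈ 1.80e+308  (OFL)
```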

Exceptional Values

IEEE floating-point standard provides special values to


indicate two exceptional situations
Inf, which stands for “infinity,” results from dividing a finite
number by zero, such as 1/0
NaN, which stands for “not a number,” results from
undefined or indeterminate operations such as 0/0, 0 * Inf,
or Inf/Inf

Inf and NaN are implemented in IEEE arithmetic through


special reserved values of exponent field

• Note: 0 is also a special number --- it is not normalized.


Floating-Point Arithmetic

Addition or subtraction : Shifting of mantissa to make


exponents match may cause loss of some digits of smaller
number, possibly all of them

Multiplication : Product of two p-digit mantissas contains up


to 2p digits, so result may not be representable

Division : Quotient of two p-digit mantissas may contain


more than p digits, such as nonterminating binary
expansion of 1/10

Result of floating-point arithmetic operation may differ from


result of corresponding real arithmetic operation on same
operands


Example: Floating-Point Arithmetic

Assume β = 10, p = 6

Let x = 1.92403 × 10², y = 6.35782 × 10⁻¹

Floating-point addition gives x + y = 1.93039 × 10²,
assuming rounding to nearest

Last two digits of y do not affect result, and with even
smaller exponent, y could have had no effect on result

Floating-point multiplication gives x * y = 1.22326 × 10²,
which discards half of digits of true product
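This example is easy to reproduce with Python's decimal module, which lets us work in β = 10 with a chosen precision (a sketch, not from the slides):

```python
# The beta = 10, p = 6 example via the decimal module (sketch).
from decimal import Decimal, getcontext

getcontext().prec = 6                  # six significant decimal digits
x = Decimal("1.92403E+2")
y = Decimal("6.35782E-1")

print(x + y)                           # 193.039  (last two digits of y had no effect)
print(x * y)                           # 122.326  (half the digits of the true product discarded)
```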


Floating-Point Arithmetic, continued

Real result may also fail to be representable because its


exponent is beyond available range

Overflow is usually more serious than underflow because


there is no good approximation to arbitrarily large
magnitudes in floating-point system, whereas zero is often
reasonable approximation for arbitrarily small magnitudes

On many computer systems overflow is fatal, but an


underflow may be silently set to zero


Example: Summing Series

Infinite series

    Σ_{n=1}^{∞}  1/n

has finite sum in floating-point arithmetic even though real
series is divergent

Possible explanations

  Partial sum eventually overflows

  1/n eventually underflows

  Partial sum ceases to change once 1/n becomes negligible
  relative to partial sum:

      1/n  <  ε_mach · Σ_{k=1}^{n−1} 1/k

Q: How long would it take to realize failure?

< interactive example >
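A sketch (not from the slides) of the third explanation, done in single precision where the stagnation is reachable in a few seconds; in double precision the same thing happens, but only after roughly 10^14 terms.

```python
# Partial sums of the harmonic series stop changing in float32 (sketch; takes a few seconds).
import numpy as np

s, n = np.float32(0.0), 0
while True:
    n += 1
    s_new = s + np.float32(1.0) / np.float32(n)
    if s_new == s:                     # 1/n is now negligible relative to the partial sum
        break
    s = s_new

print(n, s)                            # roughly n ≈ 2e6, s ≈ 15.4
```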
Standard Model for Floating Point Arithmetic

• Ideally, for x, y ∈ F,  x flop y = fl(x op y), with op = +, −, /, *.

• This standard is met by IEEE.

• Analysis is streamlined using the Standard Model:

    fl(x op y) = (x op y)(1 + δ),   |δ| ≤ ε_mach,

which is more conveniently analyzed by backward error analysis.

• For example, with op = +,

    fl(x + y) = (x + y)(1 + δ) = x(1 + δ) + y(1 + δ).

• With this type of analysis, we can examine, say, floating-point
multiplication:

    x(1 + δx) · y(1 + δy) = x · y (1 + δx + δy + δx δy) ≈ x · y (1 + δx + δy)


Floating-Point Arithmetic, continued

Ideally, x flop y = fl(x op y), i.e., floating-point arithmetic


operations produce correctly rounded results

Computers satisfying IEEE floating-point standard achieve


this ideal as long as x op y is within range of floating-point
system

But some familiar laws of real arithmetic are not


necessarily valid in floating-point system

Floating-point addition and multiplication are commutative


but not associative

Example: if ε is positive floating-point number slightly
smaller than ε_mach, then (1 + ε) + ε = 1, but 1 + (ε + ε) > 1
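A two-line check of that example (a sketch, not from the slides):

```python
# Floating-point addition is not associative (sketch).
eps = 0.9 * 2.0**-53                   # positive, slightly smaller than eps_mach

print((1.0 + eps) + eps == 1.0)        # True: each addition rounds back to 1
print(1.0 + (eps + eps) > 1.0)         # True: eps + eps is large enough to survive rounding
```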


Cancellation

Subtraction between two p-digit numbers having same sign


and similar magnitudes yields result with fewer than p
digits, so it is usually exactly representable

Reason is that leading digits of two numbers cancel (i.e.,


their difference is zero)

For example,

    1.92403 × 10²  −  1.92275 × 10²  =  1.28000 × 10⁻¹

which is correct, and exactly representable, but has only


three significant digits

Cancellation Example

❑ Cancellation leads to promotion of garbage into “significant” digits

    x     = 1 . 0 1 1 0 0 1 0 1 b   b   g g g g   × 2^e
    y     = 1 . 0 1 1 0 0 1 0 1 b'  b'  g g g g   × 2^e
    x − y = 0 . 0 0 0 0 0 0 0 0 b'' b'' g g g g   × 2^e
          = b'' . b'' g g g g ? ? ? ? ? ? ? ? ?    × 2^(e−9)

Cancellation, continued

Despite exactness of result, cancellation often implies


serious loss of information

Operands are often uncertain due to rounding or other


previous errors, so relative uncertainty in difference may be
large

Example: if ε is positive floating-point number slightly
smaller than ε_mach, then (1 + ε) − (1 − ε) = 1 − 1 = 0 in
floating-point arithmetic, which is correct for actual
operands of final subtraction, but true result of overall
computation, 2ε, has been completely lost

Subtraction itself is not at fault: it merely signals loss of


information that had already occurred

Cancellation, continued

• Of the basic operations + − * /, with arguments of the same sign,
only subtraction has cond. number significantly different from
unity. Division, multiplication, addition (same sign) are OK.

Cancellation, continued

Digits lost to cancellation are most significant, leading


digits, whereas digits lost in rounding are least significant,
trailing digits

Because of this effect, it is generally bad idea to compute


any small quantity as difference of large quantities, since
rounding error is likely to dominate result

For example, summing alternating series, such as

    e^x = 1 + x + x²/2! + x³/3! + · · ·

for x < 0, may give disastrous results due to catastrophic
cancellation
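A sketch (not from the slides) of that failure for x = −20, together with the standard fix of summing the series for |x| and taking the reciprocal:

```python
# Taylor series of e^x summed term by term (sketch).
import math

def exp_series(x, n_terms=150):
    s, term = 1.0, 1.0
    for k in range(1, n_terms):
        term *= x / k                  # next term x^k / k!
        s += term
    return s

x = -20.0
print(exp_series(x))                   # badly wrong: intermediate terms reach ~4e7,
                                       # so rounding errors swamp the ~2e-9 answer
print(1.0 / exp_series(-x))            # accurate: no cancellation for positive argument
print(math.exp(x))                     # reference ≈ 2.06e-9
```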


Example: Quadratic Formula

Two solutions of quadratic equation a x² + b x + c = 0 are
given by

    x = [ −b ± √(b² − 4ac) ] / (2a)

Naive use of formula can suffer overflow, or underflow, or
severe cancellation

Rescaling coefficients avoids overflow or harmful underflow

Cancellation between −b and square root can be avoided
by computing one root using alternative formula

    x = 2c / [ −b ∓ √(b² − 4ac) ]

Cancellation inside square root cannot be easily avoided
without using higher precision

< interactive example >
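A sketch (not from the slides) of both formulas on a case with b² ≫ 4ac:

```python
# Quadratic formula: naive vs. cancellation-free form for the small root (sketch).
import math

a, b, c = 1.0, -1.0e8, 1.0             # roots ≈ 1e8 and 1e-8
d = math.sqrt(b*b - 4*a*c)

x1       = (-b + d) / (2*a)            # large root: no cancellation
x2_naive = (-b - d) / (2*a)            # small root: -b and d nearly cancel
x2_alt   = (2*c) / (-b + d)            # alternative formula for the same root

print(x1)                              # ≈ 1.0e8
print(x2_naive)                        # ≈ 7.45e-9  (barely one correct digit)
print(x2_alt)                          # ≈ 1.0e-8   (correct)
```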

Example: Standard Deviation

Mean and standard deviation of sequence xᵢ, i = 1, . . . , n,
are given by

    x̄ = (1/n) Σ_{i=1}^{n} xᵢ    and    σ = [ (1/(n−1)) Σ_{i=1}^{n} (xᵢ − x̄)² ]^(1/2)

Mathematically equivalent formula

    σ = [ (1/(n−1)) ( Σ_{i=1}^{n} xᵢ²  −  n x̄² ) ]^(1/2)

avoids making two passes through data

Single cancellation at end of one-pass formula is more
damaging numerically than all cancellations in two-pass
formula combined
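A sketch (not from the slides) of the two formulas on data with a large mean and a small spread:

```python
# Two-pass vs. one-pass standard deviation (sketch).
import math

n = 1_000_000
xs = [1.0e8 + (i % 2) for i in range(n)]          # values 1e8 and 1e8 + 1, std ≈ 0.5

mean = sum(xs) / n
two_pass = math.sqrt(sum((x - mean)**2 for x in xs) / (n - 1))

one_pass_arg = (sum(x*x for x in xs) - n*mean*mean) / (n - 1)   # difference of huge numbers
one_pass = math.sqrt(max(one_pass_arg, 0.0))      # can even come out negative

print(two_pass)                        # ≈ 0.5000...
print(one_pass)                        # garbage (often 0): the final cancellation destroys it
```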
Finite Differences and Truncation/Round-Off Error

• Taylor Series:¹

      Δf/Δx := [f(x + h) − f(x)] / h  =  f'(x)  +  (h/2) f''(ξ)
                 (computable)           (desired result + truncation error)

• So, expect  e_abs := |Δf/Δx − df/dx| = O(h) → 0  as h → 0.

• Computed value of Δf/Δx, using standard model,

      ⟨Δf/Δx⟩ = [ ⟨f(x + h)⟩ − ⟨f(x)⟩ ] / h
              = [ f(x + h)(1 + ε₊) − f(x)(1 + ε₀) ] / h
              = [ f(x + h) − f(x) ] / h  +  [ f(x + h) ε₊ − f(x) ε₀ ] / h,

  – ε₊, ε₀ random variables of arbitrary sign with magnitude ≤ ε_M.
  – This computed expression ignores round-off in x, h, x + h and division.
  – Can show (by Taylor series expansions) that those errors are benign.

• Use f(x + h) = f(x) + h f'(ξ) to write the first round-off term as:²

      f(x + h) ε₊ / h = [ f(x) + h f'(ξ) ] ε₊ / h = f(x) ε₊ / h + f'(ξ) ε₊ ≈ f(x) ε₊ / h.

• Round-off error is thus

      [ f(x + h) ε₊ − f(x) ε₀ ] / h  ≈  f(x) (ε₊ − ε₀) / h  =  f(x) (ε_M / h) R,

  where R is a random variable with |R| < 2.

• So, computed approximation is

      ⟨Δf/Δx⟩ = Δf/Δx + (ε₊ − ε₀)/h · f(x) + O(ε₊)
              ≈ f'(x) + (h/2) f''(ξ)  +  R (ε_M / h) f(x),
                        [truncation error TE]   [round-off error RE]

• Notice, crucially, that the units of each term on the right match.

• This is a good way to check your work.

¹ Assuming f, f' and f'' bounded on [x, x + h].
² The unknown value of ξ is different from the one in the original Taylor series expansion.
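Finally, a sketch (not from these notes) comparing the measured error of the one-sided difference against the TE and RE terms derived above, for f = sin and x = 1:

```python
# Measured forward-difference error vs. the TE + RE model (sketch).
import numpy as np

x = 1.0
eps_m = np.finfo(float).eps

for h in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12]:
    measured = abs((np.sin(x + h) - np.sin(x)) / h - np.cos(x))
    te = 0.5 * h * abs(np.sin(x))                 # (h/2)|f''(x)|, since f'' = -sin
    re = 2.0 * eps_m / h * abs(np.sin(x))         # (eps_M/h)|f(x)| with the |R| < 2 bound
    print(f"h={h:.0e}  measured={measured:.2e}  TE={te:.2e}  RE_bound={re:.2e}")
```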