Chapter 10 Re-Expressing Data Notes


AP Statistics Chapter 10: Re-expressing Data

Chapter 10
Re-expressing Data: Get it Straight!
It’s easier than you think!

• Don’t assume that some re-expression will always work.
• We don’t need a perfect model, but a useful one!


Keep In Mind…

Benefits
• When we re-express for one reason, we often end up helping other aspects.
• Logarithms straighten out the exponential trend and pull in the long right tail in the histogram.
• Helps deal with potential “infinite” quantities.
• Leads to simpler models.

Why not just use a curve?
• The mathematics and calculations are more difficult.
• Straight lines are easy to understand.
• We know how to think about the slope and y-intercept.

Reading Ch. 10 Quiz
5 min (10 points)
1. Name two situations/reasons we would want to consider re-expressing a data set.
2. What is the Ladder of Powers?
3. What is one often reliable method to re-express data and make it more linear?
4. Why isn’t it better to simply use a curve to model data?
5. Name one of the benefits of re-expressing data.

Re-expressing Data
• Re-expression is another name for changing the scale of (transforming) the data.
• It’s not cheating!
• We do this on a daily basis.
• Ex: bike speed vs. running speed
  • Bike speed: speed = 15 mph = distance / time
  • Running speed: speed = 6 min. in one mile = time / distance
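
The bike and running speeds above are the same kind of quantity on reciprocal scales. As a small sketch (the function names are illustrative, not from the notes), the conversion between miles per hour and minutes per mile is a reciprocal re-expression:

```python
def mph_to_min_per_mile(mph: float) -> float:
    """Reciprocal re-expression: distance/time (mph) -> time/distance (min per mile)."""
    return 60.0 / mph


def min_per_mile_to_mph(pace: float) -> float:
    """The same reciprocal, going back from min per mile to mph."""
    return 60.0 / pace


bike_pace = mph_to_min_per_mile(15.0)  # the bike speed expressed as a pace
run_speed = min_per_mile_to_mph(6.0)   # the running pace expressed as a speed
print(f"Bike: {bike_pace} min per mile, Run: {run_speed} mph")
```

Once both are on the same scale, the two speeds can be compared directly.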

RNBriones Concord High



Re-expressing Data
• Re-expressed variables are common in scientific and social laws and models. Logs, reciprocals, roots, and inverse squares show up in physics, chemistry, psychology, and economics.
• Note the difference between creating a model and the wisdom of using it. Here, we have to create the model and then check to see if we should have. We need the residuals to decide whether the model is appropriate, but we need the model to find the residuals.

Re-expressing Data
• Make sure that a re-expression can be meaningful.
• Once we re-express, decide if the model is appropriate:
  • Create a model.
  • Plot the residuals. If there’s a curve, build another model.
  • Once we find a model that has random, unstructured residuals, interpret and use it.
• When the appropriate model is found, then:
  • Ask how strong the model is.
  • Look at the pattern.
  • R²: when interpreting, keep in mind that it still describes variability, but it is variability in the re-expressed variables and NOT the original.

Re-expressing Data
• Correlation is the strength of a linear association, so discuss “r” only if the re-expression makes the relationship linear.
• Residual plots are a signal-and-noise issue. A scatterplot shows the mixture of
  • signal (the underlying association between the variables)
  • noise (the random variation unaccounted for by the association)

Example:
• In a scatterplot of height and weight, we know that taller people generally weigh more; that’s the signal. But not all people who are 6 feet tall are the same weight. That variation is the noise. We assume this variation is random. We seek a regression model that describes the signal, the underlying relationship between height and weight.

Re-expressing Data
• The residual plot shows us the variation that remains undescribed by the model. If the plot appears to be random (just noise), we know we have captured the whole signal. If, however, there remains a curve in the residual plot, then we know we missed some of the signal. The model does not tell the whole story, so you have to look for a better model.

Re-expressing Data
Once we have found a model that is appropriate, ask how strong it is.
• Look at the size of the residuals.
  • This can be misleading when using re-expressions: the actual size of the residuals is difficult to interpret, so we care more about the pattern.
• Be careful about interpreting R².
  • Note that it describes the model’s effectiveness in accounting for the variability in the re-expressed variables, not the original.
• Correlation measures the strength of a linear association. We can only talk about r if we find a re-expression that makes the relationship linear.

Re-expressing Data
• To write the correct equation for your model:
  • Pay careful attention to the re-expression used.
  • Just knowing that the coefficients of the linear model are 1.2 and 0.55 is NOT enough. If you use a logarithmic re-expression, the correct model is not just ŷ = 1.2 + 0.55x; it’s log(ŷ) = 1.2 + 0.55x.
  • Need to know that the model represents exponential growth.
• Must be able to make predictions from the equation.
  • Here, start with a value of x = 2 and find log(ŷ) = 1.2 + 0.55(2) = 2.3.
  • Now “backsolve” to get ŷ = 10^2.3 = 199.526 ≈ 200.
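
The backsolving step can be checked with a short sketch. The coefficients 1.2 and 0.55 and the base-10 logarithm come from the example above; the function name is only illustrative:

```python
def predict_original_units(x: float, intercept: float = 1.2, slope: float = 0.55) -> float:
    """Predict y-hat from the model log10(y-hat) = intercept + slope * x."""
    log_y_hat = intercept + slope * x  # prediction on the re-expressed (log) scale
    return 10 ** log_y_hat             # undo the log to return to original units


y_hat = predict_original_units(2)
print(round(y_hat, 3))
```

Forgetting the last line of the function (reporting 2.3 instead of about 200) is the mistake the slide warns about.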


Equivalent Models

Type of Model   Re-expression Equation   Calculator’s Command   Curve Equation
Logarithmic     y = a + b log x          LnReg                  y = a + b ln x
Exponential     log y = a + bx           ExpReg                 y = ab^x
Power           log y = a + b log x      PwrReg                 y = ax^b

[Figures: exponential function, logarithmic function, power function]

Equivalent Models

Type of Model   Model Equation    Transformation    Re-expression Equation
Logarithmic     ŷ = a + b ln x    (log x, y)        ŷ = a + b log x
Exponential     ŷ = ab^x          (x, log y)        log ŷ = a + bx
Power           ŷ = ax^b          (log x, log y)    log ŷ = a + b log x

Straight to the Point
• We cannot use a linear model unless the relationship between the two variables is linear. Often re-expression can save the day, straightening bent relationships so that we can fit and use a simple linear model.
• If the relationship is nonlinear (which we can verify by examining the residual plot), we can try re-expressing the data.
• To re-express the data, we perform some mathematical operation on the data values, such as taking the reciprocal, taking the logarithm, or taking the square root. Two simple ways to re-express data are with logarithms and reciprocals.
• Re-expressions (change of units, change of scale) can be seen in everyday life; everybody does it.

For example, consider the relationship between the weight of cars (in pounds) and their fuel efficiency (miles per gallon).

[Scatterplot: looks fairly linear at first]

What do the scatterplot and residual plots reveal?
• A look at the residuals plot shows a problem, a curved pattern; therefore, a linear model is not appropriate.

If we take the reciprocal of the y-values (as gallons per hundred miles), we get the following scatterplot and residual plot and eliminate the bend in the original scatterplot.

What do these plots reveal?
• That the relationship between weight and gal/100 mi (reciprocal of mpg) is linear.
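
As a minimal sketch of the reciprocal re-expression used here, miles per gallon can be converted to gallons per hundred miles as 100 / mpg. The mpg values below are hypothetical round numbers chosen for illustration; they are not the car data behind the plots.

```python
def mpg_to_gal_per_100mi(mpg: float) -> float:
    """Reciprocal re-expression of fuel efficiency, scaled to 100 miles."""
    return 100.0 / mpg


# Hypothetical fuel efficiencies (NOT the data set from the notes).
example_mpg = [20.0, 25.0, 40.0]
example_gal = [mpg_to_gal_per_100mi(m) for m in example_mpg]
print(example_gal)
```

Note that the order reverses: the most efficient car has the largest mpg but the smallest gal/100 mi.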


Goals of Re-expression
There are several reasons we may want to re-express our data:
1) To make the distribution of a variable more symmetric.
2) To make the spreads of several groups more alike.
3) To make the form of a scatterplot more linear.
4) To make the scatter in a scatterplot more evenly spread.

Goals of Re-expression
• Goal 1: Make the distribution of a variable (as seen in its histogram, for example) more symmetric.
  • The skewed distribution is made much more nearly symmetric by taking logs.

Goals of Re-expression
• Goal 2: Make the spread of several groups (as seen in side-by-side boxplots) more alike (not following a fan shape), even if their centers differ.
  • Taking logs makes the individual boxplots more symmetric and gives them spreads that are more nearly equal.

Goals of Re-expression
• Goal 3: Make the form of a scatterplot more nearly linear.
  • The greater value of re-expression to straighten a relationship is that we can fit a linear model once the relationship is straight. This makes the relationship easier to describe: we can use a linear model and all that goes with it.

Goals of Re-expression
• Goal 4: Make the scatter in a scatterplot spread out evenly rather than thickening at one end.
  • This can be seen in the two scatterplots we just saw with Goal 3.
  • Groups that share a common spread are easier to compare.

Goals of Re-expression
• REMEMBER: The model won’t be perfect, but the re-expression can lead us to a useful model.
• You should recognize when the pattern of the data indicates that no re-expression can improve the structure of the data.
• You should know how to re-express data with powers and how to find an effective re-expression for your data using the calculator.
• You should be able to reverse any of the common re-expressions to put a predicted value or residual back into original units.


Goals of Re-expression
• REMEMBER: The model won’t be perfect, but the re-expression can lead us to a useful model.
• You should be able to describe a summary or display of a re-expressed variable, clearly indicate how it was re-expressed, and give its re-expressed units.
• You should be able to describe a regression model fit to re-expressed data in terms of the re-expressed variables.

The Ladder of Powers
• There is a family of simple re-expressions that move data toward our goals in a consistent way. This collection of re-expressions is called the Ladder of Powers.
• The Ladder of Powers orders the effects that the re-expressions have on data.
• Members of the family line up in order.
  • The farther you move away from the original data (the “1” position), the greater the effect on the data.
• This fact allows you to search systematically for a re-expression that works, either stepping farther from “1” or taking a step back towards “1” as you see the results.

The Ladder of Powers

Power: 2
• Re-expression: y²
• Comment: Use on left-skewed data.

Power: 1
• Re-expression: y
• Comment: This is the raw data. No re-expression. Do not re-express the data if they are already well-behaved.

Power: 1/2
• Re-expression: √y
• Comment: Use on count data or when scatter in a scatterplot tends to increase as the explanatory variable increases.

Power: “0”
• Re-expression: log(y)
• Comment: Not really the “0” power. Use on right-skewed data. Measurements cannot be negative or zero. Good for values that grow by a percentage. When in doubt, start here!

Power: -1/2, -1
• Re-expression: -1/√y, -1/y
• Comment: Use on right-skewed data. Measurements cannot be negative or zero. Use on ratios.

NOTE: The text lists very specific situations for which each of these might be an appropriate transformation, but we are not bound by these guidelines, as the ultimate goal is to find a transformation that works!
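
The rungs listed above can be collected into one helper. This is a sketch under stated assumptions: the function name is made up for illustration, base-10 logs stand in for the “0” rung, and the negative rungs carry a minus sign so that larger raw values still map to larger re-expressed values.

```python
import math


def ladder_of_powers(y: float, power: float) -> float:
    """Re-express one data value at the given rung of the Ladder of Powers."""
    if power == 0:
        # The "0" rung is the logarithm, which needs strictly positive data.
        if y <= 0:
            raise ValueError("log re-expression needs y > 0")
        return math.log10(y)
    if power < 0:
        # Negative rungs: -1/sqrt(y) for -1/2, -1/y for -1.
        if y <= 0:
            raise ValueError("negative powers need y > 0")
        return -(y ** power)
    if y < 0 and power != int(power):
        # Fractional powers such as 1/2 are undefined for negative values.
        raise ValueError("fractional powers need y >= 0")
    return y ** power


for rung in (2, 1, 0.5, 0, -0.5, -1):
    print(rung, ladder_of_powers(4.0, rung))
```

The guards mirror the comments on the slide: logs and negative powers are only available when measurements cannot be negative or zero.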


The Ladder of Powers

Power   Name                          Comment
2       Square of data values         Try with unimodal distributions that are skewed to the left.
1       Raw data                      Data with positive and negative values and no bounds are less likely to benefit from re-expression.
½       Square root of data values    Counts often benefit from a square root re-expression.
“0”     We’ll use logarithms here     Measurements that cannot be negative (salaries, population) often benefit from a log re-expression.
–½      Reciprocal square root        An uncommon re-expression, but sometimes useful.
–1      The reciprocal of the data    Ratios of two quantities (e.g., mph) often benefit from a reciprocal.

(x, y): consider the population growth in the US.
• We scale the years as we enter the data. We could use 1, 2, 3, 4, … or 0, 25, 50, … (Caution! Be careful using 0 or negative numbers as data values. Taking logs of 0 or negative values can make some points “go missing” and just disappear quietly from the analysis.)
• We begin with the scatterplot and see a clear curve, concave upward.
• The association between year (measured in years since 1800) and U.S. population (in millions) is strong, positive, and curved. We cannot use a regression line to model this relationship without re-expressing the data first.

(x, log y): consider the population growth in the US.
• Now we use our Ladder of Powers. First we’ll try the zero power, the logarithm of the population. We start there because we suspect that population might increase by a roughly equal percentage each year (and hence that growth is exponential), or simply because it’s a good place to start if we’re not sure what to do.
• The change in the new scatterplot is dramatic. The scatterplot of log(population) and year still has a curve, and it now bends the wrong way. We have gone too far on the Ladder of Powers.
• This is a clear indication that we have gone too far on the ladder and should retreat toward the original data (the “1” rung). That suggests the 1/2 power, so we find the square roots of the populations and plot them against the years.
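
The “too far, step back” search can be imitated numerically. The sketch below uses SYNTHETIC data (an assumption for illustration, not the U.S. census values), built so that the square root is exactly linear in the year. The helper name and the use of second differences as a stand-in for eyeballing the bend are also illustrative choices.

```python
import math

# SYNTHETIC data: years since 1800 and a made-up "population" whose
# square root is exactly linear in the year. Not the real census values.
years = [10.0 * k for k in range(21)]
population = [(2.0 + 0.07 * t) ** 2 for t in years]


def mean_second_difference(values):
    """Average bend of equally spaced values: > 0 curves upward, < 0 curves downward."""
    diffs = [values[i + 1] - 2 * values[i] + values[i - 1]
             for i in range(1, len(values) - 1)]
    return sum(diffs) / len(diffs)


bend_raw = mean_second_difference(population)                           # power 1
bend_log = mean_second_difference([math.log10(p) for p in population])  # power "0"
bend_sqrt = mean_second_difference([math.sqrt(p) for p in population])  # power 1/2

print(f"raw: {bend_raw:+.4f}  log: {bend_log:+.6f}  sqrt: {bend_sqrt:+.1e}")
```

A positive bend for the raw data, a negative bend after taking logs, and an essentially zero bend for the square root match the story above: the log overshoots, and the 1/2 rung lies between the two. With real data the bend never vanishes exactly, so the residual plot remains the final check.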

(x, √y): consider the population growth in the US.
• The scatterplot of sqrt(population) and year is still a bit curved, but straight enough to fit a line.
• The model has a high value of R², and, although the residuals plot shows some pattern, the residuals are all very small. This is a good model.

During a science lab, students heated water, allowed it to cool, and recorded the temperature over time. They computed the difference between the water temperature and the room temperature. The results are in the table.
1) Sketch a scatterplot.
2) Newton’s Law of Cooling suggests an exponential function is appropriate. Re-express the data using logarithms and sketch a new scatterplot.
3) Write the equation of the least-squares regression line for the transformed data. Draw the regression line on the scatterplot in question 2.
4) Use the equation to predict the difference in temperature after 45 minutes.
5) Use the equation to predict the difference in temperature at time 0 minutes. What does this value represent?


1) Sketch a scatterplot.

2) Newton’s Law of Cooling suggests an exponential function is appropriate. Re-express the data using logarithms and sketch a new scatterplot.
   Transformation: (x, log y)

3) Write the equation of the least-squares regression line for the transformed data. Draw the regression line on the scatterplot in question 2.

4) Use the equation to predict the difference in temperature after 45 minutes.
   To undo the logarithm: if log y = x, then 10^x = y.

5) Use the equation to predict the difference in temperature at time 0 minutes. What does this value represent?
   • This represents the model’s prediction of the difference in the temperature at the beginning of the experiment.

• The model predicts the mortgage amounts in 1990, 1995, and 2000 to be $42.4 million, $99.3 million, and $233.0 million, respectively. These predictions are all much higher than the actual amounts.
• The model is not valid for these years.

Re-expressing Data Using Logarithms
• An equation of the form y = a + bx is used to model linear data.
• The process of transforming nonlinear data into linear data is called linearization.
• In order to linearize certain types of data we use properties of logarithms.

[Figure labels: 2,136 cp; 14.8 cp; 2.36 cp]


• PROBLEM: We cannot use least-squares regression for the nonlinear data because least-squares regression depends upon correlation, which only measures the strength of linear relationships.
• SOLUTION: We transform the nonlinear data into linear data, and then use least-squares regression to determine the best fitting line for the transformed data.
• Finally, do a reverse transformation to turn the linear equation back into a nonlinear equation which will model our original nonlinear data.

Linearizing Exponential Functions:
(We want to write an exponential function of the form y = a · b^x as a function of the form y = a + bx.)

y = a · b^x   (x, y are variables and a, b are constants)
log y = log(a · b^x)
log y = log a + x log b

• This is in the general form y = a + bx, which is linear, with log y in place of y, log a as the intercept, and log b as the slope.
• So, the graph of (var1, var2) is linear. This means the graph of (x, log y) is linear.

CONCLUSIONS:
• If the graph of log y vs. x is linear, then the graph of y vs. x is exponential.
• If the graph of y vs. x is exponential, then the graph of log y vs. x is linear.
• Once we have linearized our data, we can use least-squares regression on the transformed data to find the best fitting linear model.

PROPERTIES OF LOGARITHMS:
• 1) log(AB) = log A + log B
• 2) log(A/B) = log A - log B
• 3) log x^p = p log x

Plan B: Attack of the Logarithms
• When none of the data values is zero or negative, logarithms can be a helpful ally in the search for a useful model.
• Try taking the logs of both the x- and y-variables.
• Then re-express the data using some combination of x or log(x) vs. y or log(y).


Let’s Try It! (p. 233)

Shutter speed   1/1000   1/500   1/250   1/125   1/60   1/30   1/15   1/8
f/stop          2.8      4       5.6     8       11     16     22     32

• Shutter speed and f/stop of the lens
  • L1: shutter speed
  • L2: f/stop
• Curved stat plot
  • Try logarithms
  • Take log of L1→L3
  • Take log of L2→L4

Let’s Try It! (p. 233)
• Scatterplot #1: Xlist→L3, Ylist→L2
• Scatterplot #2: Xlist→L1, Ylist→L4
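
The calculator steps for this exercise (store the data in L1 and L2, take logs into L3 and L4, then fit a line to the logged lists) can be mirrored in a few lines of Python. This is a sketch rather than the calculator’s own routine: `fit_line` is a hand-written least-squares helper, and base-10 logs are assumed.

```python
import math

shutter_speed = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]  # L1
f_stop = [2.8, 4, 5.6, 8, 11, 16, 22, 32]                              # L2

log_speed = [math.log10(s) for s in shutter_speed]  # L3 = log(L1)
log_fstop = [math.log10(f) for f in f_stop]          # L4 = log(L2)


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for the line ys = a + b * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(log_speed, log_fstop)
print(f"log(f/stop) = {intercept:.2f} + {slope:.3f} log(speed)")
```

A slope near 1/2 on the log-log scale says that f/stop grows roughly like the square root of the shutter speed.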

Let’s Try It! (p. 233)
• Use Scatterplot #3: LinReg L3, L4

log(f/stop) = 1.94 + 0.497 log(speed)

PRACTICE:
• Linearize the Case 1 data and find the least-squares regression line for the transformed data.

x (mos.)   0     48       96       144      192      240
y ($)      100   161.22   259.93   419.06   675.62   1089.30
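
For the Case 1 practice data (x in months, y in dollars), one possible way to carry out the linearization and the reverse transformation is sketched below. The helper `fit_line` and the variable names are illustrative, and base-10 logs are assumed; a calculator’s exponential regression may differ in the last decimals.

```python
import math

months = [0, 48, 96, 144, 192, 240]                        # x (mos.)
dollars = [100, 161.22, 259.93, 419.06, 675.62, 1089.30]   # y ($)

log_dollars = [math.log10(y) for y in dollars]  # re-express y only: (x, log y)


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for the line ys = a + b * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Linear model on the transformed data: log(y-hat) = log_a + log_b * x
log_a, log_b = fit_line(months, log_dollars)

# Reverse transformation: y-hat = a * b**x
a = 10 ** log_a
b = 10 ** log_b
print(f"log(y) = {log_a:.4f} + {log_b:.6f} x   ->   y = {a:.2f} * {b:.4f}^x")
```

Reading the result as growth from about $100 at roughly 1% per month is an interpretation of the fitted constants, not something stated in the notes.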

PRACTICE:
• Then, do a reverse transformation to turn the linear equation back into an exponential equation.
• Compare this to the equation the calculator gives when performing exponential regression on the Case 1 data.

Linearizing Power Functions:
(We want to write a power function of the form y = ax^b as a function of the form y = a + bx.)

y = ax^b   (x, y are variables and a, b are constants)
log y = log(ax^b)
log y = log a + b log x

• This is in the general form y = a + bx, which is linear, with log y in place of y, log x in place of x, log a as the intercept, and b as the slope.
• So, the graph of (var1, var2) is linear. This means the graph of (log x, log y) is linear.


Multiple Benefits
• We often choose a re-expression for one reason and then discover that it has helped other aspects of an analysis.
• For example, a re-expression that makes a histogram more symmetric might also straighten a scatterplot or stabilize variance.
• A single re-expression may improve each of our goals at the same time.
• Re-expression certainly simplifies efforts to analyze and understand relationships.
• Simpler explanations and simpler models tend to give a true picture of the relationship.

Case 2: Consider the following set of nonlinear data representing the average length and weight at different ages for Atlantic Ocean rockfish:

x: age (years)      0   4    8     12    16    20
y: weight (grams)   0   48   192   432   768   1200

PRACTICE:
• Linearize the data for Case 2 and find the least-squares regression line for the transformed data.
• Then, do a reverse transformation to turn the linear equation back into a power equation.
• Compare this to the equation the calculator gives when performing power regression on the Case 2 data.
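
A sketch of the Case 2 linearization follows. One choice here is an assumption rather than part of the exercise: the (0, 0) point is dropped before taking logs, because log 0 is undefined. The helper `fit_line` and the variable names are illustrative, and base-10 logs are assumed.

```python
import math

ages = [0, 4, 8, 12, 16, 20]             # x: age (years)
weights = [0, 48, 192, 432, 768, 1200]   # y: weight (grams)

# ASSUMPTION: drop any pair containing 0, since log10(0) is undefined.
pairs = [(x, y) for x, y in zip(ages, weights) if x > 0 and y > 0]
log_x = [math.log10(x) for x, _ in pairs]
log_y = [math.log10(y) for _, y in pairs]


def fit_line(xs, ys):
    """Least-squares (intercept, slope) for the line ys = a + b * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Linear model on the transformed data: log(y-hat) = log_a + b * log(x)
log_a, b = fit_line(log_x, log_y)

# Reverse transformation: y-hat = a * x**b
a = 10 ** log_a
print(f"log(y) = {log_a:.4f} + {b:.4f} log(x)   ->   y = {a:.2f} * x^{b:.2f}")
```

The dropped point is not lost information in this case: a power model with a positive exponent passes through (0, 0) anyway.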

Why Not Just Use a Curve?
• If there’s a curve in the scatterplot, why not just fit a curve to the data?
• Benefits to linear approach:
  • Contextual meaning of slope and y-intercept
  • More advanced statistical methods for analyzing linear associations
• It is usually better to re-express the data to straighten the plot.

TI Tips
• Regressions that automatically and appropriately re-express the data: LnReg, ExpReg, PwrReg.

Why Not Just Use a Curve?
• The mathematics and calculations for “curves of best fit” are considerably more difficult than “lines of best fit.”
• Besides, straight lines are easy to understand.
  • We know how to think about the slope and the y-intercept.

What Can Go Wrong?
• Don’t expect your model to be perfect.
• Don’t stray too far from the ladder.
• Don’t choose a model based on R² alone.


What Can Go Wrong?
• Beware of multiple modes.
  • Re-expression cannot pull separate modes together.
• Watch out for scatterplots that turn around.
  • Re-expression can straighten many bent relationships, but not those that go up then down, or down then up.

What Can Go Wrong?
• Watch out for negative data values.
  • It’s impossible to re-express negative values by any power that is not a whole number on the Ladder of Powers or to re-express values that are zero for negative powers.
• Watch for data far from 1.
  • Data values that are all very far from 1 may not be much affected by re-expression unless the range is very large. If all the data values are large (e.g., years), consider subtracting a constant to bring them back near 1.
  • Re-expressing data with a range from 1 to 1000 is far more effective than re-expressing data with a range of 100,000 to 100,100.
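
The “far from 1” warning can be illustrated with the two ranges mentioned above. The sketch below uses one possible yardstick, chosen here as an assumption: the correlation between the values and their logs. A correlation of essentially 1 means the log acts like a straight-line rescaling and changes nothing about the shape.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


near = list(range(1, 1001))            # values from 1 to 1000
far = list(range(100_000, 100_101))    # values from 100,000 to 100,100
shifted = [v - 99_999 for v in far]    # subtract a constant: now 1 to 101

r_near = correlation(near, [math.log10(v) for v in near])
r_far = correlation(far, [math.log10(v) for v in far])
r_shifted = correlation(shifted, [math.log10(v) for v in shifted])

print(f"1 to 1000:            r = {r_near:.4f}")
print(f"100,000 to 100,100:   r = {r_far:.8f}")
print(f"after subtracting:    r = {r_shifted:.4f}")
```

Subtracting a constant first, as the slide suggests for data such as years, gives the log something to bend again.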

What have we learned?
• When the conditions for regression are not met, a simple re-expression of the data may help.
• A re-expression may make the:
  • Distribution of a variable more symmetric.
  • Spread across different groups more similar.
  • Form of a scatterplot straighter.
  • Scatter around the line in a scatterplot more consistent.

What have we learned? (cont.)
• Taking logs is often a good, simple starting point.
• To search further, the Ladder of Powers or the log-log approach can help us find a good re-expression.
• Our models won’t be perfect, but re-expression can lead us to a useful model.
