Chapter 10 Re-Expressing Data Notes
Re-expressing Data: Get it Straight!
The result may not be a perfect model, but a useful one!
When we re-express for one reason, we often end up helping other aspects.
Logarithms straighten out the exponential trend and pull in the long right tail in the histogram.
Mathematics and calculations are more difficult for curved relationships.
Straight lines are easy to understand.
We know how to think about the slope and y-intercept.
Re-expression helps deal with potential “infinite” quantities.
Re-expression leads to simpler models.
Re-expressed variables are common in scientific and social laws and models. Logs, reciprocals, roots, and inverse squares show up in physics, chemistry, psychology, and economics.
Note the difference between creating a model and the wisdom of using it. Here, we have to create the model and then check to see if we should have. We need the residuals to decide whether the model is appropriate, but we need the model to compute the residuals.
Make sure that a re-expression can be meaningful.
Once we re-express, decide if the model is appropriate:
Create a model.
Plot the residuals. If there’s a curve, build another model.
Once we find a model that has random, unstructured residuals, interpret and use it.
When the appropriate model is found, then:
Ask how strong the model is.
Look at the pattern.
$R^2$: when interpreting, keep in mind that it still describes variability accounted for by the model, but it is variability in the re-expressed variables and NOT the original.
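The create-then-check loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values are made up (they are not from these notes), the helper names `fit_line`, `residuals`, and `r_squared` are hypothetical, and base-10 logarithms are assumed.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(xs, ys, intercept, slope):
    """Observed value minus predicted value for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def r_squared(ys, resid):
    """Fraction of the variability in ys accounted for by the line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum(e ** 2 for e in resid)
    return 1 - ss_resid / ss_total

# Made-up data with roughly exponential growth (illustration only).
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [5.1, 7.4, 11.5, 16.6, 25.7, 37.5, 57.6]

# Step 1: fit a line to the original data and look at the residuals.
a_raw, b_raw = fit_line(xs, ys)
resid_raw = residuals(xs, ys, a_raw, b_raw)

# Step 2: the raw residuals bend (positive at the ends, negative in the
# middle), so re-express y as log10(y) and build another model.
log_ys = [math.log10(y) for y in ys]
a_log, b_log = fit_line(xs, log_ys)
resid_log = residuals(xs, log_ys, a_log, b_log)

print("raw residuals:", [round(e, 2) for e in resid_raw])
print("log residuals:", [round(e, 3) for e in resid_log])
# This R^2 describes variability in log10(y), not in y itself.
print("R^2 on the log scale:", round(r_squared(log_ys, resid_log), 4))
```

The printed lists stand in for the residual plots: the first shows a clear bend, the second does not.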
Correlation is the strength of a linear association, so discuss “r” only if the re-expression makes the relationship linear.
Residual plots are a signal-and-noise issue. A scatterplot shows the mixture of
signal (the underlying association between the variables) and
noise (the random variation unaccounted for by the association).
The residual plot shows us the variation that remains undescribed by the model. If the plot appears to be random (just noise), we know we have captured the whole signal. If, however, there remains a curve in the residual plot, then we know we missed some of the signal. The model does not tell the whole story, so you have to look for a better model.
Example:
In a scatterplot of height and weight, we know that taller people generally weigh more: that’s the signal. But not all people who are 6 feet tall are the same weight. That variation is the noise. We assume this variation is random. We seek a regression model that describes the signal, the underlying relationship between height and weight.
Once we have found a model that is appropriate, ask how strong it is.
Look at the size of the residuals.
This can be misleading when using re-expressions: it is difficult to interpret the actual size of the residuals, and we care more about the pattern.
Be careful about interpreting $R^2$.
Note that it describes the model’s effectiveness in accounting for the variability in the re-expressed variables, not the original.
Correlation measures the strength of a linear association, so we can only talk about r if we find a re-expression that makes the relationship linear.
To write the correct equation for your model:
Pay careful attention to the re-expression used.
Just knowing that the coefficients of the linear model are 1.2 and 0.55 is NOT enough. If you use a logarithmic re-expression, the correct model is not just $\hat{y} = 1.2 + 0.55x$; it is $\log(\hat{y}) = 1.2 + 0.55x$.
You need to know that this model represents exponential growth.
You must be able to make predictions from the equation.
Here, start with a value of x = 2 and find $\log(\hat{y}) = 1.2 + 0.55(2) = 2.3$.
Now “backsolve” to get $\hat{y} = 10^{2.3} = 199.526 \approx 200$.
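The backsolving step can be checked with a short Python sketch. It uses the coefficients 1.2 and 0.55 from the example and assumes a base-10 logarithm, as the power of 10 in the backsolve implies; the function name `predict_original_units` is hypothetical.

```python
def predict_original_units(x, intercept=1.2, slope=0.55):
    """Predict y-hat in original units from the model log10(y-hat) = intercept + slope * x."""
    log_y_hat = intercept + slope * x   # prediction on the re-expressed (log) scale
    return 10 ** log_y_hat              # undo the base-10 logarithm

y_hat = predict_original_units(2)
print(round(y_hat, 3))  # about 199.526, which rounds to 200
```

Reporting 2.3 as the prediction would be a mistake: 2.3 is the predicted value of log(y), not of y.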
Equivalent Models
Type of Model    Re-expression Equation        Calculator’s Command    Curve Equation
Exponential      $\log y = a + bx$             ExpReg                  $y = ab^x$
Power            $\log y = a + b \log x$       PwrReg                  $y = ax^b$
[Figure: graphs of an exponential function, a logarithmic function, and a power function]
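The two equation columns describe the same model, but the constants are not the same numbers: the calculator's curve constants come from undoing the logarithm in the linear fit. A minimal Python sketch of that conversion, assuming base-10 logs and using hypothetical function names, with the coefficients 1.2 and 0.55 borrowed from the earlier example:

```python
import math

def exponential_from_log_fit(a, b):
    """Convert log10(y) = a + b*x into y = A * B**x and return (A, B)."""
    return 10 ** a, 10 ** b

def power_from_loglog_fit(a, b):
    """Convert log10(y) = a + b*log10(x) into y = A * x**b and return (A, b)."""
    return 10 ** a, b

# Exponential case: both forms give the same prediction at x = 2.
A, B = exponential_from_log_fit(1.2, 0.55)
print(A * B ** 2, 10 ** (1.2 + 0.55 * 2))

# Power case: both forms give the same prediction at x = 100.
A2, p = power_from_loglog_fit(1.2, 0.55)
print(A2 * 100 ** p, 10 ** (1.2 + 0.55 * math.log10(100)))
```

Each printed pair should agree up to floating-point rounding, which is the sense in which the models are equivalent.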
For example, consider the relationship between the weight of cars (in pounds) and their fuel efficiency (miles per gallon).
[Scatterplot: looks fairly linear at first]
If we take the reciprocal of the y-values (as gallons per hundred miles), we get the following scatterplot and residual plot and eliminate the bend in the original scatterplot.
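A small Python sketch of the reciprocal re-expression, and of reversing it to get back to the original units. The mpg values are hypothetical and the function names are introduced here only for illustration.

```python
def gallons_per_100_miles(mpg):
    """Reciprocal re-expression of fuel efficiency, scaled to 100 miles."""
    return 100 / mpg

def mpg_from_gallons_per_100_miles(gal_per_100):
    """Reverse the re-expression to return to miles per gallon."""
    return 100 / gal_per_100

# Hypothetical fuel-efficiency values, for illustration only.
for mpg in (20, 25, 40):
    print(mpg, "mpg ->", gallons_per_100_miles(mpg), "gal per 100 miles")
```

A prediction made on the gallons-per-hundred-miles scale has to go through the second function before it can be reported in miles per gallon.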
Goals of Re-expression
There are several reasons we may want to re-express our data:
1) To make the distribution of a variable more symmetric.
2) To make the spreads of several groups more alike.
3) To make the form of a scatterplot more linear.
4) To make the scatter in a scatterplot more evenly spread.
Goal 1: Make the distribution of a variable (as seen in its histogram, for example) more symmetric.
Goal 2: Make the spread of several groups (as seen in side-by-side boxplots) more alike (not following a fan shape), even if their centers differ.
Goal 3: Make the form of a scatterplot more nearly linear.
Goal 4: Make the scatter in a scatterplot spread out evenly rather than thickening at one end. This can be seen in the two scatterplots we just saw with Goal 3.
Groups that share a common spread are easier to compare.
REMEMBER: The model won’t be perfect, but the re-expression can lead us to a useful model.
You should recognize when the pattern of the data indicates that no re-expression can improve the structure of the data.
You should be able to show how to re-express data with powers and how to find an effective re-expression for your data using the calculator.
You should be able to reverse any of the common re-expressions to put a predicted value or residual back into original units.
You should be able to describe a summary or display of a re-expressed variable and clearly indicate how it was re-expressed and give its re-expressed units.
You should be able to describe a regression model fit to re-expressed data in terms of the re-expressed variables.
PRACTICE
Consider the population growth in the US.
[Scatterplot of (x, log y)]
Now we use our Ladder of Powers. First we’ll try the zero power, the logarithm of the population. We start there because we suspect that population might increase by a roughly equal percentage each year (and hence that growth is exponential), or simply because it’s a good place to start if we’re not sure what to do.
The change in the new scatterplot is dramatic. The scatterplot of log(population) and year still has a curve, and now it bends the other way. This is a clear indication that we have gone too far on the Ladder of Powers and should retreat toward the original data (the “1” rung). That suggests the 1/2 power, so we find the square roots of the populations and plot them against the years.
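The search along the Ladder of Powers can be sketched in Python. Everything here is an assumption made for illustration: the data are made up (built so that the square root is exactly linear, not real census figures), the helper names are hypothetical, and the correlation is used only as a rough numeric stand-in for looking at the scatterplot and residual plot.

```python
import math

def re_express(values, power):
    """Apply one rung of the Ladder of Powers; the 0 rung is the base-10 logarithm."""
    if power == 0:
        return [math.log10(v) for v in values]
    return [v ** power for v in values]

def correlation(xs, ys):
    """Pearson correlation r between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: y grows like a square, so the 1/2 rung should straighten it.
xs = list(range(10))
ys = [(2 + 0.5 * x) ** 2 for x in xs]

# Try the "1" rung (original data), the 1/2 rung, and the 0 rung (log).
for power in (1, 0.5, 0):
    r = correlation(xs, re_express(ys, power))
    print("power", power, "r =", round(r, 4))
```

With these values the 1/2 rung gives the straightest relationship, while the log overshoots, which mirrors the retreat described above. A high r does not replace checking the residuals for a remaining bend.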
[Scatterplot of (x, sqrt(y))]
The scatterplot of sqrt(population) and year is still a bit curved, but straight enough to fit a line.
The model
During a science lab, students heated water, allowed it to cool, and recorded the temperature over time. They computed the difference between the water temperature and the room temperature. The results are in the table.
1) Sketch a scatterplot.
2) Newton’s Law of Cooling suggests an exponential function is appropriate. Re-express the data using logarithms and sketch a new scatterplot of (x, log y).
3) Write the equation of the least-squares regression line for the transformed data. Draw the regression line on the scatterplot in question 2.
4) Use the equation to predict the difference in temperature after 45 minutes.
log y x 10 x y
2.36 cp
PROPERTIES OF LOGARITHMS:
1) $\log(AB) = \log A + \log B$
2) $\log\left(\frac{A}{B}\right) = \log A - \log B$
3) $\log x^p = p \log x$
CONCLUSIONS:
If the graph of log y vs. x is linear, then the graph of y vs. x is exponential.
If the graph of y vs. x is exponential, then the graph of log y vs. x is linear.
Once we have linearized our data, we can use least-squares regression on the transformed data to find the best fitting linear model.
Let’s Try It! (p. 233)
Shutter speed and f/stop of the lens
Shutter speed   1/1000  1/500  1/250  1/125  1/60  1/30  1/15  1/8
f/stop          2.8     4      5.6    8      11    16    22    32
L1: shutter speed
L2: f/stop
Curved stat plot. Try logarithms:
Take log of L1→L3
Take log of L2→L4
Scatterplot #1: Xlist→L3, Ylist→L2
Scatterplot #2: Xlist→L1, Ylist→L4
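The same calculator steps can be mirrored in Python using the shutter speed and f/stop values from the table. This sketch goes one step further than the two scatterplots listed and fits a line to (log x, log y), which is an assumption about where the exercise is heading; the list names echo L1 to L4, `fit_line` is a hypothetical helper, and base-10 logs are assumed.

```python
import math

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Data from the table above.
L1 = [1/1000, 1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8]  # shutter speed
L2 = [2.8, 4, 5.6, 8, 11, 16, 22, 32]                      # f/stop

L3 = [math.log10(x) for x in L1]  # log of shutter speed
L4 = [math.log10(y) for y in L2]  # log of f/stop

# Fit log(f/stop) against log(shutter speed).
intercept, slope = fit_line(L3, L4)
print("log(f/stop) =", round(intercept, 3), "+", round(slope, 3), "* log(shutter speed)")

# Reverse the transformation to get a power model: f/stop = A * speed**slope.
A = 10 ** intercept
print("f/stop =", round(A, 2), "* speed **", round(slope, 3))
```

The slope comes out close to one half, so in this fit the f/stop grows roughly like the square root of the shutter speed.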
PRACTICE:
Then, do a reverse transformation to turn the linear equation back into an exponential equation.
Compare this to the equation the calculator gives when performing exponential regression on the Case 1 data.
Linearizing Power Functions:
(We want to write a power function of the form $y = ax^b$ as a function of the form $y = a + bx$.)
$y = ax^b$ ($x$, $y$ are variables and $a$, $b$ are constants)
Taking the logarithm of both sides gives $\log y = \log a + b \log x$.
This is in the general form y = a + bx, which is linear.
So, the graph of (log x, log y) (var1, var2) is linear.
This means the graph of log y vs. log x is linear.
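The linearization of a power function can be written out step by step using the properties of logarithms. In this sketch the symbols $X$, $Y$, and $A$ are introduced only for illustration, and base-10 logarithms are assumed.

```latex
\begin{align*}
y &= a x^{b} \\
\log y &= \log\left(a x^{b}\right) \\
       &= \log a + \log\left(x^{b}\right) && \text{(log of a product)} \\
       &= \log a + b \log x && \text{(log of a power)}
\end{align*}
With $Y = \log y$, $X = \log x$, and $A = \log a$, this is $Y = A + bX$,
a line with intercept $A$ and slope $b$. Reversing the transformation:
\begin{align*}
\log y = A + b \log x
\;\Longrightarrow\;
y = 10^{A + b \log x} = 10^{A} \left(10^{\log x}\right)^{b} = 10^{A} x^{b}
\end{align*}
```

So the slope of the log-log line becomes the exponent of the power model, and 10 raised to the intercept becomes its leading constant.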
Then, do a reverse transformation to turn the linear equation back into a power equation.
Compare this to the equation the calculator gives when performing power regression on the Case 2 data.
A re-expression may make the:
Distribution of a variable more symmetric.
Spread across different groups more similar.
Form of a scatterplot straighter.
Scatter around the line in a scatterplot more consistent.
Simpler explanations and simpler models tend to give a true picture of the relationship.
The log-log approach can help us find a good re-expression.
Our models won’t be perfect, but re-expression can lead us to a useful model.